Welcome to Optimum-NVIDIA

Optimum-NVIDIA is the interface between the Hugging Face ecosystem and NVIDIA GPUs, designed to deliver the best possible inference performance. By leveraging NVIDIA TensorRT-LLM under the hood, it allows developers to run Large Language Models (LLMs) at significantly higher speeds (up to 28x faster than standard frameworks), often by changing just a single line of code.

This library provides seamless integration with TensorRT-LLM, enabling you to use familiar Hugging Face APIs such as from_pretrained() and pipeline() to load, convert, and run models optimized for NVIDIA hardware.
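For instance, a text-generation pipeline can be created almost exactly as with transformers. The following is a minimal sketch; the checkpoint name is illustrative, and a CUDA-capable NVIDIA GPU is assumed:

```python
# Swapping the transformers import for the optimum-nvidia one is the only change.
from optimum.nvidia.pipelines import pipeline

# Downloads the checkpoint and builds a TensorRT-LLM engine on first use.
pipe = pipeline("text-generation", "meta-llama/Llama-2-7b-chat-hf")
print(pipe("Explain the benefits of FP8 inference in two sentences."))
```

From the caller's perspective nothing else changes; the engine build happens transparently when the model is first loaded.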

Why Optimum-NVIDIA?

  • Peak Performance: Unlock the full potential of your NVIDIA hardware, including Tensor Core GPUs with FP8 support on Hopper and Ada Lovelace architectures.
  • Ease of Use: Transition from transformers to optimum-nvidia with minimal code changes. The library handles the complex model conversion and engine building process automatically.
  • Hugging Face Hub Integration: Fetch and load optimized, pre-built TensorRT-LLM engines directly from the Hugging Face Hub, or let the library build them on the fly from a standard transformers model checkpoint (see the sketch after this list).
  • Advanced Quantization: Easily apply advanced quantization techniques like FP8 and AWQ to reduce memory footprint and further accelerate inference, all while maintaining a simple, developer-friendly API.
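To make the drop-in claim concrete, here is a minimal sketch of the from_pretrained workflow. The checkpoint name is illustrative, the tokenizer still comes from transformers, and generation is assumed to mirror the transformers API (exact return types may vary between releases):

```python
from transformers import AutoTokenizer
# The only change versus a vanilla transformers script is this import.
from optimum.nvidia import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Fetches the checkpoint from the Hub and builds a TensorRT-LLM engine
# for the local GPU if no pre-built engine is available.
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("What does TensorRT-LLM optimize?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```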

Whether you're looking to speed up a local prototype or deploy a high-throughput inference service, Optimum-NVIDIA provides the tools to make it happen efficiently.

Key Features

  • High-Level APIs: Use AutoModelForCausalLM and pipeline for a familiar, transformers-like experience.
  • Automated Engine Building: The from_pretrained method intelligently handles the conversion of Hugging Face models to TensorRT-LLM engines.
  • FP8 Inference: Built-in support for FP8 quantization on compatible hardware (Hopper, Ada Lovelace) via the use_fp8=True flag, as sketched after this list.
  • CLI for Export: A powerful command-line interface to export models into standalone TensorRT-LLM engines for deployment.
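
As a sketch of the FP8 path described above (the checkpoint name is illustrative, and the flag requires an FP8-capable Hopper or Ada Lovelace GPU):

```python
from optimum.nvidia import AutoModelForCausalLM

# use_fp8=True quantizes the model to FP8 during the engine build,
# cutting memory use and increasing throughput on supported GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # illustrative checkpoint
    use_fp8=True,
)
```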

Ready to get started? Head over to the Installation page.