Core Concepts

Understanding a few core concepts will help you get the most out of optimum-nvidia.

The Conversion Workflow: From Transformers to TensorRT-LLM

The primary role of optimum-nvidia is to bridge the gap between standard Hugging Face models and the high-performance TensorRT-LLM inference engine. This involves a conversion and build process that happens automatically when you call from_pretrained().

The workflow is as follows:

  1. Hugging Face Model: You start with a standard transformers model, either from the Hugging Face Hub or a local directory.

  2. TensorRT-LLM Checkpoint: The library first converts the model weights into an intermediate format known as a TensorRT-LLM checkpoint. This format is a directory containing the model's weights and configuration tailored for TensorRT-LLM. If you are applying quantization, this step also includes the calibration process.

  3. TensorRT-LLM Engine: The checkpoint is then used to build a highly optimized, hardware-specific inference engine (.engine file). This engine is compiled for your exact GPU architecture (e.g., Ada Lovelace, Hopper), which is why it's so fast. This step can take some time, but it only needs to be done once for a given model and hardware configuration.
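
In practice, this entire workflow is driven by a single call. The snippet below is a minimal sketch of that flow: the model id is only an example, and the generate/decode lines follow the standard transformers API, whose exact return shapes may vary slightly between optimum-nvidia versions. The first run may take several minutes while the engine is compiled.

from transformers import AutoTokenizer
from optimum.nvidia import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example model; any supported causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Triggers the convert -> checkpoint -> engine build pipeline if no engine is cached yet.
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("What is TensorRT-LLM?", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])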

The from_pretrained Magic

When you call AutoModelForCausalLM.from_pretrained(model_id, ...), a lot happens behind the scenes:

  1. Cache Check: The library first checks a local cache to see if a compatible TensorRT-LLM engine has already been built for your model and GPU.
  2. Hub Check: If no local engine is found, it queries the Hugging Face Hub for pre-built engines for your specific GPU architecture. Many popular models have pre-built engines available, saving you the compilation time.
  3. On-the-Fly Build: If no pre-built engine is found, optimum-nvidia downloads the original transformers model weights and builds the TensorRT-LLM engine on your machine. The resulting engine is then cached for future use.

This process ensures that you always get an optimized engine with minimal effort, whether it's pre-built or compiled just-in-time.
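
Because the engine is cached after the first build, subsequent loads of the same model on the same GPU are much faster. A quick way to observe this, sketched below with an example model id, is to time two consecutive calls:

import time
from optimum.nvidia import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example model id

start = time.perf_counter()
model = AutoModelForCausalLM.from_pretrained(model_id)  # may compile the engine on the first run
print(f"first load:  {time.perf_counter() - start:.1f}s")

del model

start = time.perf_counter()
model = AutoModelForCausalLM.from_pretrained(model_id)  # served from the local engine cache
print(f"second load: {time.perf_counter() - start:.1f}s")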

The Workspace

All artifacts generated by optimum-nvidia—including TensorRT-LLM checkpoints and engines—are stored in a dedicated workspace directory. This workspace is managed by huggingface_hub's caching system to ensure that builds are reusable.

You can locate this directory within your Hugging Face cache folder, typically at ~/.cache/huggingface/hub/assets/trtllm/. The path is structured to be unique for each model, TensorRT-LLM version, and target device.

For example, the workspace for Llama-2-7b on an NVIDIA H100 might look like this:

~/.cache/huggingface/hub/assets/trtllm/
└── v0.16.0/
    └── meta-llama--Llama-2-7b-chat-hf/
        └── NVIDIA-H100-PCIe/
            ├── checkpoints/
            │   └── rank0.safetensors
            └── engines/
                ├── config.json
                ├── generation_config.json
                └── rank0.engine

Understanding the workspace is useful when you want to manage cached builds or use the optimum-cli to export engines to a custom location.
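
For instance, to see which engines have already been built on a machine, you can walk the workspace directory directly. The sketch below assumes the default cache layout shown in the tree above and uses huggingface_hub's HF_HUB_CACHE constant to locate the cache root:

from pathlib import Path
from huggingface_hub.constants import HF_HUB_CACHE

# Default workspace root, matching the layout shown in the example tree above.
workspace = Path(HF_HUB_CACHE) / "assets" / "trtllm"

# List every built engine, grouped by TensorRT-LLM version / model / device.
for engine in sorted(workspace.rglob("*.engine")):
    print(engine.relative_to(workspace))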