Guide: Exporting Models with optimum-cli

While optimum-nvidia can build TensorRT-LLM engines on the fly when a model is loaded, you may want to create standalone, portable engine files for deployment or distribution. The optimum-cli tool provides a dedicated command for this purpose.
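
For context, this is what the on-the-fly path looks like. The snippet below is a minimal sketch, assuming the drop-in AutoModelForCausalLM class that optimum-nvidia exposes; the engines are compiled for the local GPU the first time the model is loaded:

# On-the-fly build: engines are compiled for the local GPU during from_pretrained,
# rather than exported to a standalone folder.
from optimum.nvidia import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")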

The export trtllm Command

The export trtllm command handles the full conversion pipeline, from a Hugging Face checkpoint to optimized TensorRT-LLM engines, and saves the output to a specified directory.

Basic Usage

The basic syntax is:

optimum-cli export trtllm <model_id> [options]

For example, to export meta-llama/Llama-2-7b-chat-hf to a local folder named ./llama-2-trtllm:

optimum-cli export trtllm meta-llama/Llama-2-7b-chat-hf --destination ./llama-2-trtllm

This will create a directory structure inside ./llama-2-trtllm containing the built engines for your GPU architecture.
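
Once the export finishes, you can typically load that folder back without triggering a rebuild. The snippet below is a minimal sketch, assuming from_pretrained accepts a local directory of built engines and that generate follows the usual transformers-style API:

# Reload the exported engines from disk; no new build should be needed.
from optimum.nvidia import AutoModelForCausalLM
from transformers import AutoTokenizer

# The tokenizer is still fetched from the original Hugging Face repository.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("./llama-2-trtllm")

inputs = tokenizer("What is TensorRT-LLM?", return_tensors="pt").to("cuda")
generated = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))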

Key Export Arguments

The command offers several options to customize the build; most of them map directly to fields of the ExportConfig class.

  • --destination <path>: Folder where the resulting exported engines will be stored. If not provided, it defaults to the Hugging Face cache.
  • --max-batch-size <int>: Maximum number of concurrent requests the model can process. Defaults to 1.
  • --max-input-length <int>: Maximum sequence length for the prompt.
  • --max-output-length <int>: Maximum total sequence length (prompt + generated tokens) the model supports.
  • --tp <int>: The tensor parallelism degree. The model will be sharded across this many GPUs. Defaults to 1.
  • --pp <int>: The pipeline parallelism degree. Defaults to 1.
  • --dtype <str>: The data type to use (e.g., float16, bfloat16). Defaults to auto.
  • --quantization <path>: Path to a Python file containing a custom ModelOptRecipe for quantization.
  • --push-to-hub <repo_id>: Repository ID to push the generated engines to on the Hugging Face Hub.
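
Exact option names and default values can differ between optimum-nvidia releases, so it is worth confirming them against the CLI installed in your environment:

optimum-cli export trtllm --help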

Example: Multi-GPU Export with Quantization

Imagine you want to export Mixtral-8x7B to run on 2 GPUs with tensor parallelism and apply a custom quantization recipe.

First, your quantization recipe file (my_recipe.py) would look something like this:

# my_recipe.py
# MyCustomAWQRecipe is a hypothetical class (from an equally hypothetical
# library) that inherits from optimum.nvidia.compression.modelopt.ModelOptRecipe
# and describes how the model should be quantized.
from my_quantization_library import MyCustomAWQRecipe

# Expose the recipe so the exporter can pick it up from this file.
TARGET_QUANTIZATION_RECIPE = MyCustomAWQRecipe

Then, you would run the export command:

optimum-cli export trtllm mistralai/Mixtral-8x7B-v0.1 \
  --destination ./mixtral-tp2-awq \
  --tp 2 \
  --max-batch-size 16 \
  --max-input-length 2048 \
  --max-output-length 4096 \
  --quantization ./my_recipe.py

This command will produce a directory ./mixtral-tp2-awq containing two engine files (rank0.engine, rank1.engine) and the necessary configuration files, ready for deployment.
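
Because the engines were built with --tp 2, they are sharded across two ranks, so the machine that loads them generally needs two visible GPUs. Beyond that, the folder is loaded the same way as the single-GPU export shown earlier.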