# Guide: Exporting Models with optimum-cli
While `optimum-nvidia` can build TensorRT-LLM engines on the fly, you may want to create standalone, portable engine files for deployment or distribution. The `optimum-cli` tool provides a dedicated command for this purpose.
## The `export trtllm` Command
The export command handles the full conversion pipeline—from a Hugging Face model to optimized TensorRT-LLM engines—and saves the output to a specified directory.
### Basic Usage

The basic syntax is:

```bash
optimum-cli export trtllm <model_id> [options]
```
For example, to export `meta-llama/Llama-2-7b-chat-hf` to a local folder named `./llama-2-trtllm`:

```bash
optimum-cli export trtllm meta-llama/Llama-2-7b-chat-hf --destination ./llama-2-trtllm
```

This creates a directory structure inside `./llama-2-trtllm` containing the engines built for your GPU architecture.
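The exact layout depends on your optimum-nvidia and TensorRT-LLM versions, but for a single-GPU build it typically looks roughly like the sketch below (file names are illustrative, inferred from the standard TensorRT-LLM build output described at the end of this guide):

```
llama-2-trtllm/
├── config.json    # build and engine configuration
└── rank0.engine   # serialized TensorRT-LLM engine (one file per rank)
```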
### Key Export Arguments

The command offers several options to customize the build, derived from the `ExportConfig` class (a combined invocation is shown after the list):
- `--destination <path>`: Folder where the resulting exported engines will be stored. If not provided, it defaults to the Hugging Face cache.
- `--max-batch-size <int>`: Maximum number of concurrent requests the model can process. Defaults to `1`.
- `--max-input-length <int>`: Maximum sequence length for the prompt.
- `--max-output-length <int>`: Maximum total sequence length (prompt + generated tokens) the model supports.
- `--tp <int>`: The tensor parallelism degree. The model will be sharded across this many GPUs. Defaults to `1`.
- `--pp <int>`: The pipeline parallelism degree. Defaults to `1`.
- `--dtype <str>`: The data type to use (e.g., `float16`, `bfloat16`). Defaults to `auto`.
- `--quantization <path>`: Path to a Python file containing a custom `ModelOptRecipe` for quantization.
- `--push-to-hub <repo_id>`: Repository ID to push the generated engines to on the Hugging Face Hub.
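For instance, combining a few of these flags, the following invocation exports the engines in `bfloat16` with a larger batch size and pushes them to the Hub (the repository ID `my-org/llama-2-7b-trtllm` is a placeholder):

```bash
optimum-cli export trtllm meta-llama/Llama-2-7b-chat-hf \
  --destination ./llama-2-bf16 \
  --dtype bfloat16 \
  --max-batch-size 8 \
  --push-to-hub my-org/llama-2-7b-trtllm
```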
## Example: Multi-GPU Export with Quantization

Imagine you want to export `Mixtral-8x7B` to run on 2 GPUs with tensor parallelism and apply a custom quantization recipe.

First, your quantization recipe file (`my_recipe.py`) would look something like this:
```python
# my_recipe.py
# Contains a recipe class that inherits from
# optimum.nvidia.compression.modelopt.ModelOptRecipe.
from my_quantization_library import MyCustomAWQRecipe  # placeholder import

TARGET_QUANTIZATION_RECIPE = MyCustomAWQRecipe
```
Then, you would run the export command:
```bash
optimum-cli export trtllm mistralai/Mixtral-8x7B-v0.1 \
  --destination ./mixtral-tp2-awq \
  --tp 2 \
  --max-batch-size 16 \
  --max-input-length 2048 \
  --max-output-length 4096 \
  --quantization ./my_recipe.py
```
This command will produce a directory `./mixtral-tp2-awq` containing two engine files (`rank0.engine` and `rank1.engine`, one per tensor-parallel rank) and the necessary configuration files, ready for deployment.
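Once exported, the engines can be reloaded for inference. The snippet below is a minimal sketch, assuming that optimum-nvidia's `AutoModelForCausalLM.from_pretrained` accepts a local directory of prebuilt engines and that `generate` follows the familiar `transformers`-style interface; check the optimum-nvidia documentation for the exact API of your version.

```python
from transformers import AutoTokenizer

from optimum.nvidia import AutoModelForCausalLM

# Assumption: pointing from_pretrained at the export directory reuses the
# prebuilt rank0/rank1 engines instead of rebuilding them.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("./mixtral-tp2-awq")

inputs = tokenizer("Explain tensor parallelism in one sentence.", return_tensors="pt")
# Assumption: generate returns token ids shaped like a transformers output.
generated = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```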