Deployment with Docker and Triton

optimum-nvidia is designed for high-performance deployment. The two primary options are running your application inside the self-contained huggingface/optimum-nvidia Docker container, or serving exported engines with NVIDIA's Triton Inference Server when maximum throughput is required.

Deploying with Docker

The pre-built Docker container huggingface/optimum-nvidia is the most straightforward way to deploy an application. It bundles all necessary dependencies, including the CUDA toolkit, TensorRT-LLM, and the optimum-nvidia library.

You can use this container as a base for your own application's Dockerfile:

# Start from the official optimum-nvidia image
FROM huggingface/optimum-nvidia:latest

# Copy your application code
WORKDIR /app
COPY . .

# Install any additional application dependencies
RUN pip install -r requirements.txt

# Define the command to run your application
CMD ["python", "my_inference_server.py"]

This approach ensures a consistent and reproducible environment for your deployed model.
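
Building and running the resulting image then follows the standard Docker workflow. The image tag and the port below are illustrative placeholders; expose whichever port my_inference_server.py actually listens on:

docker build -t my-optimum-app .
docker run --rm --gpus all -p 8080:8080 my-optimum-app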

Deploying with Triton Inference Server

For production-grade, high-throughput scenarios, deploying with NVIDIA Triton Inference Server is recommended. optimum-nvidia can export models to a format directly compatible with Triton's tensorrtllm_backend.

The project includes templates for setting up a Triton model repository at templates/inference-endpoints/.
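
As a rough sketch, the export step has the following shape. The exact arguments vary between optimum-nvidia releases, so consult optimum-cli export trtllm --help for the options supported by your installation; the model ID below is a placeholder:

# Sketch only – check optimum-cli export trtllm --help for the exact arguments.
optimum-cli export trtllm meta-llama/Llama-2-7b-chat-hf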

Triton Model Repository Structure

A typical Triton model repository for an LLM contains three models, tied together by an ensemble model (text-generation in the layout below) that exposes a single end-to-end endpoint:

  1. preprocessing: A Python backend model that tokenizes the input text.
  2. llm: The core tensorrtllm backend model that runs the TensorRT-LLM engine.
  3. postprocessing: A Python backend model that decodes the output token IDs back to text.

The resulting repository layout looks like this:

model_repository/
├── llm/
│   ├── 1/
│   │   ├── rank0.engine
│   │   └── config.json
│   └── config.pbtxt
├── preprocessing/
│   ├── 1/
│   │   └── model.py
│   └── config.pbtxt
├── postprocessing/
│   ├── 1/
│   │   └── model.py
│   └── config.pbtxt
└── text-generation/  (Ensemble)
    ├── 1/
    └── config.pbtxt
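
Most serving behaviour is controlled by llm/config.pbtxt. The excerpt below is a hedged sketch following the conventions of the tensorrtllm_backend templates; verify the field and parameter names against the template you copied, and adjust the values and paths for your deployment:

backend: "tensorrtllm"
max_batch_size: 64

# Decoupled mode is required for token-by-token streaming
model_transaction_policy {
  decoupled: true
}

# Points the backend at the directory containing the built engine(s)
parameters: {
  key: "gpt_model_path"
  value: { string_value: "/repository/llm/1" }
}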
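
For orientation, here is a minimal sketch of a preprocessing model.py written against Triton's Python backend API. The tensor names (TEXT, INPUT_IDS) and the tokenizer path are assumptions made for illustration and must match what your config.pbtxt declares; the templates in templates/inference-endpoints/ are the authoritative starting point:

import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Placeholder path: load the tokenizer that matches your engine.
        self.tokenizer = AutoTokenizer.from_pretrained("/repository/preprocessing/tokenizer")

    def execute(self, requests):
        responses = []
        for request in requests:
            # "TEXT" is an assumed input name declared in config.pbtxt.
            text = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            prompt = text.reshape(-1)[0].decode("utf-8")

            # Tokenize and hand the token IDs to the llm model downstream.
            input_ids = self.tokenizer(prompt, return_tensors="np").input_ids.astype(np.int32)

            # "INPUT_IDS" is an assumed output name declared in config.pbtxt.
            out = pb_utils.Tensor("INPUT_IDS", input_ids)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses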

Steps for Triton Deployment

  1. Export the Engine: Use optimum-cli export trtllm ... to build your TensorRT-LLM engine(s). Make sure to save them to a known location.

  2. Set up the Repository: Copy the templates from templates/inference-endpoints/ to create your model repository. Place your exported engine(s) and config.json inside the llm/1/ directory.

  3. Configure the llm model: Edit llm/config.pbtxt to point to your model's location and configure parameters like max_batch_size and decoupled mode for streaming.

  4. Run Triton: Use the nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 Docker container (or a newer compatible release) to serve the model repository, for example:

docker run --rm --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/model_repository:/repository \
  nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 \
  tritonserver --model-repository=/repository
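
Once Triton reports the models as ready, you can smoke-test the ensemble over HTTP. The request below assumes the ensemble is named text-generation as in the layout above and that it exposes the text_input and max_tokens parameters used by the standard tensorrtllm_backend templates; adjust the names to whatever your configs define (the generate endpoint requires a recent Triton release):

curl -X POST localhost:8000/v2/models/text-generation/generate \
  -d '{"text_input": "What is TensorRT-LLM?", "max_tokens": 64}'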

This setup leverages Triton's advanced features like dynamic batching and concurrent model execution to achieve maximum performance.