# Deployment with Docker and Triton
`optimum-nvidia` is designed for high-performance deployment. The primary methods are using a self-contained Docker container or deploying engines with NVIDIA's Triton Inference Server for maximum throughput.
## Deploying with Docker
The pre-built Docker container `huggingface/optimum-nvidia` is the most straightforward way to deploy an application. It bundles all necessary dependencies, including the CUDA toolkit, TensorRT-LLM, and the `optimum-nvidia` library.
You can use this container as a base for your own application's Dockerfile:
```dockerfile
# Start from the official optimum-nvidia image
FROM huggingface/optimum-nvidia:latest

# Copy your application code
WORKDIR /app
COPY . .

# Install any additional application dependencies
RUN pip install -r requirements.txt

# Define the command to run your application
CMD ["python", "my_inference_server.py"]
```
This approach ensures a consistent and reproducible environment for your deployed model.
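To build and run the resulting image, something like the following works; the image tag and port mapping are placeholders for whatever your application actually uses:

```bash
# Build the application image on top of the optimum-nvidia base
docker build -t my-inference-app .

# Run with GPU access; the port mapping depends on what my_inference_server.py exposes
docker run --rm --gpus all -p 8080:8080 my-inference-app
```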
## Deploying with Triton Inference Server
For production-grade, high-throughput scenarios, deploying with NVIDIA Triton Inference Server is recommended. `optimum-nvidia` can export models to a format directly compatible with Triton's `tensorrtllm_backend`.
The project includes templates for setting up a Triton model repository at `templates/inference-endpoints/`.
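As a rough sketch, assuming the templates mirror the repository layout shown below, assembling a repository can be as simple as copying them and dropping in your exported engine (all paths here are illustrative):

```bash
# Start a model repository from the bundled templates
mkdir -p model_repository
cp -r templates/inference-endpoints/* model_repository/

# Place the exported engine and its config.json next to the llm model
cp /path/to/engines/rank0.engine /path/to/engines/config.json model_repository/llm/1/
```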
### Triton Model Repository Structure
A typical Triton repository for an LLM consists of three models that form an ensemble:
- `preprocessing`: A Python backend model that tokenizes the input text.
- `llm`: The core `tensorrtllm` backend model that runs the TensorRT-LLM engine.
- `postprocessing`: A Python backend model that decodes the output token IDs back to text.
On disk, the repository looks like this:
```
model_repository/
├── llm/
│   ├── 1/
│   │   ├── rank0.engine
│   │   └── config.json
│   └── config.pbtxt
├── preprocessing/
│   ├── 1/
│   │   └── model.py
│   └── config.pbtxt
├── postprocessing/
│   ├── 1/
│   │   └── model.py
│   └── config.pbtxt
└── text-generation/   (ensemble)
    ├── 1/
    └── config.pbtxt
```
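The `preprocessing` and `postprocessing` entries use Triton's Python backend. As a minimal sketch of what the tokenization side can look like (the shipped templates are the reference; the tensor names, dtypes, and tokenizer ID below are assumptions), a `preprocessing/1/model.py` might be:

```python
# Sketch of a Triton Python-backend preprocessing model.
# Tensor names ("text_input", "input_ids"), dtypes, and the tokenizer ID are assumptions;
# the templates in templates/inference-endpoints/ are the authoritative reference.
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Load the tokenizer once per model instance (model ID is illustrative).
        self.tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

    def execute(self, requests):
        responses = []
        for request in requests:
            # Incoming prompts arrive as a [batch, 1] BYTES tensor.
            text = pb_utils.get_input_tensor_by_name(request, "text_input").as_numpy()
            prompts = [t[0].decode("utf-8") for t in text]

            # Tokenize and pass the token IDs on to the tensorrtllm model.
            ids = self.tokenizer(prompts, padding=True, return_tensors="np")["input_ids"]
            out = pb_utils.Tensor("input_ids", ids.astype(np.int32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```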
### Steps for Triton Deployment
1. **Export the Engine**: Use `optimum-cli export trtllm ...` to build your TensorRT-LLM engine(s), and make sure to save them to a known location.
2. **Set up the Repository**: Copy the templates from `templates/inference-endpoints/` to create your model repository, then place your exported engine(s) and `config.json` inside the `llm/1/` directory.
3. **Configure the `llm` model**: Edit `llm/config.pbtxt` to point to your model's location and configure parameters like `max_batch_size` and `decoupled` mode for streaming.
4. **Run Triton**: Use the `nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3` Docker container (or a newer compatible version) to serve your model repository:
```bash
docker run --rm --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/model_repository:/repository \
  nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 \
  tritonserver --model-repository=/repository
```
This setup leverages Triton's advanced features like dynamic batching and concurrent model execution to achieve maximum performance.
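Once the server is running, clients can call the `text-generation` ensemble over HTTP. The sketch below uses `tritonclient`; the tensor names (`text_input`, `max_tokens`, `text_output`) follow common `tensorrtllm_backend` ensemble conventions and may differ in your templates:

```python
# Sketch of an HTTP client call against the "text-generation" ensemble.
# Tensor names are assumptions based on common tensorrtllm_backend conventions.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# The prompt goes in as a [1, 1] BYTES tensor, alongside a generation-length cap.
text = httpclient.InferInput("text_input", [1, 1], "BYTES")
text.set_data_from_numpy(np.array([["What is TensorRT-LLM?"]], dtype=object))

max_tokens = httpclient.InferInput("max_tokens", [1, 1], "INT32")
max_tokens.set_data_from_numpy(np.array([[64]], dtype=np.int32))

result = client.infer("text-generation", inputs=[text, max_tokens])
print(result.as_numpy("text_output"))
```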