# Guide: Using Pipelines

The `optimum.nvidia.pipeline` function offers the simplest, highest-level API for running inference. It abstracts away tokenization, model inference, and decoding, making it easy to get started.
## Text Generation Pipeline

The most common use case is the `text-generation` pipeline:
```python
from optimum.nvidia import pipeline

# This single line handles model download, conversion, and engine build
pipe = pipeline("text-generation", "meta-llama/Llama-2-7b-chat-hf")

# The pipeline automatically handles tokenization and decoding
result = pipe("What is the latest generation of Nvidia's GPUs?", max_new_tokens=128)
print(result[0]['generated_text'])
```
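The pipeline is intended as a drop-in replacement for its `transformers` counterpart, so familiar call patterns should carry over. As a sketch, assuming it mirrors the `transformers` batched-call semantics, you can also pass a list of prompts:

```python
# Batched prompts: a list in, one result list per prompt out
# (assuming transformers-style pipeline semantics).
prompts = [
    "What is TensorRT-LLM?",
    "Summarize the benefits of FP8 quantization.",
]
results = pipe(prompts, max_new_tokens=64)
for prompt_results in results:
    print(prompt_results[0]["generated_text"])
```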
## How It Works

When you create a pipeline, `optimum-nvidia`:

- Identifies the model architecture (`llama`, `gemma`, etc.).
- Instantiates the appropriate `AutoModelForCausalLM` class for that architecture.
- Calls the model's `from_pretrained()` method, which triggers the automated TensorRT-LLM engine search or build process.
- Wraps the optimized model and its tokenizer in a `TextGenerationPipeline` object.
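Conceptually, that flow is close to performing the steps yourself. The following is a minimal sketch rather than the pipeline's actual internals; it assumes `optimum-nvidia`'s `AutoModelForCausalLM` together with a standard `transformers` tokenizer, and note that the exact return value of `generate()` has differed across releases (some return a `(token_ids, lengths)` tuple):

```python
# Illustrative approximation of the steps above, not the pipeline's internals.
from transformers import AutoTokenizer
from optimum.nvidia import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"

# Architecture resolution and the TensorRT-LLM engine search/build both
# happen inside from_pretrained()
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# What the TextGenerationPipeline wrapper does for you: tokenize, generate, decode
inputs = tokenizer("What is the latest generation of Nvidia's GPUs?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
# NOTE: some optimum-nvidia releases return a (token_ids, lengths) tuple here
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```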
## Customizing Pipeline Creation

You can pass model-specific arguments directly to the `pipeline` function. These are forwarded to the `AutoModelForCausalLM.from_pretrained()` call. For example, to enable FP8 quantization and specify a tensor parallelism degree of 2:
```python
pipe = pipeline(
    "text-generation",
    "meta-llama/Llama-2-7b-chat-hf",
    use_fp8=True,  # Enable FP8 quantization
    tp=2,          # Use 2-way tensor parallelism
)
```
## Customizing Generation

Arguments controlling text generation can be passed directly when calling the pipeline (`__call__`):
```python
result = pipe(
    "What is the meaning of life?",
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_k=50,
)
```
## Benchmarking Pipelines

For performance analysis, `optimum-nvidia` includes a benchmarking script that demonstrates how to measure throughput and latency. This script is a great reference for setting up performance tests. You can find it at `scripts/benchmark_pipelines.py`.
Example usage:
```bash
python scripts/benchmark_pipelines.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --batch-size 8 \
    --prompt-length 512 \
    --output-length 1024 \
    --use-fp8
```
This script provides a standardized way to compare the performance of `optimum-nvidia` against a baseline `transformers` pipeline.
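If you only need a quick sanity check rather than the full script, a hand-rolled measurement can look like the following minimal sketch. It reuses the `pipe` object built earlier; the prompt, warmup count, and run count are arbitrary choices, and the throughput estimate assumes each call generates the full `max_new_tokens`:

```python
# Minimal, illustrative timing loop for a pipeline built as shown above.
# Warmup and run counts are arbitrary; the throughput figure assumes every
# call generates exactly `max_new_tokens` tokens, which is an approximation.
import time

prompt = "What is the latest generation of Nvidia's GPUs?"
max_new_tokens = 128

# Warm up so one-time initialization is excluded from the measurement
for _ in range(3):
    pipe(prompt, max_new_tokens=max_new_tokens)

runs = 10
start = time.perf_counter()
for _ in range(runs):
    pipe(prompt, max_new_tokens=max_new_tokens)
elapsed = time.perf_counter() - start

print(f"avg latency:     {elapsed / runs:.3f} s/request")
print(f"est. throughput: {runs * max_new_tokens / elapsed:.1f} tokens/s")
```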