Guide: Using Pipelines

The optimum.nvidia.pipeline function offers the simplest, highest-level API for running inference. It abstracts away tokenization, model inference, and decoding, making it easy to get started.

Text Generation Pipeline

The most common use case is the text-generation pipeline.

from optimum.nvidia import pipeline

# This single line handles model download, conversion, and engine build
pipe = pipeline("text-generation", "meta-llama/Llama-2-7b-chat-hf")

# The pipeline automatically handles tokenization and decoding
result = pipe("What is the latest generation of Nvidia's GPUs?", max_new_tokens=128)

print(result[0]['generated_text'])

How It Works

When you create a pipeline, optimum-nvidia:

  1. Identifies the model architecture (llama, gemma, etc.).
  2. Instantiates the appropriate AutoModelForCausalLM class for that architecture.
  3. Calls the model's from_pretrained() method, which triggers the automated TensorRT-LLM engine search or build process.
  4. Wraps the optimized model and its tokenizer in a TextGenerationPipeline object (a rough manual equivalent of these steps is sketched below).
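
For reference, the sketch below approximates these steps by hand, combining optimum.nvidia.AutoModelForCausalLM with a standard transformers tokenizer. It assumes generate() follows the transformers convention of returning token ids; exact signatures and return types can vary between optimum-nvidia releases, so treat this as an illustration of what pipeline does for you rather than a drop-in replacement.

from transformers import AutoTokenizer
from optimum.nvidia import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"

# Steps 1-3: pick the right model class and build (or reuse) a TensorRT-LLM engine
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Step 4 is what the pipeline wrapper normally does for you:
# tokenize the prompt, run generation, and decode the output
inputs = tokenizer("What is the latest generation of Nvidia's GPUs?", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=128)

# Assumes generate() returns token ids, as in transformers;
# some releases may return a (token_ids, lengths) tuple instead
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))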

Customizing Pipeline Creation

You can pass model-specific arguments directly to the pipeline function; they are forwarded to the underlying AutoModelForCausalLM.from_pretrained() call. For example, to enable FP8 quantization and use 2-way tensor parallelism:

pipe = pipeline(
    "text-generation", 
    "meta-llama/Llama-2-7b-chat-hf",
    use_fp8=True,       # Enable FP8 quantization
    tp=2                # Use 2-way tensor parallelism
)

Customizing Generation

Arguments that control text generation can be passed directly to the pipeline's __call__:

result = pipe(
    "What is the meaning of life?",
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_k=50
)
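
The pipeline follows the familiar transformers calling convention, so you may also be able to pass a list of prompts and generate for a whole batch in one call. This is a hedged sketch: the batched output layout is assumed to mirror the single-prompt case shown earlier, which may differ across optimum-nvidia versions.

prompts = [
    "What is the meaning of life?",
    "Explain tensor parallelism in one sentence.",
]

# Assumes one result (a list of generations) is returned per input prompt
results = pipe(prompts, max_new_tokens=64, do_sample=True, temperature=0.8)

for res in results:
    print(res[0]["generated_text"])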

Benchmarking Pipelines

For performance analysis, optimum-nvidia includes a benchmarking script that demonstrates how to measure throughput and latency. This script is a great reference for setting up performance tests.

You can find it at scripts/benchmark_pipelines.py.

Example usage:

python scripts/benchmark_pipelines.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --batch-size 8 \
  --prompt-length 512 \
  --output-length 1024 \
  --use-fp8

This script provides a standardized way to compare the performance of optimum-nvidia against a baseline transformers pipeline.
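
If you only need a quick, back-of-the-envelope comparison, a minimal timing loop along the following lines can help. The prompt and helper function here are illustrative; the bundled script above handles warm-up, batching, and token accounting more carefully and should be preferred for real measurements.

import time
from transformers import pipeline as hf_pipeline
from optimum.nvidia import pipeline as trt_pipeline

model_id = "meta-llama/Llama-2-7b-chat-hf"
prompt = "What is the latest generation of Nvidia's GPUs?"

def time_pipeline(pipe, prompt, runs=5):
    # Warm-up call so engines and caches are ready before measuring
    pipe(prompt, max_new_tokens=128)
    start = time.perf_counter()
    for _ in range(runs):
        pipe(prompt, max_new_tokens=128)
    return (time.perf_counter() - start) / runs

baseline = hf_pipeline("text-generation", model_id, device=0)
optimized = trt_pipeline("text-generation", model_id)

print(f"transformers:   {time_pipeline(baseline, prompt):.2f}s per call")
print(f"optimum-nvidia: {time_pipeline(optimized, prompt):.2f}s per call")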