Quick Start Guide

Optimum-NVIDIA is designed to be a drop-in replacement for transformers at inference time, allowing you to accelerate your models with minimal code changes.

Here are two common ways to get started: the high-level pipeline API for simplicity, or the AutoModelForCausalLM class for more control.

Method 1: Using the pipeline API

The pipeline function provides a simple and powerful abstraction for running inference. If you are already using a transformers pipeline, you can switch to optimum-nvidia by changing a single import statement.

# Before: from transformers import pipeline
from optimum.nvidia import pipeline
from huggingface_hub import login

# Llama 2 is a gated model: authenticate with your Hugging Face token to download it
# login("YOUR_HF_TOKEN")

# Initialize the pipeline
# use_fp8=True enables FP8 quantization on GPUs that support it (e.g. Hopper or Ada Lovelace)
pipe = pipeline(
    'text-generation', 
    model='meta-llama/Llama-2-7b-chat-hf', 
    use_fp8=True
)

# Run inference
prompt = "Describe a real-world application of AI in sustainable energy."
outputs = pipe(prompt)

print(outputs)

Behind the scenes, the pipeline function handles:

  1. Downloading the model from the Hugging Face Hub.
  2. Converting the model to a TensorRT-LLM engine on the fly.
  3. Running inference with the optimized engine.
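Because the pipeline is intended as a drop-in replacement, the usual transformers calling conventions should carry over. The sketch below reuses the pipe object created above and assumes the standard text-generation pipeline behavior: batched prompts, a max_new_tokens generation argument, and results returned as lists of dicts with a generated_text key. Treat it as a minimal example under those assumptions, not documented optimum-nvidia API.

# Minimal sketch, assuming the optimum-nvidia pipeline mirrors the standard
# transformers text-generation call signature (batched prompts, max_new_tokens,
# and a list of {"generated_text": ...} candidates per prompt).
prompts = [
    "Describe a real-world application of AI in sustainable energy.",
    "Summarize the benefits of FP8 quantization in one sentence.",
]

results = pipe(prompts, max_new_tokens=128)

for prompt, result in zip(prompts, results):
    # With batched inputs, each entry is a list of candidate generations.
    print(f"Prompt: {prompt}")
    print(f"Output: {result[0]['generated_text']}")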

Method 2: Using AutoModelForCausalLM

For more fine-grained control over generation parameters and model configuration, you can use the AutoModelForCausalLM class. This approach is also a near drop-in replacement for the transformers equivalent.

# Before: from transformers import AutoModelForCausalLM
from optimum.nvidia import AutoModelForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "meta-llama/Llama-2-7b-chat-hf"

# Initialize the tokenizer; decoder-only models need left padding for batched generation
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Initialize the model with performance settings
# The from_pretrained call will handle the conversion to a TensorRT-LLM engine
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    use_fp8=True,             # Enable FP8 quantization
    max_prompt_length=1024,   # Maximum number of input (prompt) tokens
    max_output_length=2048,   # Maximum total sequence length (prompt + generated tokens)
    max_batch_size=8,         # Maximum batch size the engine is built for
)

# Prepare inputs
prompt = "How is autonomous vehicle technology transforming the future of transportation and urban planning?"
model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

# Generate text
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=256, 
    top_k=40, 
    top_p=0.7, 
    repetition_penalty=1.1,
)

# Decode and print the output
output_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output_text)

This approach gives you direct access to the generate() method, where you can specify advanced decoding strategies such as top_k, top_p, and repetition_penalty.
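
Since the tokenizer above is configured for left padding and the engine is built with max_batch_size=8, batched generation should also be possible. The following is a minimal sketch, reusing the model and tokenizer from the snippet above and assuming generate() accepts padded batches exactly like its transformers counterpart; the prompts and parameter values are illustrative only.

# Minimal sketch of batched generation, assuming generate() handles padded
# batches the same way the transformers API does.
prompts = [
    "How is autonomous vehicle technology transforming urban planning?",
    "What are the main challenges in scaling renewable energy storage?",
]

# Left padding (configured above) keeps generated tokens contiguous with
# the end of each prompt in the batch.
batch_inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

batch_ids = model.generate(
    **batch_inputs,
    max_new_tokens=128,
)

for text in tokenizer.batch_decode(batch_ids, skip_special_tokens=True):
    print(text)
    print("-" * 40)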