# Quick Start Guide
Optimum-NVIDIA is designed to be a drop-in replacement for `transformers` for inference, allowing you to accelerate your models with minimal code changes. Here are two common ways to get started: using the high-level `pipeline` API for simplicity, or the `AutoModelForCausalLM` class for more control.
## Method 1: Using the `pipeline` API
The `pipeline` function provides a simple and powerful abstraction for running inference. If you are already using a `transformers` pipeline, you can switch to `optimum-nvidia` by changing a single import statement.
```python
# Before: from transformers import pipeline
from optimum.nvidia import pipeline
from huggingface_hub import login

# Recommended: log in to the Hugging Face Hub (required for gated models such as Llama 2)
# login("YOUR_HF_TOKEN")

# Initialize the pipeline.
# use_fp8=True enables FP8 quantization for massive speedups on compatible hardware
# (GPUs with native FP8 support, e.g. NVIDIA Ada Lovelace or Hopper).
pipe = pipeline(
    'text-generation',
    model='meta-llama/Llama-2-7b-chat-hf',
    use_fp8=True,
)

# Run inference
prompt = "Describe a real-world application of AI in sustainable energy."
outputs = pipe(prompt)
print(outputs)
```
Behind the scenes, the `pipeline` function handles:
- Downloading the model from the Hugging Face Hub.
- Converting the model to a TensorRT-LLM engine on the fly.
- Running inference with the optimized engine.
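As with a standard `transformers` pipeline, you can also tune the output by passing generation keyword arguments directly to the call. The snippet below is a minimal sketch reusing the `pipe` object from above; it assumes arguments such as `max_new_tokens` are forwarded to the underlying engine, so check your installed version if a parameter is not recognized.

```python
# Limit the response length by forwarding max_new_tokens to the generation call.
# (Assumes generation kwargs are passed through as in transformers pipelines.)
outputs = pipe(
    "Describe a real-world application of AI in sustainable energy.",
    max_new_tokens=128,
)
print(outputs)
```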
## Method 2: Using `AutoModelForCausalLM`
For more fine-grained control over generation parameters and model configuration, you can use the `AutoModelForCausalLM` class. This approach is also a near drop-in replacement for its `transformers` equivalent.
```python
# Before: from transformers import AutoModelForCausalLM
from optimum.nvidia import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"

# Initialize the tokenizer. Left padding keeps prompts right-aligned,
# which is what decoder-only models expect for batched generation.
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Initialize the model with performance settings.
# The from_pretrained call handles the conversion to a TensorRT-LLM engine.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    use_fp8=True,            # Enable FP8 quantization
    max_prompt_length=1024,  # Longest prompt (in tokens) the engine will accept
    max_output_length=2048,  # Maximum total sequence length (prompt + generated tokens)
    max_batch_size=8,        # Maximum number of sequences processed together
)

# Prepare inputs
prompt = "How is autonomous vehicle technology transforming the future of transportation and urban planning?"
model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

# Generate text
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=256,
    top_k=40,
    top_p=0.7,
    repetition_penalty=1.1,
)

# Decode and print the output
output_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output_text)
```
This approach gives you direct access to the `generate()` method, where you can specify advanced decoding strategies such as `top_k`, `top_p`, and `repetition_penalty`.
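Since the tokenizer is configured for left padding and the engine was built with `max_batch_size=8`, the same objects can serve several prompts in one call. The sketch below reuses `model` and `tokenizer` from the example above; the prompt strings are illustrative, and the decode step follows the same pattern as the single-prompt example.

```python
# Batch multiple prompts; padding=True pads them to a common length on the left.
prompts = [
    "Summarize the key idea behind FP8 quantization.",
    "What is a TensorRT-LLM engine?",
]
batch_inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

batch_ids = model.generate(
    **batch_inputs,
    max_new_tokens=128,
    top_k=40,
    top_p=0.7,
)

# Decode each sequence in the batch
for text in tokenizer.batch_decode(batch_ids, skip_special_tokens=True):
    print(text)
```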