Guide: Text Generation with generate()
The generate() method provided by optimum.nvidia.AutoModelForCausalLM is the primary interface for text generation, offering a rich set of features for controlling the output.
Basic Generation
As shown in the Quick Start, the simplest way to generate text is to provide input_ids:
from optimum.nvidia import AutoModelForCausalLM
from transformers import AutoTokenizer
model_id = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, use_fp8=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("What are the best things to do in Paris?", return_tensors="pt").to("cuda")
# Generate text with default settings
generated_ids = model.generate(inputs["input_ids"], max_new_tokens=100)
decoded_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(decoded_text[0])
Advanced Decoding Strategies
TensorRT-LLM supports a wide range of decoding strategies. You can control these by passing arguments directly to the generate() method. These parameters correspond to the transformers GenerationConfig.
Here are some of the most common parameters:
max_new_tokens (int): The maximum number of new tokens to generate.
do_sample (bool): If True, enables sampling-based decoding (e.g., using temperature, top-k, top-p).
temperature (float): Modulates the next token probabilities. Lower values make the model more deterministic.
top_k (int): Restricts sampling to the k most likely next tokens.
top_p (float): Restricts sampling to a cumulative probability mass p (nucleus sampling).
repetition_penalty (float): Penalizes tokens that have already appeared in the text, discouraging repetition.
num_beams (int): The number of beams for beam-search decoding. Set to 1 for greedy or sampling decoding.
Example: Controlled Sampling
# Use top-k and top-p sampling with a specific temperature
generated_ids = model.generate(
    inputs["input_ids"],
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95
)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
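Example: Beam Search with a Repetition Penalty
Since num_beams and repetition_penalty are part of the same transformers-style GenerationConfig, a beam-search call can be sketched in the same way. The specific values below (4 beams, a 1.2 penalty) are illustrative, not recommendations:
# Deterministic beam-search decoding with a repetition penalty (illustrative values)
generated_ids = model.generate(
    inputs["input_ids"],
    max_new_tokens=100,
    do_sample=False,
    num_beams=4,
    repetition_penalty=1.2
)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])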
Asynchronous Generation
For applications requiring non-blocking execution, optimum-nvidia provides an asynchronous agenerate() method. It returns an awaitable coroutine that resolves to the generated token IDs. This is particularly useful in web servers or other I/O-bound applications.
See the example at examples/async-text-generation.py for a practical demonstration:
import asyncio
from transformers import AutoTokenizer
from optimum.nvidia import AutoModelForCausalLM

# Model and tokenizer setup, as in the basic example above
model_id = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, use_fp8=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

async def main():
    prompt = "What is the latest generation of Nvidia GPUs?"
    tokens = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Await the generation result instead of blocking the event loop
    generated_ids = await model.agenerate(
        tokens["input_ids"],
        max_new_tokens=50
    )

    generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    print(generated_text)

if __name__ == "__main__":
    asyncio.run(main())
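Because agenerate() is awaitable, several prompts can also be scheduled concurrently with standard asyncio tooling. The snippet below is a minimal sketch of that pattern; the helper names generate_one and run_batch are illustrative, and the model and tokenizer are assumed to be set up as above:
async def generate_one(prompt: str) -> str:
    # Tokenize a single prompt and await its generation
    tokens = tokenizer(prompt, return_tensors="pt").to("cuda")
    generated_ids = await model.agenerate(tokens["input_ids"], max_new_tokens=50)
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

async def run_batch():
    prompts = [
        "What is the latest generation of Nvidia GPUs?",
        "What are the best things to do in Paris?",
    ]
    # Schedule both requests concurrently and await all results
    results = await asyncio.gather(*(generate_one(p) for p in prompts))
    for text in results:
        print(text)

asyncio.run(run_batch())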