Guide: Text Generation with generate()

The generate() method provided by optimum.nvidia.AutoModelForCausalLM is the primary interface for text generation, offering a rich set of features for controlling the output.

Basic Generation

As shown in the Quick Start, the simplest way to generate text is to provide input_ids:

from optimum.nvidia import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"

model = AutoModelForCausalLM.from_pretrained(model_id, use_fp8=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What are the best things to do in Paris?", return_tensors="pt").to("cuda")

# Generate text with default settings
generated_ids = model.generate(inputs["input_ids"], max_new_tokens=100)
decoded_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(decoded_text[0])

Advanced Decoding Strategies

TensorRT-LLM supports a wide range of decoding strategies. You can control these by passing arguments directly to the generate() method. These parameters correspond to the transformers GenerationConfig.

Here are some of the most common parameters:

  • max_new_tokens (int): The maximum number of new tokens to generate.
  • do_sample (bool): If True, enables sampling-based decoding (e.g., using temperature, top-k, top-p).
  • temperature (float): Modulates the next token probabilities. Lower values make the model more deterministic.
  • top_k (int): Restricts sampling to the k most likely next tokens.
  • top_p (float): Restricts sampling to a cumulative probability mass p (nucleus sampling).
  • repetition_penalty (float): Penalizes tokens that have already appeared in the text, discouraging repetition.
  • num_beams (int): The number of beams for beam-search decoding. Set to 1 for greedy or sampling decoding (see the beam-search sketch after the sampling example below).

Example: Controlled Sampling

# Use top-k and top-p sampling with a specific temperature
generated_ids = model.generate(
    inputs["input_ids"],
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95
)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
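
Example: Beam Search

The same call pattern covers deterministic strategies. The sketch below uses num_beams and repetition_penalty from the list above; it assumes the model, tokenizer, and inputs from the Basic Generation example and that these parameters are forwarded to the TensorRT-LLM runtime like the sampling ones. The values 4 and 1.2 are illustrative, not recommendations.

# Beam-search sketch: keep several candidate sequences and return the best one
generated_ids = model.generate(
    inputs["input_ids"],
    max_new_tokens=100,
    do_sample=False,          # beam search is deterministic; sampling is off
    num_beams=4,              # number of candidate beams kept per step (illustrative)
    repetition_penalty=1.2    # discourage tokens already present in the output (illustrative)
)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])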

Asynchronous Generation

For applications requiring non-blocking execution, optimum-nvidia provides an asynchronous agenerate() method. Awaiting it yields the generated token IDs, which makes it particularly useful in web servers and other I/O-bound applications.

See the example at examples/async-text-generation.py for a practical demonstration:

import asyncio
from transformers import AutoTokenizer
from optimum.nvidia import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"

# Model and tokenizer setup as before
model = AutoModelForCausalLM.from_pretrained(model_id, use_fp8=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

async def main():
    prompt = "What is the latest generation of Nvidia GPUs?"
    tokens = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Await the generation result
    generated_ids = await model.agenerate(
        tokens["input_ids"],
        max_new_tokens=50
    )

    generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    print(generated_text[0])

if __name__ == "__main__":
    asyncio.run(main())
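
Example: Concurrent Asynchronous Requests

Because agenerate() can be awaited, several prompts can be issued concurrently with asyncio.gather. The sketch below assumes the same model and tokenizer setup as above; generate_one and main_concurrent are hypothetical helpers used only for this illustration, and whether the requests actually overlap on the GPU depends on the TensorRT-LLM runtime rather than on this code.

import asyncio

async def generate_one(prompt: str) -> str:
    # Tokenize a single prompt and await its generation result
    tokens = tokenizer(prompt, return_tensors="pt").to("cuda")
    generated_ids = await model.agenerate(tokens["input_ids"], max_new_tokens=50)
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

async def main_concurrent():
    prompts = [
        "What is the latest generation of Nvidia GPUs?",
        "What are the best things to do in Paris?",
    ]
    # Schedule all requests at once and wait for every result
    results = await asyncio.gather(*(generate_one(p) for p in prompts))
    for prompt, text in zip(prompts, results):
        print(f"{prompt}\n{text}\n")

if __name__ == "__main__":
    asyncio.run(main_concurrent())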