API Reference

This page provides detailed information about the core classes and functions in optimum-nvidia.

AutoModelForCausalLM

optimum.nvidia.AutoModelForCausalLM

This is the main class for loading and running causal language models. It acts as a factory, selecting the correct model-specific implementation (e.g., LlamaForCausalLM) based on the model's configuration.

from_pretrained()

This class method is the primary entry point for loading a model.

classmethod from_pretrained(
    model_id: str,
    export_config: Optional[ExportConfig] = None,
    quantization_config: Optional[ModelOptRecipe] = None,
    use_fp8: bool = False,
    force_export: bool = False,
    **kwargs
)

Parameters:

  • model_id (str): The model identifier on the Hugging Face Hub or a path to a local directory.
  • export_config (ExportConfig, optional): A configuration object specifying build parameters like batch size, sequence lengths, and parallelism. If not provided, a default is inferred from the model's config.
  • quantization_config (ModelOptRecipe, optional): A recipe for applying advanced quantization (e.g., AWQ). See the Quantization Guide.
  • use_fp8 (bool, default=False): A shortcut to enable FP8 quantization with a default calibration recipe. Ignored if quantization_config is provided.
  • force_export (bool, default=False): If True, forces the model to be rebuilt even if a cached engine exists.
  • **kwargs: Additional keyword arguments passed to the underlying from_pretrained calls.
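
Example: a minimal loading sketch. The model id "meta-llama/Llama-2-7b-chat-hf" is used purely for illustration; any supported causal-LM checkpoint works the same way.

from optimum.nvidia import AutoModelForCausalLM

# Downloads the checkpoint (or reads a local directory), builds a TensorRT-LLM
# engine on the first call, and reuses the cached engine afterwards.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    use_fp8=True,        # shortcut for FP8 quantization with the default calibration recipe
    force_export=False,  # set to True to rebuild the engine even if a cached one exists
)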

CausalLM

optimum.nvidia.runtime.CausalLM

This is the base class for all causal language models in the library, providing the core inference methods.

generate()

Generates sequences of token IDs from the given input prompt(s).

generate(input_ids: torch.Tensor, **kwargs) -> List

Parameters:

  • input_ids (torch.Tensor): A tensor of token IDs representing the input prompt(s).
  • **kwargs: Generation parameters that override the model's default GenerationConfig. Common arguments include max_new_tokens, do_sample, temperature, top_k, top_p, num_beams, etc.
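
Example: a short generation sketch. It assumes the model loaded above and a Hugging Face tokenizer for encoding and decoding; per the signature, generate() returns a list of generated token-ID sequences, one per prompt.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
input_ids = tokenizer("What is TensorRT-LLM?", return_tensors="pt").input_ids

# Sampling-based decoding; any GenerationConfig field can be overridden per call.
outputs = model.generate(
    input_ids,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))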

agenerate()

The asynchronous version of generate().

async agenerate(input_ids: torch.Tensor, **kwargs) -> List

Calling it returns a coroutine; awaiting that coroutine yields the list of generated token IDs.
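
Example: an asynchronous sketch, reusing the model, tokenizer, and input_ids from the generate() example. agenerate() is intended for use inside an existing event loop (e.g., an asyncio-based server); asyncio.run() is only used here to keep the snippet self-contained.

import asyncio

async def main():
    # Awaiting the coroutine yields the same list of token-ID sequences as generate().
    outputs = await model.agenerate(input_ids, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

asyncio.run(main())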

ExportConfig

optimum.nvidia.ExportConfig

A data class that holds configuration for the TensorRT-LLM engine build process.

@dataclass
class ExportConfig:
    dtype: str
    max_input_len: int
    max_output_len: int
    max_batch_size: int
    max_beam_width: int = 1
    sharding: Optional[ShardingInfo] = None
    # ... and other parameters

Key Attributes:

  • dtype (str): The data type for the engine (e.g., "float16").
  • max_input_len (int): Maximum prompt length.
  • max_output_len (int): Maximum total sequence length.
  • max_batch_size (int): Maximum batch size.
  • max_beam_width (int): Maximum number of beams for beam search.
  • sharding (ShardingInfo, optional): Configuration for tensor and pipeline parallelism (see tensorrt_llm.Mapping).
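
Example: a build-configuration sketch. The field values below are illustrative, and the exact constructor arguments may differ between versions, so treat this as a starting point rather than a canonical recipe.

from optimum.nvidia import AutoModelForCausalLM, ExportConfig

export_config = ExportConfig(
    dtype="float16",
    max_input_len=1024,
    max_output_len=2048,
    max_batch_size=8,
    max_beam_width=1,
)

# The engine is built against the limits declared above; requests exceeding them fail.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export_config=export_config,
)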

pipeline

optimum.nvidia.pipeline

A factory function to create a high-level inference pipeline.

pipeline(task: str, model: str, **kwargs)

Parameters:

  • task (str): The task to perform. Currently, only "text-generation" is supported.
  • model (str): A model identifier from the Hugging Face Hub.
  • **kwargs: Additional arguments passed to AutoModelForCausalLM.from_pretrained().
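
Example: a pipeline sketch. The extra use_fp8 keyword is forwarded to AutoModelForCausalLM.from_pretrained(); the call itself is assumed to accept generation keyword arguments per request, in the style of the transformers text-generation pipeline.

from optimum.nvidia import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    use_fp8=True,
)
outputs = pipe("Describe TensorRT-LLM in one sentence.", max_new_tokens=64)
print(outputs)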