API Reference
This page provides detailed information about the core classes and functions in optimum-nvidia.
AutoModelForCausalLM
optimum.nvidia.AutoModelForCausalLM
This is the main class for loading and running causal language models. It acts as a factory, selecting the correct model-specific implementation (e.g., LlamaForCausalLM) based on the model's configuration.
from_pretrained()
This class method is the primary entry point for loading a model.
classmethod from_pretrained(
model_id: str,
export_config: ExportConfig = None,
quantization_config: ModelOptRecipe = None,
use_fp8: bool = False,
force_export: bool = False,
**kwargs
)
Parameters:
- model_id (str): The model identifier on the Hugging Face Hub or a path to a local directory.
- export_config (ExportConfig, optional): A configuration object specifying build parameters such as batch size, sequence lengths, and parallelism. If not provided, a default is inferred from the model's config.
- quantization_config (ModelOptRecipe, optional): A recipe for applying advanced quantization (e.g., AWQ). See the Quantization Guide.
- use_fp8 (bool, default=False): A shortcut to enable FP8 quantization with a default calibration recipe. Ignored if quantization_config is provided.
- force_export (bool, default=False): If True, forces the model to be rebuilt even if a cached engine exists.
- **kwargs: Additional keyword arguments passed to the underlying from_pretrained calls.
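A minimal loading sketch (the model id and the choice of FP8 are illustrative; any supported causal LM on the Hub should work):

from optimum.nvidia import AutoModelForCausalLM

# Builds (or reuses a cached) TensorRT-LLM engine and returns a ready-to-run model.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # illustrative model id
    use_fp8=True,                      # optional: FP8 with the default calibration recipe
)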
CausalLM
optimum.nvidia.runtime.CausalLM
This is the base class for all causal language models in the library, providing the core inference methods.
generate()
Generates text sequences from a prompt.
generate(input_ids: torch.Tensor, **kwargs) -> List
Parameters:
- input_ids (torch.Tensor): A tensor of token IDs representing the input prompt(s).
- **kwargs: Generation parameters that override the model's default GenerationConfig. Common arguments include max_new_tokens, do_sample, temperature, top_k, top_p, num_beams, etc.
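A short sketch of running generation (the prompt and sampling values are illustrative, and `model` is the instance returned by AutoModelForCausalLM.from_pretrained() above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
inputs = tokenizer("What is TensorRT-LLM?", return_tensors="pt")

generated = model.generate(
    inputs["input_ids"],
    max_new_tokens=64,
    temperature=0.8,
    top_p=0.95,
)
# Decode the first returned sequence of token IDs back to text.
print(tokenizer.decode(generated[0], skip_special_tokens=True))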
agenerate()
The asynchronous version of generate().
async agenerate(input_ids: torch.Tensor, **kwargs) -> List
Returns a coroutine that resolves to the list of generated token IDs.
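A sketch of asynchronous generation with asyncio, reusing the `model`, `tokenizer`, and `inputs` from the generate() example above (the generation parameters are illustrative):

import asyncio

async def main():
    # Await the coroutine so generation does not block the event loop,
    # e.g. when serving requests from an async web server.
    tokens = await model.agenerate(inputs["input_ids"], max_new_tokens=64)
    print(tokenizer.decode(tokens[0], skip_special_tokens=True))

asyncio.run(main())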
ExportConfig
optimum.nvidia.ExportConfig
A data class that holds configuration for the TensorRT-LLM engine build process.
@dataclass
class ExportConfig:
dtype: str
max_input_len: int
max_output_len: int
max_batch_size: int
max_beam_width: int = 1
sharding: Optional[ShardingInfo] = None
# ... and other parameters
Key Attributes:
- dtype (str): The data type for the engine (e.g., "float16").
- max_input_len (int): Maximum prompt length.
- max_output_len (int): Maximum total sequence length.
- max_batch_size (int): Maximum batch size.
- max_beam_width (int): Maximum number of beams for beam search.
- sharding (ShardingInfo, optional): Configuration for tensor and pipeline parallelism.
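A sketch of building a model with an explicit ExportConfig (the field values and model id are illustrative and should be sized to your workload and hardware):

from optimum.nvidia import AutoModelForCausalLM, ExportConfig

export_config = ExportConfig(
    dtype="float16",
    max_input_len=1024,
    max_output_len=2048,
    max_batch_size=8,
    max_beam_width=1,
)

# The engine is built with the limits above; requests exceeding them will fail.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export_config=export_config,
)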
pipeline
optimum.nvidia.pipeline
A factory function to create a high-level inference pipeline.
pipeline(task: str, model: str, **kwargs)
Parameters:
- task (str): The task to perform. Currently, only "text-generation" is supported.
- model (str): A model identifier from the Hugging Face Hub.
- **kwargs: Additional arguments passed to AutoModelForCausalLM.from_pretrained().
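A minimal sketch of the pipeline factory (the model id, prompt, and generation parameters are illustrative):

from optimum.nvidia import pipeline

# Extra keyword arguments are forwarded to AutoModelForCausalLM.from_pretrained().
pipe = pipeline("text-generation", "meta-llama/Llama-2-7b-chat-hf", use_fp8=True)
outputs = pipe("Explain tensor parallelism in one sentence.", max_new_tokens=64)
print(outputs)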