API Reference
This page provides detailed information about the core classes and functions in optimum-nvidia.
AutoModelForCausalLM
optimum.nvidia.AutoModelForCausalLM
This is the main class for loading and running causal language models. It acts as a factory, selecting the correct model-specific implementation (e.g., LlamaForCausalLM) based on the model's configuration.
from_pretrained()
This class method is the primary entry point for loading a model.
classmethod from_pretrained(
    model_id: str,
    export_config: ExportConfig = None,
    quantization_config: ModelOptRecipe = None,
    use_fp8: bool = False,
    force_export: bool = False,
    **kwargs
)
Parameters:

- model_id (str): The model identifier on the Hugging Face Hub or a path to a local directory.
- export_config (ExportConfig, optional): A configuration object specifying build parameters such as batch size, sequence lengths, and parallelism. If not provided, a default is inferred from the model's config.
- quantization_config (ModelOptRecipe, optional): A recipe for applying advanced quantization (e.g., AWQ). See the Quantization Guide.
- use_fp8 (bool, default=False): A shortcut to enable FP8 quantization with a default calibration recipe. Ignored if quantization_config is provided.
- force_export (bool, default=False): If True, forces the model to be rebuilt even if a cached engine exists.
- **kwargs: Additional keyword arguments passed to the underlying from_pretrained calls.
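A minimal loading sketch; the model id below is illustrative, and gated checkpoints require Hub access:

from optimum.nvidia import AutoModelForCausalLM

# Builds (or reuses a cached) TensorRT-LLM engine for the checkpoint,
# enabling FP8 quantization with the default calibration recipe.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # illustrative model id
    use_fp8=True,
)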
CausalLM
optimum.nvidia.runtime.CausalLM
This is the base class for all causal language models in the library, providing the core inference methods.
generate()
Generates text sequences from a prompt.
generate(input_ids: torch.Tensor, **kwargs) -> List
Parameters:

- input_ids (torch.Tensor): A tensor of token IDs representing the input prompt(s).
- **kwargs: Generation parameters that override the model's default GenerationConfig. Common arguments include max_new_tokens, do_sample, temperature, top_k, top_p, num_beams, etc.
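A sketch of a full generation round trip, assuming the matching Hugging Face tokenizer is used to encode the prompt and decode the first returned sequence (the model id is illustrative):

from transformers import AutoTokenizer
from optimum.nvidia import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"  # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("What is TensorRT-LLM?", return_tensors="pt")
# Keyword arguments override the model's default GenerationConfig.
generated = model.generate(
    inputs["input_ids"],
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# generate() returns token IDs; decode the first sequence back to text.
print(tokenizer.decode(generated[0], skip_special_tokens=True))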
agenerate()
The asynchronous version of generate().
async agenerate(input_ids: torch.Tensor, **kwargs) -> List
Returns a coroutine that resolves to the list of generated token IDs.
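A sketch of driving agenerate() from asyncio, reusing a model and tokenizer loaded as in the example above; the complete() helper is hypothetical:

import asyncio

async def complete(model, tokenizer, prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    # Await the coroutine; keyword arguments mirror those accepted by generate().
    generated = await model.agenerate(inputs["input_ids"], max_new_tokens=64)
    return tokenizer.decode(generated[0], skip_special_tokens=True)

# Example: asyncio.run(complete(model, tokenizer, "Summarize FP8 quantization."))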
ExportConfig
optimum.nvidia.ExportConfig
A data class that holds configuration for the TensorRT-LLM engine build process.
@dataclass
class ExportConfig:
    dtype: str
    max_input_len: int
    max_output_len: int
    max_batch_size: int
    max_beam_width: int = 1
    sharding: Optional[ShardingInfo] = None
    # ... and other parameters
Key Attributes:
- dtype (str): The data type for the engine (e.g., "float16").
- max_input_len (int): Maximum prompt length.
- max_output_len (int): Maximum total sequence length.
- max_batch_size (int): Maximum batch size.
- max_beam_width (int): Maximum number of beams for beam search.
- sharding (tensorrt_llm.Mapping): Configuration for tensor and pipeline parallelism.
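An illustrative build configuration; the concrete limits below are assumptions and should be sized for your model and hardware:

from optimum.nvidia import AutoModelForCausalLM, ExportConfig

export_config = ExportConfig(
    dtype="float16",
    max_input_len=1024,
    max_output_len=2048,
    max_batch_size=8,
    max_beam_width=1,
)
# The engine is (re)built with these limits baked in.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # illustrative model id
    export_config=export_config,
    force_export=True,
)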
pipeline
optimum.nvidia.pipeline
A factory function to create a high-level inference pipeline.
pipeline(task: str, model: str, **kwargs)
Parameters:

- task (str): The task to perform. Currently, only "text-generation" is supported.
- model (str): A model identifier from the Hugging Face Hub.
- **kwargs: Additional arguments passed to AutoModelForCausalLM.from_pretrained().
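A text-generation pipeline sketch; the model id and prompt are illustrative, and extra keyword arguments such as use_fp8 are forwarded to AutoModelForCausalLM.from_pretrained():

from optimum.nvidia import pipeline

pipe = pipeline(
    "text-generation",
    "meta-llama/Llama-2-7b-chat-hf",  # illustrative model id
    use_fp8=True,
)
print(pipe("Explain the benefits of FP8 inference in two sentences."))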