Guide: Quantization with ModelOpt

Quantization is a key technique for optimizing LLM inference: it reduces both memory footprint and latency. optimum-nvidia integrates with NVIDIA's TensorRT Model Optimizer (ModelOpt) to provide quantization methods such as FP8 and INT4-AWQ.

Core Concepts

  • ModelOptRecipe: A protocol that bundles the quantization configuration with the calibration dataset. You implement it to specify how your model should be quantized (see the sketch after this list).
  • ModelOptQuantizer: The internal component that takes a ModelOptRecipe and applies the quantization process to a Hugging Face model during the from_pretrained call.
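
A minimal sketch of the shape a recipe takes, assuming the same interface the concrete INT4-AWQ example later in this guide implements (the class and property names here are placeholders):

from typing import Iterable

from optimum.nvidia.compression.modelopt import ModelOptConfig, ModelOptRecipe

class MyRecipe(ModelOptRecipe):
    @property
    def config(self) -> ModelOptConfig:
        # Return the ModelOpt quantization configuration to apply.
        ...

    @property
    def dataset(self) -> Iterable:
        # Return an iterable of tokenized samples used for calibration.
        ...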

Common Quantization Methods

FP8 Quantization

On GPUs with FP8 support (Hopper and Ada Lovelace architectures), the easiest way to enable FP8 quantization is to pass use_fp8=True to from_pretrained or pipeline:

from optimum.nvidia import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    use_fp8=True
)

This uses a default calibration recipe and applies FP8 quantization to weights and activations.
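
The same flag can also be passed through the pipeline API. A hedged sketch, assuming pipeline is importable from optimum.nvidia.pipelines and forwards model keyword arguments such as use_fp8 to from_pretrained:

from optimum.nvidia.pipelines import pipeline

# use_fp8 is assumed to be forwarded to the underlying from_pretrained call.
pipe = pipeline("text-generation", "google/gemma-2b", use_fp8=True)
print(pipe("Summarize why FP8 reduces memory usage.", max_new_tokens=64))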

Custom Quantization Recipes (AWQ, W4A8, etc.)

For more advanced methods like AWQ (Activation-aware Weight Quantization), you need to define a custom ModelOptRecipe. The example script examples/quantization.py provides an excellent template.

Let's break down how to create a custom recipe for INT4-AWQ.

1. Define the ModelOptRecipe

Create a class that inherits from ModelOptRecipe and implements the config and dataset properties.

from typing import Iterable
from datasets import load_dataset
from modelopt.torch.quantization import INT4_AWQ_REAL_QUANT_CFG, QuantizeConfig
from transformers import PreTrainedTokenizer

from optimum.nvidia.compression.modelopt import ModelOptConfig, ModelOptRecipe

class C4AWQRecipe(ModelOptRecipe):
    def __init__(self, tokenizer: PreTrainedTokenizer, num_samples: int = 512):
        self._tokenizer = tokenizer
        self._num_samples = num_samples

    @property
    def config(self) -> ModelOptConfig:
        # Use a predefined configuration from ModelOpt for INT4-AWQ
        qconfig = QuantizeConfig(**INT4_AWQ_REAL_QUANT_CFG)
        return ModelOptConfig(qconfig, sparsity=None)

    @property
    def dataset(self) -> Iterable:
        # Provide a representative dataset for calibration
        data = load_dataset("allenai/c4", "en", split="train", streaming=True)
        calib_data = []
        for i, sample in enumerate(data):
            if i >= self._num_samples:
                break
            tokenized_sample = self._tokenizer(
                sample["text"],
                truncation=True,
                max_length=2048,
                return_tensors="pt"
            )
            calib_data.append({
                "input_ids": tokenized_sample["input_ids"].to("cuda")
            })
        return calib_data
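
A note on the design: loading C4 with streaming=True means only the first num_samples examples are pulled over the network, and truncating each sample to 2048 tokens keeps the calibration pass bounded in memory. A few hundred representative samples is generally enough for AWQ calibration, but you can raise num_samples if your deployment domain differs substantially from C4.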

2. Use the Recipe with from_pretrained

Instantiate your recipe and pass it to the quantization_config argument.

from optimum.nvidia import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Instantiate your custom recipe
awq_recipe = C4AWQRecipe(tokenizer)

# The library will now perform INT4-AWQ quantization during the build process
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=awq_recipe
)

# Now you can run inference with the quantized model
# ...
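
As a quick sanity check, inference with the quantized model can look like the minimal sketch below. The exact return value of generate() has varied across optimum-nvidia versions (a plain tensor of token ids or a (tokens, lengths) pair), so treat the decoding step as an assumption to adapt:

prompt = "Explain activation-aware weight quantization in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Assumption: generate() may return either a tensor of token ids or a
# (tokens, lengths) tuple depending on the installed version.
generated = model.generate(inputs["input_ids"], max_new_tokens=128)
tokens = generated[0] if isinstance(generated, tuple) else generated
print(tokenizer.decode(tokens[0], skip_special_tokens=True))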

This process gives you full control over the quantization method and calibration data, allowing you to tailor the optimization to your specific needs.