Guide: Quantization with ModelOpt
Quantization is a key technique for optimizing LLM inference, reducing both memory usage and latency. `optimum-nvidia` integrates with NVIDIA's TensorRT Model Optimizer (ModelOpt) to provide quantization methods such as FP8 and INT4-AWQ.
Core Concepts
- `ModelOptRecipe`: a protocol that defines the quantization configuration and the calibration dataset. You implement this class to specify how your model should be quantized (a minimal skeleton is sketched just after this list).
- `ModelOptQuantizer`: the internal component that takes a `ModelOptRecipe` and applies the quantization process to a Hugging Face model during the `from_pretrained` call.
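Concretely, a recipe only needs to expose two properties, `config` and `dataset`. The skeleton below is a minimal sketch of that shape; the property names mirror the full INT4-AWQ example later in this guide, and the class name `MyRecipe` is just a placeholder.

```python
from typing import Iterable

from optimum.nvidia.compression.modelopt import ModelOptConfig, ModelOptRecipe


class MyRecipe(ModelOptRecipe):
    @property
    def config(self) -> ModelOptConfig:
        # Return the ModelOpt quantization configuration to apply.
        ...

    @property
    def dataset(self) -> Iterable:
        # Return an iterable of tokenized calibration samples.
        ...
```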
Common Quantization Methods
FP8 Quantization
For GPUs with FP8 support (Hopper and Ada Lovelace architectures), the easiest way to enable FP8 quantization is to pass the `use_fp8=True` flag to `from_pretrained` or `pipeline`:
```python
from optimum.nvidia import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    use_fp8=True,
)
```
This uses a default calibration recipe and applies FP8 quantization to weights and activations.
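The flag is also accepted by the pipeline API. A minimal sketch, assuming the `optimum.nvidia.pipelines` import path; the prompt and `max_new_tokens` value are illustrative:

```python
from optimum.nvidia.pipelines import pipeline

# Build an FP8-quantized text-generation pipeline.
pipe = pipeline("text-generation", "google/gemma-2b", use_fp8=True)
print(pipe("What does FP8 quantization change for LLM inference?", max_new_tokens=64))
```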
Custom Quantization Recipes (AWQ, W4A8, etc.)
For more advanced methods such as AWQ (Activation-aware Weight Quantization), you need to define a custom `ModelOptRecipe`. The example script `examples/quantization.py` provides a good template.
Let's break down how to create a custom recipe for INT4-AWQ.
1. Define the `ModelOptRecipe`
Create a class that inherits from `ModelOptRecipe` and implements the `config` and `dataset` properties.
```python
from typing import Iterable

import numpy as np
import torch
from datasets import load_dataset
from modelopt.torch.quantization import INT4_AWQ_REAL_QUANT_CFG, QuantizeConfig
from transformers import PreTrainedTokenizer

from optimum.nvidia.compression.modelopt import ModelOptConfig, ModelOptRecipe


class C4AWQRecipe(ModelOptRecipe):
    def __init__(self, tokenizer: PreTrainedTokenizer, num_samples: int = 512):
        self._tokenizer = tokenizer
        self._num_samples = num_samples

    @property
    def config(self) -> ModelOptConfig:
        # Use a predefined configuration from ModelOpt for INT4-AWQ
        qconfig = QuantizeConfig(**INT4_AWQ_REAL_QUANT_CFG)
        return ModelOptConfig(qconfig, sparsity=None)

    @property
    def dataset(self) -> Iterable:
        # Provide a representative dataset for calibration
        data = load_dataset("allenai/c4", "en", split="train", streaming=True)
        calib_data = []
        for i, sample in enumerate(data):
            if i >= self._num_samples:
                break
            tokenized_sample = self._tokenizer(
                sample["text"],
                truncation=True,
                max_length=2048,
                return_tensors="pt",
            )
            calib_data.append({
                "input_ids": tokenized_sample["input_ids"].to("cuda")
            })
        return calib_data
```
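The same recipe structure covers the other methods mentioned above; only the `config` property changes. As a hedged sketch, assuming your installed ModelOpt version exposes a W4A8 AWQ configuration named `W4A8_AWQ_BETA_CFG` (check `modelopt.torch.quantization` for the configs your install actually provides):

```python
from modelopt.torch.quantization import W4A8_AWQ_BETA_CFG, QuantizeConfig

from optimum.nvidia.compression.modelopt import ModelOptConfig


class C4W4A8Recipe(C4AWQRecipe):
    """Reuses the C4 calibration dataset above and swaps only the quantization config."""

    @property
    def config(self) -> ModelOptConfig:
        # W4A8_AWQ_BETA_CFG is assumed to exist in your ModelOpt version.
        return ModelOptConfig(QuantizeConfig(**W4A8_AWQ_BETA_CFG), sparsity=None)
```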
2. Use the Recipe with `from_pretrained`
Instantiate your recipe and pass it to the `quantization_config` argument.
```python
from optimum.nvidia import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Instantiate your custom recipe
awq_recipe = C4AWQRecipe(tokenizer)

# The library will now perform INT4-AWQ quantization during the build process
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=awq_recipe,
)

# Now you can run inference with the quantized model
# ...
```
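From here, generation follows the usual `transformers`-style flow. A minimal sketch: the prompt and decoding parameters are illustrative, and the exact return shape of `generate` can vary between `optimum-nvidia` releases, so adapt the decoding step to your version.

```python
prompt = "Explain the benefits of INT4-AWQ quantization in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Cap the number of newly generated tokens; tune to your use case.
output_ids = model.generate(**inputs, max_new_tokens=128)

# If your optimum-nvidia version returns a (token_ids, lengths) tuple,
# decode the first element of that tuple instead.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```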
This process gives you full control over the quantization method and calibration data, allowing you to tailor the optimization to your specific needs.