Training Guide
The training process for Open-Vocabulary SAM (OVSAM) is a multi-stage pipeline designed to effectively transfer knowledge between the SAM and CLIP models. This guide outlines the key steps.
All training and utility scripts are executed using tools/dist.sh, which handles distributed training across multiple GPUs.
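Every command in this guide follows the same invocation pattern, sketched below from the examples in the following steps; the task name, config path, and GPU count are placeholders to be replaced with the values given in each step.

# General invocation pattern (angle brackets mark placeholders)
bash tools/dist.sh <task> <path/to/config.py> <num_gpus>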
Step 1: Generate Language Embeddings
Before training, you need to generate and cache the language embeddings for your target dataset's class names. These embeddings serve as the classification targets for the open-vocabulary recognition head.
Use the gen_cls.py script with the appropriate configuration file.
# Example for COCO dataset with 8 GPUs
bash tools/dist.sh gen_cls seg/configs/ovsam/ovsam_coco_rn50x16_point.py 8
This script will process the class names defined in the dataset configuration, generate text embeddings using the CLIP model, and save them for later use.
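The same command applies to other dataset configurations; only the config argument changes. The path below is a placeholder, not a verified file in the repository.

# Hypothetical example for another dataset config with 8 GPUs
bash tools/dist.sh gen_cls seg/configs/ovsam/<your_dataset_config>.py 8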
Step 2: SAM2CLIP Training
This stage transfers segmentation knowledge from SAM to the CLIP model.
2.1. Extract SAM Features
First, you must pre-compute and dump the image features from the SAM backbone for your training dataset. This speeds up the subsequent distillation process significantly.
# Example for SAM ViT-H backbone with 8 GPUs
bash tools/dist.sh test seg/configs/sam2clip/sam_vith_dump.py 8
This command runs the model in inference mode and saves the features to disk, as configured in the sam_distill.py dataset config.
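Before moving on, it is worth confirming that the feature dump completed and checking how much disk space it consumes. The directory below is a placeholder; use the output path configured in sam_distill.py.

# Hypothetical check of the dumped feature directory
du -sh <path/to/sam_feature_dump>
ls <path/to/sam_feature_dump> | head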
2.2. Train the SAM2CLIP Adapter
With the SAM features cached, you can now train the SAM2CLIP adapters. This step involves distilling the pre-extracted SAM features into the CLIP model.
# Example for ViT-H SAM and RN50x16 CLIP with 8 GPUs
bash tools/dist.sh train seg/configs/sam2clip/sam2clip_vith_rn50x16.py 8
This will produce a checkpoint containing the trained CLIP backbone and the multi-layer transformer neck (adapters).
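If the training setup follows the usual MMEngine convention (an assumption, not verified against this repository), checkpoints are written under work_dirs/ in a folder named after the config file and can be located like this:

# Assumes the default MMEngine work_dirs layout; adjust if work_dir is overridden in the config
ls work_dirs/sam2clip_vith_rn50x16/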
Step 3: CLIP2SAM Training
In the final stage, the recognition knowledge from the now-enhanced CLIP model is transferred into the SAM mask decoder.
# Example for COCO dataset with 8 GPUs
bash tools/dist.sh train seg/configs/clip2sam/clip2sam_coco_rn50x16.py 8
This training step fine-tunes the SAM mask decoder and its classification head while keeping the pre-trained SAM image encoder and the SAM2CLIP-enhanced CLIP backbone frozen.
After this step, you will have a fully trained OVSAM model ready for inference.
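To sanity-check the final model, you can run the same dist.sh entry point in test mode with the OVSAM config from Step 1. Whether the trained checkpoint is picked up automatically or must be set via the config depends on your setup, so treat this as a sketch rather than a verified command.

# Example evaluation of the trained OVSAM model on COCO with 8 GPUs (sketch)
bash tools/dist.sh test seg/configs/ovsam/ovsam_coco_rn50x16_point.py 8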