Training Guide
The training process for Open-Vocabulary SAM (OVSAM) is a multi-stage pipeline designed to effectively transfer knowledge between the SAM and CLIP models. This guide outlines the key steps.
All training and utility scripts are executed using tools/dist.sh, which handles distributed training across multiple GPUs.
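Every command in this guide follows the same invocation pattern, sketched below from the examples in the following steps; the task name, config path, and GPU count are placeholders to be replaced with the values given in each step.

# General invocation pattern (angle brackets mark placeholders)
bash tools/dist.sh <task> <path/to/config.py> <num_gpus>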
Step 1: Generate Language Embeddings
Before training, you need to generate and cache the language embeddings for your target dataset's class names. These embeddings serve as the classification targets for the open-vocabulary recognition head.
Use the gen_cls.py script with the appropriate configuration file.
# Example for COCO dataset with 8 GPUs
bash tools/dist.sh gen_cls seg/configs/ovsam/ovsam_coco_rn50x16_point.py 8
This script will process the class names defined in the dataset configuration, generate text embeddings using the CLIP model, and save them for later use.
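The same command applies to other dataset configurations; only the config argument changes. The path below is a placeholder, not a verified file in the repository.

# Hypothetical example for another dataset config with 8 GPUs
bash tools/dist.sh gen_cls seg/configs/ovsam/<your_dataset_config>.py 8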
Step 2: SAM2CLIP Training
This stage transfers segmentation knowledge from SAM to the CLIP model.
2.1. Extract SAM Features
First, you must pre-compute and dump the image features from the SAM backbone for your training dataset. This speeds up the subsequent distillation process significantly.
# Example for SAM ViT-H backbone with 8 GPUs
bash tools/dist.sh test seg/configs/sam2clip/sam_vith_dump.py 8
This command runs the model in inference mode and saves the features to disk, as configured in the sam_distill.py dataset config.
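Before moving on, it is worth confirming that the feature dump completed and checking how much disk space it consumes. The directory below is a placeholder; use the output path configured in sam_distill.py.

# Hypothetical check of the dumped feature directory
du -sh <path/to/sam_feature_dump>
ls <path/to/sam_feature_dump> | head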
2.2. Train the SAM2CLIP Adapter
With the SAM features cached, you can now train the SAM2CLIP adapters. This step involves distilling the pre-extracted SAM features into the CLIP model.
# Example for ViT-H SAM and RN50x16 CLIP with 8 GPUs
bash tools/dist.sh train seg/configs/sam2clip/sam2clip_vith_rn50x16.py 8
This will produce a checkpoint containing the trained CLIP backbone and the multi-layer transformer neck (adapters).
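If the training setup follows the usual MMEngine convention (an assumption, not verified against this repository), checkpoints are written under work_dirs/ in a folder named after the config file and can be located like this:

# Assumes the default MMEngine work_dirs layout; adjust if work_dir is overridden in the config
ls work_dirs/sam2clip_vith_rn50x16/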
Step 3: CLIP2SAM Training
In the final stage, the recognition knowledge from the now-enhanced CLIP model is transferred into the SAM mask decoder.
# Example for COCO dataset with 8 GPUs
bash tools/dist.sh train seg/configs/clip2sam/clip2sam_coco_rn50x16.py 8
This training step fine-tunes the SAM mask decoder and its classification head while keeping the pre-trained SAM image encoder and the SAM2CLIP-enhanced CLIP backbone frozen.
After this step, you will have a fully trained OVSAM model ready for inference.
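To sanity-check the final model, you can run the same dist.sh entry point in test mode with the OVSAM config from Step 1. Whether the trained checkpoint is picked up automatically or must be set via the config depends on your setup, so treat this as a sketch rather than a verified command.

# Example evaluation of the trained OVSAM model on COCO with 8 GPUs (sketch)
bash tools/dist.sh test seg/configs/ovsam/ovsam_coco_rn50x16_point.py 8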