Training

The training engine is located in gluefactory/train.py. It handles model initialization, data loading, optimization, logging, and checkpointing.

Running Training

python -m gluefactory.train <experiment_name> --conf <path_to_config>
  • <experiment_name>: An identifier for the run. Results are saved to outputs/training/<experiment_name>/.
  • --conf: Path to a YAML config file, or the name of a config shipped in gluefactory/configs/.
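
A config typically defines the dataset, the model, and the optimization settings. The sketch below is illustrative only; the exact keys and values to use are those of the configs shipped in gluefactory/configs/:

data:
    name: homographies          # dataset module (illustrative value)
    batch_size: 32
model:
    name: two_view_pipeline     # model module (illustrative value)
train:
    epochs: 40
    lr: 1.0e-4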

Key Training Features

Distributed Training

To train on multiple GPUs on a single node, use the --distributed flag:

python -m gluefactory.train my_experiment --conf ... --distributed
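
Multi-GPU training in PyTorch is typically implemented with one process per GPU and gradient all-reduce via DistributedDataParallel. The minimal sketch below shows that generic pattern for orientation; it is not the exact code in gluefactory/train.py:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    # One process per GPU; rendezvous over localhost (illustrative defaults).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(torch.nn.Linear(8, 2).to(rank), device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(10):
        x = torch.randn(32, 8, device=rank)  # stand-in for a real data loader
        loss = model(x).square().mean()      # stand-in for a real loss
        optimizer.zero_grad()
        loss.backward()                      # DDP all-reduces gradients here
        optimizer.step()
    dist.destroy_process_group()


if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)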

Mixed Precision

Use --mixed_precision (or --mp) to enable automatic mixed precision (AMP) for lower memory usage and faster training:

python -m gluefactory.train my_experiment ... --mp float16
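
Conceptually, AMP runs the forward pass in a reduced-precision dtype while keeping master weights in float32, and scales the loss to avoid float16 gradient underflow. A minimal sketch of the generic PyTorch pattern (not necessarily the exact implementation in train.py):

import torch

model = torch.nn.Linear(8, 2).cuda()       # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()       # rescales the loss against float16 underflow

for _ in range(10):
    x = torch.randn(32, 8, device="cuda")
    optimizer.zero_grad()
    with torch.autocast("cuda", dtype=torch.float16):
        loss = model(x).square().mean()    # forward pass runs in float16 where safe
    scaler.scale(loss).backward()
    scaler.step(optimizer)                 # unscales gradients, skips step on inf/nan
    scaler.update()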

Restoring Training

To resume an interrupted run from its most recent checkpoint (to fine-tune from another experiment's weights instead, see below):

python -m gluefactory.train my_experiment --restore
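
Resuming amounts to reloading the model, optimizer, and progress counters from the latest checkpoint in the experiment directory. A generic sketch of that pattern; the checkpoint filename and dictionary keys here are hypothetical, not glue-factory's exact format:

import torch

model = torch.nn.Linear(8, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Hypothetical checkpoint path and layout, for illustration only.
ckpt = torch.load("outputs/training/my_experiment/checkpoint_last.tar", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
start_epoch = ckpt["epoch"] + 1            # continue from the next epoch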

Fine-tuning / Loading Weights

To load weights from a previous experiment into a new one (e.g., transferring from homography pre-training to MegaDepth), set train.load_experiment in your config file or on the command line:

python -m gluefactory.train new_experiment \
    --conf ... \
    train.load_experiment=old_experiment_name
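
Equivalently, in the YAML config:

train:
    load_experiment: old_experiment_name    # initialize weights from this run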

TensorBoard Logging

Logs are written to outputs/training/<experiment_name>/. You can visualize them using TensorBoard:

tensorboard --logdir outputs/training/
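
To compare only one run, point TensorBoard at that experiment's directory instead:

tensorboard --logdir outputs/training/my_experiment/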