Training
The training engine is located in gluefactory/train.py. It handles model initialization, data loading, optimization, logging, and checkpointing.
Running Training
python -m gluefactory.train <experiment_name> --conf <path_to_config>
<experiment_name>: An identifier for the run. Results are saved to outputs/training/<experiment_name>/.
--conf: Path to a YAML config file or a registered config name.
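Configs typically define top-level data, model, and train sections, following the layout of the configs shipped with the repository. A minimal sketch under that assumption; the specific names and values below (homographies, two_view_pipeline, the extractor, the batch size, and the training hyperparameters) are illustrative, not recommended defaults:

data:
  name: homographies      # which dataset module to load
  batch_size: 32
model:
  name: two_view_pipeline # end-to-end pipeline: extractor + matcher
  extractor:
    name: extractors.superpoint_open
train:
  epochs: 40
  lr: 1.0e-4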
Key Training Features
Distributed Training
To train on multiple GPUs on a single node, use the --distributed flag:
python -m gluefactory.train my_experiment --conf ... --distributed
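Training then runs on the GPUs visible to the process. To restrict a run to specific devices, the standard CUDA_VISIBLE_DEVICES environment variable works as with any PyTorch job (a generic CUDA mechanism, not a gluefactory option):

CUDA_VISIBLE_DEVICES=0,1 python -m gluefactory.train my_experiment --conf ... --distributed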
Mixed Precision
Use --mixed_precision (or --mp) to enable automatic mixed precision (AMP) for lower memory usage and faster training:
python -m gluefactory.train my_experiment ... --mp float16
Restoring Training
To resume an interrupted run from its latest checkpoint:
python -m gluefactory.train my_experiment --restore
Fine-tuning / Loading Weights
To load weights from a previous experiment into a new one (e.g., transferring from homography pre-training to MegaDepth), set train.load_experiment in your config or CLI:
python -m gluefactory.train new_experiment \
--conf ... \
train.load_experiment=old_experiment_name
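Equivalently, the setting can live in the config file itself. A minimal sketch, with old_experiment_name standing in for the identifier of the earlier run:

train:
  load_experiment: old_experiment_name  # weights are loaded from this run's checkpoint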
TensorBoard Logging
Logs are written to outputs/training/<experiment_name>/. You can visualize them using TensorBoard:
tensorboard --logdir outputs/training/