# Running iCluto on a SLURM Cluster
This guide explains how to install and run iCluto on clusters using the SLURM workload manager, focusing on robust self-supervised training (DINO).
## Installation

### Installation on a SLURM Cluster (using Spack)

If your cluster uses Spack for package management, follow these steps to set up iCluto:
#### 1. Get Spack

Clone Spack into your home directory (if not already present) and load its environment:

```bash
git clone -c feature.manyFiles=true https://github.com/spack/spack.git
source spack/share/spack/setup-env.sh

# Check the installation
spack list
```
#### 2. Install and Load Python 3.11

```bash
spack install python@3.11
spack load python@3.11
```
#### 3. Install iCluto

Transfer the latest source distribution (`.tar.gz`) or wheel (`.whl`) for v0.1.9 from your laptop to the cluster, then install it in a dedicated virtual environment:

```bash
# Example: using the source distribution
tar -xzf icluto-0.1.9.tar.gz
cd icluto-0.1.9

# Create and activate a virtual environment
python -m venv venv_icluto
source venv_icluto/bin/activate

# Install iCluto and its CLI training scripts
pip install .
```
## Robust Training & Checkpointing

For long-running training tasks (such as DINO), iCluto provides a robust checkpointing system that saves the full training state: teacher/student weights, optimizer momentum, and LR scheduler state.
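To make concrete what "full training state" entails, here is a minimal sketch of such a checkpoint bundle. This is an illustration only, not iCluto's actual code: it uses plain `pickle` in place of whatever serializer iCluto uses, and the function names are hypothetical.

```python
import os
import pickle

def save_checkpoint(path, epoch, student_state, teacher_state,
                    optim_state, sched_state):
    """Bundle everything needed to resume training where it left off."""
    ckpt = {
        "epoch": epoch,            # last completed epoch (for --resume)
        "student": student_state,  # student network weights
        "teacher": teacher_state,  # EMA teacher weights (DINO)
        "optimizer": optim_state,  # includes momentum buffers
        "scheduler": sched_state,  # position in the LR schedule
    }
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:     # write-then-rename so a kill mid-save
        pickle.dump(ckpt, f)       # never corrupts the previous checkpoint
    os.replace(tmp, path)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```

The write-then-rename step matters on a cluster: if the job hits its walltime in the middle of a save, the previous checkpoint on disk stays intact.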
### Continuing Training

You can resume training from the last saved state using the `--resume` flag:

```bash
# Automatically find and resume from the latest checkpoint in the output folder
icluto-train-dino data/traces.npy --resume auto

# Resume from a specific checkpoint file
icluto-train-dino data/traces.npy --resume out/dino/run1/weights/dino_model_epoch50.pth
```
### SLURM Signal Handling

The training scripts are designed to catch SLURM's `SIGUSR1` signal, sent shortly before the job's walltime is reached. When this signal is received, the script exits gracefully, allowing a SLURM `trap` to handle job resubmission.
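The pattern behind this is small enough to sketch. The following is an assumption about how such a handler is typically structured, not iCluto's exact implementation: the signal handler only sets a flag, and the training loop checks that flag at a safe point (an epoch boundary), checkpoints, and exits cleanly.

```python
import signal

class GracefulShutdown:
    """Turn SIGUSR1 into a flag the training loop can poll."""

    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGUSR1, self._handle)

    def _handle(self, signum, frame):
        # Do almost nothing inside the handler; defer real work to the loop.
        self.requested = True

# Sketch of the loop structure (train_one_epoch / save_checkpoint are
# hypothetical placeholders):
#
# stopper = GracefulShutdown()
# for epoch in range(start_epoch, num_epochs):
#     train_one_epoch(epoch)
#     save_checkpoint(epoch)
#     if stopper.requested:
#         raise SystemExit(0)   # clean exit -> the sbatch EXIT trap resubmits
```

Keeping the handler trivial avoids interrupting a checkpoint write halfway: the expensive work always happens at a point the loop chose.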
### Automatic Job Resubmission

To handle job timeouts and pre-emptions automatically, you can use a submission script with `trap resubmit EXIT`.
### The STOP File Mechanism

To manually cancel a resubmitting job, run `touch STOP` in the output root directory. The script detects this file on exit and prevents further auto-resubmission.
### Example: `train_dino_resubmit.sbatch`

```bash
#!/bin/bash
#SBATCH --job-name=dino_train
#SBATCH --time=4:00:00
# Send SIGUSR1 60 s before the walltime is reached
#SBATCH --signal=USR1@60

# Paths must match your run configuration
OUTPUT_DIR="out/dino"
RUN_NAME="run1"

function resubmit {
    # The .finished marker is created by the Python script upon reaching the final epoch
    FINISHED_MARKER="$OUTPUT_DIR/$RUN_NAME/weights/dino_model.finished"
    STOP_FILE="$OUTPUT_DIR/STOP"
    if [ -f "$FINISHED_MARKER" ]; then
        echo "Training finished successfully."
    elif [ -f "$STOP_FILE" ]; then
        echo "Manual STOP detected. Cancelling auto-resubmit."
        rm "$STOP_FILE"
    else
        echo "Job not finished. Resubmitting..."
        sbatch --export=ALL "$0"
    fi
}
trap resubmit EXIT
set -e

# Load environment
source "$HOME/spack/share/spack/setup-env.sh"
spack load python@3.11
source "$HOME/icluto_staging/icluto-0.1.9/venv_icluto/bin/activate"

# Run training
icluto-train-dino data/traces.npy --output_dir "$OUTPUT_DIR" --resume auto
```

Note that the comment accompanying `--signal` sits on its own line: SLURM does not support inline comments on `#SBATCH` directives.
## Job Arrays & Sweeps

For hyperparameter sweeps (e.g., testing different patch sizes), use SLURM job arrays. See `scripts/submit_sweep.sh` for an example of how to iterate through a grid and launch multiple resubmitting tasks efficiently.
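As a rough sketch of the job-array idea (this is not the contents of `scripts/submit_sweep.sh`, and the `--patch_size` flag is hypothetical, standing in for whatever hyperparameter you are sweeping):

```shell
#!/bin/bash
#SBATCH --job-name=dino_sweep
#SBATCH --time=4:00:00
#SBATCH --array=0-2

# One grid value per array task (indices match --array above)
PATCH_SIZES=(8 16 32)
PS=${PATCH_SIZES[$SLURM_ARRAY_TASK_ID]}

# Separate output dirs so --resume auto picks up the right checkpoint per task
icluto-train-dino data/traces.npy \
    --output_dir "out/sweep/ps_${PS}" \
    --patch_size "$PS" \
    --resume auto
```

Giving each array task its own output directory keeps checkpoints, `.finished` markers, and STOP files from different grid points out of each other's way.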
## Monitoring with TensorBoard

To monitor your training progress live while it runs on the cluster:

- **On the cluster node:**

  ```bash
  tensorboard --logdir out/logs/ --port 6006
  ```

- **On your laptop (SSH port forwarding):**

  ```bash
  ssh -L 6006:localhost:6006 your-user@cluster-login-node
  ```

- **Local access:** navigate to http://localhost:6006.