Multi-GPU training
Some of the architectures in metatensor-models support multi-GPU training. In multi-GPU training, every batch of samples is split into smaller mini-batches, and each mini-batch is processed in parallel on a different GPU. The gradients computed on each device are then summed. This approach reduces the wall-clock time needed to train a model.
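As an illustration of this idea only (not of how the metatensor-models architectures implement it internally), the following is a minimal data-parallel training sketch using plain PyTorch DistributedDataParallel. It assumes the script is launched with torchrun, which sets RANK, WORLD_SIZE and LOCAL_RANK for each process; the dataset, model and hyperparameters are placeholders.

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # One process per GPU; torchrun (or an equivalent launcher) sets
    # RANK, WORLD_SIZE and LOCAL_RANK for every process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    # Toy data and model standing in for the real systems and targets.
    dataset = TensorDataset(torch.randn(1000, 8), torch.randn(1000, 1))
    # The DistributedSampler gives each process a disjoint shard of the
    # data, so the work of every epoch is split across the GPUs.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=25, sampler=sampler)

    model = DDP(torch.nn.Linear(8, 1).to(device), device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = torch.nn.MSELoss()

    for epoch in range(10):
        sampler.set_epoch(epoch)  # reshuffle consistently across processes
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            # During backward(), DDP all-reduces the gradients computed on
            # each GPU, so every replica takes the same optimizer step.
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()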
Here is a list of architectures supporting multi-GPU training:
SOAP-BPNN
SOAP-BPNN supports distributed multi-GPU training in SLURM environments. The options file to run distributed training with the SOAP-BPNN model looks like this:
seed: 42
device: cuda

architecture:
  name: soap_bpnn
  training:
    distributed: True
    batch_size: 25
    num_epochs: 100

training_set:
  systems:
    read_from: ethanol_reduced_100.xyz
    length_unit: angstrom
  targets:
    energy:
      key: energy
      unit: eV
      forces: on

test_set: 0.0
validation_set: 0.5
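The only option in this file that is specific to multi-GPU training is distributed: True in the training section; the remaining options are the same ones used for single-device training.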
The corresponding SLURM submission script would look like this:
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 2
#SBATCH --ntasks-per-node 2
#SBATCH --gpus-per-node 2
#SBATCH --cpus-per-task 8
#SBATCH --exclusive
#SBATCH --time=1:00:00
# load modules and/or virtual environments and/or containers here
srun mtt train options-distributed.yaml
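Assuming the script above is saved as, for example, submit-distributed.sh (the file name is arbitrary), the job is submitted as usual with sbatch submit-distributed.sh. The script requests one task per GPU (--ntasks-per-node 2 together with --gpus-per-node 2), i.e. one training process per device; these two values should be kept consistent when scaling to more GPUs or nodes.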