Multi-GPU training

Some of the architectures in metatensor-models support multi-GPU training. In multi-GPU training, every batch of samples is split into smaller mini-batches, and the computation for each mini-batch runs in parallel on a different GPU. The gradients obtained on each device are then summed. This approach reduces the time it takes to train models.
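
For illustration, here is a minimal, generic PyTorch sketch of the data-parallel pattern described above. It is not the internal implementation used by metatensor-models, and the model, dataset, and hyperparameters are placeholders: each process drives one GPU, sees its own shard of every batch, and the gradients are reduced across processes after every backward pass.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# one process per GPU; NCCL is the usual backend for CUDA devices
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# placeholder model and dataset standing in for a real architecture
model = DDP(torch.nn.Linear(16, 1).cuda(), device_ids=[local_rank])
dataset = TensorDataset(torch.randn(1000, 16), torch.randn(1000, 1))

# DistributedSampler gives each process a disjoint shard of the data
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=25, sampler=sampler)

optimizer = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.MSELoss()

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle the shards every epoch
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()  # DDP reduces the gradients across GPUs here
        optimizer.step()

dist.destroy_process_group()

A script like this would be launched with one process per GPU, for example with torchrun --nproc_per_node=2 train.py on a single node with two GPUs.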

Here is a list of architectures supporting multi-GPU training:

SOAP-BPNN

SOAP-BPNN supports distributed multi-GPU training in SLURM environments. The options file to run distributed training with the SOAP-BPNN model looks like this:

seed: 42
device: cuda

architecture:
  name: soap_bpnn
  training:
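    # run distributed data-parallel training across the available GPUs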
    distributed: True
    batch_size: 25
    num_epochs: 100

training_set:
  systems:
    read_from: ethanol_reduced_100.xyz
    length_unit: angstrom
  targets:
    energy:
      key: energy
      unit: eV
      forces: on

test_set: 0.0
validation_set: 0.5

and the SLURM submission script, which requests a single node with two GPUs and one task per GPU, would look like this:

#!/bin/bash
#SBATCH --nodes 1
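# one task per GPU: each task becomes one training process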
#SBATCH --ntasks 2
#SBATCH --ntasks-per-node 2
#SBATCH --gpus-per-node 2
#SBATCH --cpus-per-task 8
#SBATCH --exclusive
#SBATCH --time=1:00:00

# load modules and/or virtual environments and/or containers here

srun mtt train options-distributed.yaml
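
Under SLURM, srun starts one process per task, and those processes typically discover each other through SLURM environment variables. The snippet below is a rough, purely illustrative sketch of that pattern (metatensor-models handles this internally; it also assumes a single-node job like the one above):

import os
import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks (one per GPU)
local_rank = int(os.environ["SLURM_LOCALID"])  # rank of this task within its node

# For a single-node job the node itself can act as the rendezvous point;
# multi-node jobs need the hostname of the first node instead.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)  # bind this process to its own GPU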