Customize a Dataset Configuration¶
Overview¶
The main task in setting up a training procedure with metatrain
is to provide
files for training, validation, and testing datasets. Our system allows flexibility in
parsing data for training. Mandatory sections in the options.yaml
file include:
training_set
test_set
validation_set
Each section can follow a similar system, with shorthand methods available to simplify dataset definitions.
Minimal Configuration Example¶
Below is the simplest form of these sections:
training_set: "dataset.xyz"
test_set: 0.1
validation_set: 0.1
This configuration parses all information from dataset.xyz
, with 20% of the training
set randomly selected for testing and validation (10% each). The selected indices for
the training, validation and test subset will be available in the outputs
directory.
Expanded Configuration Format¶
The train script automatically expands the training_set
section into the following
format, which is also valid for initial input:
training_set:
systems:
read_from: dataset.xyz
reader: ase
length_unit: null
targets:
energy:
quantity: energy
read_from: dataset.xyz
reader: ase
key: energy
unit: null
forces:
read_from: dataset.xyz
reader: ase
key: forces
stress:
read_from: dataset.xyz
reader: ase
key: stress
virial: false
test_set: 0.1
validation_set: 0.1
Understanding the YAML Block¶
The training_set
is divided into sections systems
and targets
:
Systems Section¶
Describes the system data like positions and cell information.
- param read_from:
The file containing system data.
- param reader:
The reader library to use for parsing, guessed from the file extension if
null
or not provided.- param length_unit:
The unit of lengths, optional but highly recommended for running simulations.
A single string in this section automatically expands, using the string as the
read_from
parameter.
Note
metatrain
does not convert units during training or evaluation. Units are
only required if model should be used to run MD simulations.
Targets Section¶
Allows defining multiple target sections, each with a unique name.
Commonly, a section named
energy
should be defined, which is essential for running molecular dynamics simulations. For theenergy
section gradients likeforces
andstress
are enabled by default.Other target sections can also be defined, as long as they are prefixed by
mtt::
. For example,mtt::free_energy
. In general, all targets that are not standard outputs ofmetatensor.torch.atomistic
(see https://docs.metatensor.org/latest/atomistic/outputs.html) should be prefixed bymtt::
.
Target section parameters include:
- param quantity:
The target’s quantity (e.g.,
energy
,dipole
). Currently onlyenergy
is supported.- param read_from:
The file for target data, defaults to the
systems.read_from
file if not provided.- param reader:
The reader library to use for parsing, guessed from the file extension if
null
or not provided.- param key:
The key for reading from the file, defaulting to the target section’s name if not provided.
- param unit:
The unit of the target, optional but highly recommended for running simulations.
- param forces:
Gradient sections. See Gradient Section for parameters.
- param stress:
Gradient sections. See Gradient Section for parameters.
- param virial:
Gradient sections. See Gradient Section for parameters.
A single string in a target section automatically expands, using the string as the
read_from
parameter.
Gradient Section¶
Each gradient section (like forces
or stress
) has similar parameters:
- param read_from:
The file for gradient data.
- param reader:
The reader library to use for parsing, guessed from the file extension if
null
or not provided.:param key: The key for reading from the file.
A single string in a gradient section automatically expands, using the string as the
read_from
parameter.
Sections set to true
or on
automatically expand with default parameters. A
warning is raised if requisite data for a gradient is missing, but training proceeds
without them.
Note
Unknown keys are ignored and not deleted in all sections during dataset parsing.
Multiple Datasets¶
For some applications, it is required to provide more than one dataset for model
training. metatrain
supports stacking several datasets together using the
YAML
list syntax, which consists of lines beginning at the same indentation level
starting with a "- "
(a dash and a space)
training_set:
- systems:
read_from: dataset_0.xyz
length_unit: angstrom
targets:
energy:
quantity: energy
key: my_energy_label0
unit: eV
- systems:
read_from: dataset_1.xyz
length_unit: angstrom
targets:
energy:
quantity: energy
key: my_energy_label1
unit: eV
free-energy:
quantity: energy
key: my_free_energy
unit: hartree
test_set: 0.1
validation_set: 0.1
The required test and validation splits are performed consistently for each element
element in training_set
The length_unit
has to be the same for each element of the list. If target section
names are the same for different elements of the list, their unit also has to be the
same. In the the example above the target section energy
exists in both list
elements and therefore has the the same unit eV
. The target section free-energy
only exists in the second element and its unit does not have to be the same as in the
first element of the list.
Warning
Even though parsing several datasets is supported by the library, it may not work with every architecture. Check your desired architecture if they support multiple datasets.
In the next tutorials we explain and show how to set some advanced global training parameters.