.. _dataset_conf:

Customize a Dataset Configuration
=================================
Overview
--------
The main task in setting up a training procedure with ``metatrain`` is to provide
files for training, validation, and testing datasets. Our system allows flexibility in
parsing data for training. Mandatory sections in the ``options.yaml`` file include:

- ``training_set``
- ``test_set``
- ``validation_set``

Each section can follow a similar system, with shorthand methods available to
simplify dataset definitions.

Minimal Configuration Example
-----------------------------
Below is the simplest form of these sections:

.. code-block:: yaml

    training_set: "dataset.xyz"
    test_set: 0.1
    validation_set: 0.1

This configuration parses all information from ``dataset.xyz``, with 20% of the training
set randomly selected for testing and validation (10% each). The selected indices for
the training, validation and test subset will be available in the ``outputs`` directory.

Expanded Configuration Format
-----------------------------
The train script automatically expands the ``training_set`` section into the following
format, which is also valid for initial input:

.. code-block:: yaml

    training_set:
        systems:
            read_from: dataset.xyz
            reader: ase
            length_unit: null
        targets:
            energy:
                quantity: energy
                read_from: dataset.xyz
                reader: ase
                key: energy
                unit: null
                forces:
                    read_from: dataset.xyz
                    reader: ase
                    key: forces
                stress:
                    read_from: dataset.xyz
                    reader: ase
                    key: stress
                virial: false
    test_set: 0.1
    validation_set: 0.1

Understanding the YAML Block
----------------------------
The ``training_set`` is divided into sections ``systems`` and ``targets``:

Systems Section
^^^^^^^^^^^^^^^
Describes the system data like positions and cell information.

:param read_from: The file containing system data.
:param reader: The reader library to use for parsing, guessed from the file extension if
    ``null`` or not provided.
:param length_unit: The unit of lengths, optional but highly recommended for running
    simulations.

A single string in this section automatically expands, using the string as the
``read_from`` parameter.

.. note::

   ``metatrain`` does not convert units during training or evaluation. Units are
   only required if model should be used to run MD simulations.

Targets Section
^^^^^^^^^^^^^^^
Allows defining multiple target sections, each with a unique name.

- Commonly, a section named ``energy`` should be defined, which is essential for running
  molecular dynamics simulations. For the ``energy`` section gradients like ``forces``
  and ``stress`` are enabled by default.
- Other target sections can also be defined, as long as they are prefixed by
  ``mtt::``. For example, ``mtt::free_energy``. In general, all targets that are
  not standard outputs of ``metatomic`` (see
  https://docs.metatensor.org/metatomic/latest/outputs/index.html) should be
  prefixed by ``mtt::``.

Target section parameters include:

:param quantity: The target's quantity (e.g., ``energy``, ``dipole``). Currently only
    ``energy`` is supported.
:param read_from: The file for target data, defaults to the ``systems.read_from``
  file if not provided.
:param reader: The reader library to use for parsing, guessed from the file extension if
    ``null`` or not provided.
:param key: The key for reading from the file, defaulting to the target section's name
  if not provided.
:param unit: The unit of the target, optional but highly recommended for running
    simulations.
:param forces: Gradient sections. See :ref:`gradient-section` for parameters.
:param stress: Gradient sections. See :ref:`gradient-section` for parameters.
:param virial: Gradient sections. See :ref:`gradient-section` for parameters.

A single string in a target section automatically expands, using the string as the
``read_from`` parameter.

.. _gradient-section:

Gradient Section
^^^^^^^^^^^^^^^^
Each gradient section (like ``forces`` or ``stress``) has similar parameters:

:param read_from: The file for gradient data.
:param reader: The reader library to use for parsing, guessed from the file extension if
    ``null`` or not provided.:param key: The key for reading from the file.

A single string in a gradient section automatically expands, using the string as the
``read_from`` parameter.

Sections set to ``true`` or ``on`` automatically expand with default parameters. A
warning is raised if requisite data for a gradient is missing, but training proceeds
without them.

.. note::

   Unknown keys are ignored and not deleted in all sections during dataset parsing.

Multiple Datasets
-----------------
For some applications, it is required to provide more than one dataset for model
training. ``metatrain`` supports stacking several datasets together using the
``YAML`` list syntax, which consists of lines beginning at the same indentation level
starting with a ``"- "`` (a dash and a space)


.. code-block:: yaml

    training_set:
        - systems:
              read_from: dataset_0.xyz
              length_unit: angstrom
          targets:
              energy:
                  quantity: energy
                  key: my_energy_label0
                  unit: eV
        - systems:
              read_from: dataset_1.xyz
              length_unit: angstrom
          targets:
              energy:
                  quantity: energy
                  key: my_energy_label1
                  unit: eV
              free-energy:
                  quantity: energy
                  key: my_free_energy
                  unit: hartree
    test_set: 0.1
    validation_set: 0.1

The required test and validation splits are performed consistently for each element
element in ``training_set``

The ``length_unit`` has to be the same for each element of the list. If target section
names are the same for different elements of the list, their unit also has to be the
same. In the the example above the target section ``energy`` exists in both list
elements and therefore has the the same unit ``eV``. The target section ``free-energy``
only exists in the second element and its unit does not have to be the same as in the
first element of the list.

Typically the global atomic types the the model is defined for are inferred from the
training and validation datasets. Sometimes, due to shuffling of datasets with low
representation of some types, these datasets may not contain all atomic types that you
want to use in your model. To explicitly control the atomic types the model is defined
for, specify the ``atomic_types`` key in the ``architecture`` section of the options
file:

.. code-block:: yaml

    architecture:
        name: pet
        model:
            cutoff: 5.0
        training:
            batch_size: 32
            epochs: 100
        atomic_types: [1, 6, 7, 8, 16]  # i.e. for H, C, N, O, S

.. warning::

   Even though parsing several datasets is supported by the library, it may not
   work with every architecture. Check your :ref:`desired architecture
   <available-architectures>` if they **support multiple datasets**.

In the next tutorials we explain and show how to set some advanced global training
parameters.

Datasets requiring additional data
----------------------------------
Some targets require additional data to be passed to the loss function for training.
For example, training a model to predict the polarization for extended systems under
periodic boundary conditions might require the quantum of polarization to be provided
for each system in the dataset.

``metatrain`` supports passing additional data in the ``options.yaml`` file.
For example, if you want to train a polarization model, you can add the following
section to your ``options.yaml`` file:

.. code-block:: yaml

    training_set:
        systems:
            read_from: dataset_0.xyz
            length_unit: angstrom
        targets:
            mtt::polarization:
                read_from: polarization.mts
        extra_data:
            polarization_quantum:
                read_from: polarization_quantum.mts

.. warning::

   While the ``extra_data`` section can always be present, it will typically be ignored
   unless using specific loss functions. If the loss function you picked does not
   support the extra data, it will be ignored.