PET¶
Metatrain training interface to the Point Edge Transformer (PET) [1] architecture.
Installation¶
To install the package, you can run the following command in the root directory of the repository:
pip install metatrain[pet]
This will install the package with the PET dependencies.
Default Hyperparameters¶
The default hyperparameters for the PET model are:
architecture:
  name: pet
  model:
    CUTOFF_DELTA: 0.2
    AVERAGE_POOLING: False
    TRANSFORMERS_CENTRAL_SPECIFIC: False
    HEADS_CENTRAL_SPECIFIC: False
    ADD_TOKEN_FIRST: True
    ADD_TOKEN_SECOND: True
    N_GNN_LAYERS: 3
    TRANSFORMER_D_MODEL: 128
    TRANSFORMER_N_HEAD: 4
    TRANSFORMER_DIM_FEEDFORWARD: 512
    HEAD_N_NEURONS: 128
    N_TRANS_LAYERS: 3
    ACTIVATION: silu
    USE_LENGTH: True
    USE_ONLY_LENGTH: False
    R_CUT: 5.0
    R_EMBEDDING_ACTIVATION: False
    COMPRESS_MODE: mlp
    BLEND_NEIGHBOR_SPECIES: False
    AVERAGE_BOND_ENERGIES: False
    USE_BOND_ENERGIES: True
    USE_ADDITIONAL_SCALAR_ATTRIBUTES: False
    SCALAR_ATTRIBUTES_SIZE: null
    TRANSFORMER_TYPE: PostLN # PostLN or PreLN
    USE_LONG_RANGE: False
    K_CUT: null # should be float; only used when USE_LONG_RANGE is True
    K_CUT_DELTA: null
    DTYPE: float32 # float32 or float16 or bfloat16
    N_TARGETS: 1
    TARGET_INDEX_KEY: target_index
    RESIDUAL_FACTOR: 0.5
    USE_ZBL: False
  training:
    USE_LORA_PEFT: False
    LORA_RANK: 4
    LORA_ALPHA: 0.5
    INITIAL_LR: 1e-4
    EPOCH_NUM_ATOMIC: 1000000000
    EPOCHS_WARMUP_ATOMIC: 100000000
    SCHEDULER_STEP_SIZE_ATOMIC: 500000000 # structural version is called "SCHEDULER_STEP_SIZE"
    GLOBAL_AUG: True
    SLIDING_FACTOR: 0.7
    ATOMIC_BATCH_SIZE: 850 # structural version is called "STRUCTURAL_BATCH_SIZE"
    BALANCED_DATA_LOADER: False # if True, use DynamicBatchSampler from torch_geometric
    MAX_TIME: 234000
    ENERGY_WEIGHT: 0.1 # only used when fitting MLIP
    MULTI_GPU: False
    RANDOM_SEED: 0
    CUDA_DETERMINISTIC: False
    MODEL_TO_START_WITH: null
    ALL_SPECIES_PATH: null
    SELF_CONTRIBUTIONS_PATH: null
    SUPPORT_MISSING_VALUES: False
    USE_WEIGHT_DECAY: False
    WEIGHT_DECAY: 0.0
    DO_GRADIENT_CLIPPING: False
    GRADIENT_CLIPPING_MAX_NORM: null # must be overwritten if DO_GRADIENT_CLIPPING is True
    USE_SHIFT_AGNOSTIC_LOSS: False # only used when fitting general target. Primary use case: EDOS
    ENERGIES_LOSS: per_structure # per_structure or per_atom
    CHECKPOINT_INTERVAL: 100
Tuning Hyperparameters¶
1. Set R_CUT so that there are about 20-30 neighbors on average for your dataset.
2. Fit the model with the default values for all the other hyperparameters.
3. Ensure that you fit the model long enough for the error to converge. (If not, you can always continue fitting the model from the last checkpoint.)
4. [Optional, recommended for large datasets] Increase the scheduler step size and refit the model from scratch until convergence. Repeat this for several progressively larger values of the scheduler step size.
5. [Optional, this step aims to create a lighter and faster model, not to increase accuracy.] Set N_TRANS_LAYERS to 2 instead of 3, and repeat steps 3) and 4). If step 4) was already done for the default N_TRANS_LAYERS value of 3, you can probably reuse the converged scheduler step size. The resulting model would be about 1.5 times faster than the default one, hopefully with very little or no deterioration of the accuracy.
6. [Optional, quite laborious, 99% you don't need this] Read sections 6 and 7 of the PET paper, which discuss the architecture, the main hyperparameters, and an ablation study illustrating their impact on the model's accuracy, and design your own experiments.
More details:¶
There are two significant groups of hyperparameters controlling PET fits. The first group consists of the hyperparameters related to the model architecture itself, such as the number of layers, type of activation function, etc. The second group consists of settings that control how the fitting is done, such as batch size, the total number of epochs, learning rate, parameters of the learning rate scheduler, and so on.
Within conventional wisdom originating from traditional models, such as linear and kernel regression, the second group of hyperparameters that controls the optimizer’s behavior might seem unimportant. Indeed, when fitting linear or kernel models, the exact value of the optimum is usually achieved by linear algebra methods, and thus, the particular choice of optimizer has little importance.
However, with deep neural networks, the situation is drastically different. The exact minimum of the loss is typically never achieved; instead, the model asymptotically converges to it during fitting. It is essential to ensure that the total number of epochs is sufficient for the model to approach the optimum closely, thus achieving good accuracy.
In the case of PET, there is only one hyperparameter that MUST be manually adjusted for each new dataset: the cutoff radius. The selected cutoff significantly impacts both the model's accuracy and its fitting/inference times, so the model is very sensitive to this hyperparameter. All other hyperparameters can be safely set to their default values. The next reasonable step (after fitting with the default settings), especially for large datasets, is to increase the duration of fitting and see whether that improves the accuracy of the obtained model.
Selection of R_CUT¶
A good starting point is to select a cutoff radius that ensures about 20-30 neighbors on average. This can be done by analyzing the neighbor lists for different cutoffs before launching the training script, for example with a neighbor list constructor in Python, as sketched below.
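The following is a minimal sketch of such an analysis, assuming ASE is installed and the training structures are stored in a hypothetical extended XYZ file named dataset.xyz:

    # Estimate the average number of neighbors per atom for a few candidate cutoffs.
    import ase.io
    from ase.neighborlist import neighbor_list

    frames = ase.io.read("dataset.xyz", ":")  # hypothetical dataset file

    for r_cut in (4.0, 5.0, 6.0):
        per_frame = []
        for atoms in frames:
            i = neighbor_list("i", atoms, r_cut)   # one entry per (center, neighbor) pair
            per_frame.append(len(i) / len(atoms))  # average neighbors per atom in this frame
        print(f"R_CUT = {r_cut:.1f} Å: ~{sum(per_frame) / len(per_frame):.1f} neighbors per atom")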
For finite configurations, such as small molecules in COLL/QM9/rmd17 datasets, it makes
sense to select R_CUT
large enough to encompass the whole molecule. For instance, it
can be set to 100 Å, as there are no numerical instabilities for arbitrarily large
cutoffs.
The hyperparameter for the cutoff radius is called R_CUT.
Selection of fitting duration¶
The second most important group of settings is the one that adjusts the fitting duration of the model. Unlike specifying a dataset-specific cutoff radius, this step is optional since reasonable results can be obtained with the default fitting duration. The time required to fit the model is a complex function of the model’s size, the dataset’s size, and the complexity of the studied interatomic interactions. The default value might be insufficient for large datasets. If the model is still underfit after the predefined number of epochs, the fitting procedure can be continued by relaunching the fitting script.
However, the total number of epochs is only part of the equation. Another key aspect is
the rate at which the learning rate decays. We use StepLR as a
learning rate scheduler. This scheduler reduces the learning rate by a factor of
gamma
(new_learning_rate = old_learning_rate * gamma
) every step_size
epochs. In the current implementation of PET, gamma
is fixed at 0.5, meaning that
the learning rate is halved every step_size
epochs.
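For illustration, this is roughly how such a StepLR schedule behaves in PyTorch. This is a standalone sketch with placeholder values (a dummy parameter and optimizer, step_size of 100), not metatrain's actual training loop:

    import torch

    # dummy parameter and optimizer, just to attach a scheduler to
    params = [torch.nn.Parameter(torch.zeros(1))]
    optimizer = torch.optim.Adam(params, lr=1e-4)  # cf. INITIAL_LR
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)

    for epoch in range(400):
        optimizer.step()   # training would happen here
        scheduler.step()   # lr is halved every 100 epochs: 1e-4 -> 5e-5 -> 2.5e-5 -> ...

    print(scheduler.get_last_lr())  # [6.25e-06] after 400 epochs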
If step_size
is set too small, the learning rate will decrease to very low values
too quickly, hindering the convergence of PET. Prolonged fitting under these conditions
will be ineffective due to the nearly zero learning rate. Therefore, achieving complete
convergence requires not only a sufficient number of epochs but also an appropriately
large step_size
. For typical moderately sized datasets, the default value should
suffice. However, for particularly large datasets, increasing step_size
may be
necessary to ensure complete convergence. The hyperparameter controlling the
step_size
of the StepLR learning rate scheduler is called SCHEDULER_STEP_SIZE
.
It is worth noting that the default step_size is quite large. Thus, it is normal if no learning rate decrease occurs during the first day of fitting, or even the first couple of days, when training on a relatively slow GPU such as a V100. In addition, for some datasets the fitting might take longer than for others (related to inhomogeneous densities), which can further postpone the first learning rate decrease.
The convergence discussed above, especially with respect to the total duration of fitting, is best checked on log-log plots of the validation error as a function of the epoch number; it is usually hard to extract useful insights from the raw logged values.
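For example, such a plot can be produced with matplotlib. The error values below are synthetic placeholders; in practice, the per-epoch validation errors would be parsed from the training logs:

    import matplotlib.pyplot as plt

    epochs = list(range(1, 1001))
    val_error = [0.5 * epoch ** -0.3 for epoch in epochs]  # synthetic placeholder curve

    plt.loglog(epochs, val_error)
    plt.xlabel("epoch")
    plt.ylabel("validation error")
    plt.show()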
For hyperparameters like SCHEDULER_STEP_SIZE
, EPOCH_NUM
, BATCH_SIZE
, and
EPOCHS_WARMUP
, either normal or atomic versions can be specified.
SCHEDULER_STEP_SIZE
was discussed above; EPOCH_NUM
represents the total number
of epochs, and BATCH_SIZE
is the number of structures sampled in each minibatch for
a single step of stochastic gradient descent. The atomic versions are termed
SCHEDULER_STEP_SIZE_ATOMIC
, EPOCH_NUM_ATOMIC
, BATCH_SIZE_ATOMIC
, and
EPOCHS_WARMUP_ATOMIC
. The motivation for the atomic versions is to improve the
transferability of default hyperparameters across heterogeneous datasets. For instance,
using the same batch size for datasets with structures of very different sizes makes
little sense. If one dataset contains molecules with 10 atoms on average and another
contains nanoparticles with 1000 atoms, it makes sense to use a 100 times larger batch
size in the first case. If BATCH_SIZE_ATOMIC
is specified, the normal batch size is
computed as BATCH_SIZE = BATCH_SIZE_ATOMIC /
(average_number_of_atoms_in_the_training_dataset)
. Similar logic applies to
SCHEDULER_STEP_SIZE,
EPOCH_NUM,
and EPOCHS_WARMUP.
In these cases, the normal versions are obtained by dividing the atomic value by the total number of atoms in the training dataset. All default values are given as atomic versions for better transferability across various datasets.
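As a concrete illustration of these conversion rules (a sketch with made-up numbers, not metatrain's actual code):

    # Illustration of the atomic -> structural conversion rules described above.
    n_atoms_per_structure = [10, 12, 8, 11, 9]                 # hypothetical training set
    total_atoms = sum(n_atoms_per_structure)                   # used for EPOCH_NUM, SCHEDULER_STEP_SIZE, EPOCHS_WARMUP
    average_atoms = total_atoms / len(n_atoms_per_structure)   # used for the batch size

    ATOMIC_BATCH_SIZE = 850
    SCHEDULER_STEP_SIZE_ATOMIC = 500_000_000

    batch_size = ATOMIC_BATCH_SIZE / average_atoms                   # structures per minibatch
    scheduler_step_size = SCHEDULER_STEP_SIZE_ATOMIC / total_atoms   # epochs between lr decays
    print(round(batch_size), round(scheduler_step_size))            # 85 10000000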
To increase the step size of the learning rate scheduler by, for example, 2 times, take
the default value for SCHEDULER_STEP_SIZE_ATOMIC
from the default hypers and
specify a value that’s twice as large.
It is worth noting that the stopping criterion of PET is either exceeding the maximum
number of epochs (specified by EPOCH_NUM
or EPOCH_NUM_ATOMIC
) or exceeding the
specified maximum fitting time (controlled by the hyperparameter MAX_TIME
). By
default, the second criterion is used, with the default number of epochs set nearly to
infinity, while the default maximum time is set to 65 hours (MAX_TIME: 234000 seconds).
Lightweight Model¶
The default hyperparameters were selected with one goal in mind: to maximize the
probability of achieving the best accuracy on a typical moderate-sized dataset. As a
result, some default hyperparameters might be excessive, meaning they could be adjusted
to significantly increase the model’s speed with minimal impact on accuracy. For
practical use, especially when conducting massive calculations where model speed is
crucial, it may be beneficial to set N_TRANS_LAYERS
to 2
instead of the default
value of 3
. The N_TRANS_LAYERS
hyperparameter controls the number of transformer
layers in each message-passing block (for more details, see
Pozdnyakov and Ceriotti [1]). This adjustment would result in a model that is
about 1.5 times more lightweight and faster, with an expected minimal deterioration in
accuracy.
Description of the Architecture¶
This section contains a simplified description of the architecture, covering the most important macro-organization without all the details and nuances.
PET is a graph neural network (GNN) architecture featuring
N_GNN_LAYERS
message-passing layers. At each layer, messages are exchanged
between all atoms within a distance R_CUT
from each other. The functional
form of each layer is an arbitrarily deep transformer applied individually to
each atom. Atomic environments are constructed around each atom, defined by all
neighbors within R_CUT
. Each neighbor sends a message to the central atom,
with each message being a token of fixed size TRANSFORMER_D_MODEL
.
These tokens are processed by a transformer, which performs a permutationally
equivariant sequence-to-sequence transformation. The output sequence is then
treated as outbound messages from the central atom to all neighbors. Consequently,
for a model with N_GNN_LAYERS
layers and a system with N
atoms, there are
N_GNN_LAYERS
individual transformers with distinct weights, each independently
invoked N
times, resulting in N_GNN_LAYERS * N
transformer runs. The
number of input tokens for each transformer run is determined by the number of
neighbors of the central atom.
In addition to an input message from a neighboring atom, geometric information
about the displacement vector r_ij
from the central atom to the corresponding
neighbor is incorporated into the token. After each message-passing layer, all
output messages are fed into a head (individual for each message-passing layer),
implemented as a shallow MLP, to produce a contribution to the total prediction.
The total prediction is computed
as the sum of all head outputs over all message-passing layers and all messages.
This architecture is rigorously invariant with respect to translations because it uses displacement vectors that do not change if both the central atom and a neighbor are rigidly shifted. It is invariant with respect to permutations of identical atoms because the transformer defines a permutationally covariant sequence-to-sequence transformation, and the sum over the contributions from all edges yields an overall invariant energy prediction. However, it is not rotationally invariant since it operates with the raw Cartesian components of displacement vectors.
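To make the macro-organization above more concrete, here is a deliberately simplified, self-contained PyTorch sketch. It is not the real PET implementation: species embeddings, cutoff smoothing, the delivery of output messages to the corresponding neighbors' environments, and many other details are omitted or replaced by placeholders; it only illustrates how the transformers, r_ij embeddings, and heads are arranged.

    import torch
    import torch.nn as nn

    D_MODEL, N_GNN_LAYERS, N_TRANS_LAYERS = 128, 3, 3

    def make_transformer():
        # a transformer applied to the tokens of one atomic environment
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=4, dim_feedforward=512, batch_first=True)
        return nn.TransformerEncoder(layer, num_layers=N_TRANS_LAYERS)

    def make_head():
        # shallow MLP producing a per-message contribution to the prediction
        return nn.Sequential(nn.Linear(D_MODEL, 128), nn.SiLU(), nn.Linear(128, 1))

    class ToyPET(nn.Module):
        def __init__(self):
            super().__init__()
            # one transformer and one head per message-passing layer, with distinct weights
            self.transformers = nn.ModuleList(
                [make_transformer() for _ in range(N_GNN_LAYERS)])
            self.heads = nn.ModuleList([make_head() for _ in range(N_GNN_LAYERS)])
            self.r_embedding = nn.Linear(3, D_MODEL)  # embeds the displacement vectors r_ij

        def forward(self, r_ij, messages=None):
            # r_ij: [n_atoms, n_neighbors, 3] displacement vectors (fixed neighbor
            # count, no masking or cutoff smoothing, for brevity)
            if messages is None:
                messages = torch.zeros(*r_ij.shape[:2], D_MODEL)
            prediction = 0.0
            for transformer, head in zip(self.transformers, self.heads):
                tokens = messages + self.r_embedding(r_ij)      # combine messages with geometry
                messages = transformer(tokens)                  # one transformer run per atom
                prediction = prediction + head(messages).sum()  # sum head outputs over all edges
                # NOTE: in the real architecture, the output messages are delivered to the
                # corresponding neighbors' environments before the next layer; that index
                # bookkeeping is omitted here.
            return prediction

    model = ToyPET()
    print(model(torch.randn(5, 8, 3)))  # 5 atoms with 8 neighbors each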
Architecture Hyperparameters¶
Warning
While PET supports CPU training, it is highly recommended to use a CUDA GPU for significantly faster training. CPU training can be very slow.
- RANDOM_SEED: random seed
- CUDA_DETERMINISTIC: whether to apply PyTorch reproducibility settings
- MULTI_GPU: use multi-GPU training (on one node) using DataParallel from PyTorch Geometric
- R_CUT: cutoff radius
- CUTOFF_DELTA: width of the transition region of the cutoff function used by PET to ensure smoothness with respect to the (dis)appearance of atoms at the cutoff sphere
- GLOBAL_AUG: whether to use global augmentation or a local one, rotating atomic environments independently
- USE_ENERGIES: whether to use energies for training
- USE_FORCES: whether to use forces for training
- SLIDING_FACTOR: sliding factor for the exponential sliding averages of the energy and force MSEs in our combined loss definition
- ENERGY_WEIGHT: $w_{E}$, the dimensionless energy weight in our combined loss definition
- N_GNN_LAYERS: number of message-passing blocks
- TRANSFORMER_D_MODEL: denoted as d_{pet} in the main text of the paper
- TRANSFORMER_N_HEAD: number of attention heads of each transformer
- TRANSFORMER_DIM_FEEDFORWARD: feedforward dimensionality of each transformer
- HEAD_N_NEURONS: number of neurons in the intermediate layers of the head MLPs
- N_TRANS_LAYERS: number of layers of each transformer
- ACTIVATION: activation function used everywhere
- INITIAL_LR: initial learning rate
- MAX_TIME: maximal time to train the model, in seconds
For parameters such as EPOCH_NUM, the user can specify either the normal EPOCH_NUM or EPOCH_NUM_ATOMIC. If the latter is specified, the normal EPOCH_NUM is computed as EPOCH_NUM_ATOMIC / (total number of atoms in the training dataset). Similarly defined are:

- SCHEDULER_STEP_SIZE_ATOMIC: step size of the StepLR learning rate scheduler
- EPOCHS_WARMUP_ATOMIC: linear warmup time

For the batch size, the normal version is computed as BATCH_SIZE_ATOMIC / (average number of atoms in the structures of the training dataset):

- ATOMIC_BATCH_SIZE: batch size
- USE_LENGTH: whether to explicitly use the length in the r embedding
- USE_ONLY_LENGTH: use only the length in the r embedding (used to get auxiliary, intrinsically invariant models)
- USE_BOND_ENERGIES: whether to use bond contributions to energies
- AVERAGE_BOND_ENERGIES: whether to average bond contributions instead of summing them
- BLEND_NEIGHBOR_SPECIES: if True, explicitly encode embeddings of neighbor species into the overall embeddings in each message-passing block; if False, specify the very first input messages as embeddings of neighbor species instead
- R_EMBEDDING_ACTIVATION: whether to apply an activation after computing the r embedding with a linear layer
- COMPRESS_MODE: if "mlp", get the overall embedding with an MLP; if "linear", use a simple linear compression instead
- ADD_TOKEN_FIRST: whether to add a token associated with the central atom for the very first message-passing block
- ADD_TOKEN_SECOND: whether to add a token associated with the central atom for all the other blocks (to be renamed in the future)
- AVERAGE_POOLING: if not using a central token, controls whether summation or average pooling is used
- USE_ADDITIONAL_SCALAR_ATTRIBUTES: whether to use additional scalar attributes such as collinear spins
- SCALAR_ATTRIBUTES_SIZE: dimensionality of the additional scalar attributes