GOLF: Gradual Optimization Learning for Conformational Energy Minimization (ICLR 2024 Poster)



This repository is the official implementation of the paper:

Tsypin, A., Ugadiarov, L. A., Khrabrov, K., Telepov, A., Rumiantsev, E., Skrynnik, A., ... & Kadurin, A. (2023, October).
Gradual Optimization Learning for Conformational Energy Minimization.
In The Twelfth International Conference on Learning Representations.

Experiments and results on the SPICE dataset can be found in the "GOLF-SPICE" branch.

Main results:

| Model | $\overline{\text{pct}}_T$ (%) $\uparrow$ | $\text{pct}_{\text{div}}$ (%) $\downarrow$ | $\overline{E^{\text{res}}}_T$ (kcal/mol) $\downarrow$ | $\text{pct}_{\text{success}}$ (%) $\uparrow$ | COV (%) $\uparrow$ | MAT (Å) $\downarrow$ |
|---|---|---|---|---|---|---|
| RDKit | $84.92 \pm 10.6$ | $\mathbf{0.05}$ | $5.5$ | $4.1$ | $62.24$ | $0.509$ |
| Torsional Diffusion | $25.63 \pm 21.4$ | $46.9$ | $33.8$ | $0.0$ | $11.3$ | $1.333$ |
| ConfOpt | $36.48 \pm 23.0$ | $84.5$ | $27.9$ | $0.2$ | $19.88$ | $1.05$ |
| Uni-Mol+ | $62.20 \pm 17.2$ | $2.8$ | $18.6$ | $0.2$ | $68.79$ | $0.407$ |
| $f^{\text{baseline}}$ | $76.8 \pm 22.4$ | $7.5$ | $8.6$ | $8.2$ | $65.22$ | $0.482$ |
| $f^{\text{rdkit}}$ | $93.09 \pm 11.9$ | $3.8$ | $2.8$ | $35.4$ | $71.6$ | $0.426$ |
| $f^{\text{traj-10k}}$ | $95.3 \pm 7.3$ | $4.5$ | $2.0$ | $37.0$ | $70.55$ | $0.440$ |
| $f^{\text{traj-100k}}$ | $96.3 \pm 9.8$ | $2.9$ | $1.5$ | $52.7$ | $71.43$ | $0.432$ |
| $f^{\text{traj-500k}}$ | $98.4 \pm 9.2$ | $1.8$ | $\mathbf{0.5}$ | $73.4$ | $72.15$ | $0.442$ |
| $f^{\text{GOLF-1k}}$ | $98.5 \pm 5.3$ | $3.6$ | $1.1$ | $62.9$ | $76.54$ | $\mathbf{0.349}$ |
| $f^{\text{GOLF-10k}}$ | $\mathbf{99.4 \pm 5.2}$ | $2.4$ | $\mathbf{0.5}$ | $\mathbf{77.3}$ | $\mathbf{76.84}$ | $0.355$ |

Training the NNP baseline

  1. Set up environment on the GPU machine.
    # On the GPU machine
    ./scripts/setup_gpu_env.sh
    conda activate GOLF_schnetpack
    pip install -r requirements.txt
    
  2. Download the training dataset $\mathcal{D}_0$ and the evaluation dataset $\mathcal{D}_{\text{test}}$ (a quick way to inspect the downloaded files is sketched after this list).
    mkdir data && cd data
    wget https://sc.link/FpEvS -O D-0.db
    wget https://sc.link/W6RUA -O D-test.db
    cd ../
    
  3. Train the baseline PaiNN model.
    cd scripts/train
    ./run_training_baseline.sh <cuda_device_number>
    
    Running this script will create a folder in the specified log_dir directory (we use "./results" in our configs and scripts). The name of the folder is set by the exp_name hyperparameter. The folder will contain checkpoints, a metrics file, and a config file with the hyperparameters.
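To sanity-check the downloaded data before training, you can open the .db files directly. A minimal sketch, assuming they are ASE-style SQLite databases (schnetpack, installed above, uses ASE's format); this snippet is not part of the repo's tooling:

    # Quick inspection sketch (assumption: the .db files are ASE SQLite databases).
    from ase.db import connect

    db = connect("data/D-0.db")
    print(db.count(), "conformations in D-0")
    row = next(db.select(limit=1))   # grab one record
    atoms = row.toatoms()            # convert to an ase.Atoms object
    print(atoms.get_chemical_formula(), atoms.positions.shape)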

Training the NNP on optimization trajectories

  1. Set up the environment on the GPU machine as in the first section.
  2. Download the optimization trajectory datasets.
    cd data
    wget https://sc.link/ZQRiV -O D-traj-10k.db
    wget https://sc.link/Z0ebo -O D-traj-100k.db
    wget https://sc.link/hj1JX -O D-traj-500k.db
    cd ../
    
  3. Train PaiNN.
    cd scripts/train
    ./run_training_trajectories-10k.sh <cuda_device_number>
    ./run_training_trajectories-100k.sh <cuda_device_number>
    ./run_training_trajectories-500k.sh <cuda_device_number>
    

Training NNPs with GOLF

Distributed Gradient Calculation with Psi4

To speed up the training, we parallelize DFT computations using several CPU-rich machines. The training of the NNP takes place on the parent machine with a GPU.

  1. Set up the environment on the GPU machine as in the first section.

  2. Log in to CPU-rich machines. They must be accessible via ssh.

  3. Set up environments on CPU-rich machines.

    # On CPU-rich machines
    git clone https://github.com/AIRI-Institute/GOLF
    cd GOLF/scripts
    ./setup_dft_workers.sh <num_threads> <num_workers> <start_port>
    

    Here, num_threads sets the number of CPU threads per worker, num_workers the number of workers to launch on the machine, and start_port the first port in the range assigned to the workers. Workers calculate energies and forces using psi4 for newly generated conformations. For example, ./setup_dft_workers.sh 4 24 20000 launches 24 workers listening on ports 20000, ..., 20023. The starting port can also be changed in env/dft.py.

    By default, we assume that each worker uses 4 CPU cores (this can be changed in env/dft_worker.py, line 22), which means that num_workers must be less than or equal to total_cpu_number / 4. A minimal illustration of what each worker does is sketched at the end of this subsection.

  4. Add the IP addresses of the CPU-rich machines to a text file. We use env/host_names.txt.
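    For example, host_names.txt might contain one address per line (addresses here are illustrative):

        192.168.0.11
        192.168.0.12
        192.168.0.13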

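For intuition, below is a minimal sketch of what a DFT worker does: listen on its port, receive a conformation, and return psi4 energies and forces. This is an illustration only, not the repo's env/dft_worker.py; the wire format (JSON over TCP) and the level of theory are assumptions.

    # Illustrative DFT worker sketch -- NOT the repo's env/dft_worker.py.
    # The wire format and the level of theory are assumptions.
    import json
    import socket

    import psi4

    PORT = 20000                  # the setup script assigns consecutive ports
    THREADS = 4                   # matches the 4-cores-per-worker default above
    METHOD = "wb97x-d/def2-svp"   # assumed level of theory

    psi4.set_num_threads(THREADS)
    psi4.core.be_quiet()          # suppress psi4's text output


    def compute(xyz_block: str):
        """Return DFT energy (Hartree) and forces for an XYZ-like geometry."""
        mol = psi4.geometry(xyz_block)
        grad, wfn = psi4.gradient(METHOD, molecule=mol, return_wfn=True)
        return wfn.energy(), (-grad.np).tolist()  # forces = -dE/dR


    with socket.create_server(("0.0.0.0", PORT)) as server:
        while True:
            conn, _ = server.accept()
            with conn:
                xyz = conn.recv(1 << 20).decode()  # receive one geometry
                energy, forces = compute(xyz)
                conn.sendall(json.dumps({"energy": energy, "forces": forces}).encode())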
Training with GOLF

Train PaiNN with GOLF.

cd scripts/train
./run_training_GOLF-10k.sh <cuda_device_number>

Evaluating NNPs

The evaluation can be done with or without psi4 energy estimation for NNP-optimization trajectories. The eval_early_stop_steps argument controls at which steps of each optimization trajectory energies/forces are evaluated with psi4. For example, setting eval_early_stop_steps to an empty list results in no additional psi4 energy estimations, and setting it to [1 2 3 5 8 13 21 30 50 75 100] results in 11 additional energy evaluations for each conformation in the evaluation dataset. Note that in order to compute $\overline{\text{pct}}_T$, optimal energies obtained with the genuine oracle $\mathcal{O}$ must be available. In our work, psi4.optimize with a spherical representation of the molecule was used (approximately 30 steps until convergence).
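For reference, the percentage of minimized energy for a single conformation is defined (paraphrasing the paper's definition; notation simplified) as

$$\text{pct}(s_T) = \frac{E(s_0) - E(s_T)}{E(s_0) - E(s_{\text{opt}})} \times 100\%,$$

where $s_0$ is the initial conformation, $s_T$ is the conformation after $T$ steps of NNP optimization, and $s_{\text{opt}}$ is the conformation optimized by the genuine oracle $\mathcal{O}$. $\overline{\text{pct}}_T$ averages this quantity over the evaluation dataset, which is why the oracle-optimal energies must be available.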

In this repo, we provide NNPs pre-trained on the different datasets, as well as NNPs trained with GOLF, in the checkpoints directory:

  • $f^{\text{baseline}}$ (checkpoints/baseline-NNP/NNP_checkpoint)
  • $f^{\text{traj-10k}}$ (checkpoints/traj-10k/NNP_checkpoint)
  • $f^{\text{traj-100k}}$ (checkpoints/traj-100k/NNP_checkpoint)
  • $f^{\text{traj-500k}}$ (checkpoints/traj-500k/NNP_checkpoint)
  • $f^{\text{GOLF-1k}}$ (checkpoints/GOLF-1k/NNP_checkpoint)
  • $f^{\text{GOLF-10k}}$ (checkpoints/GOLF-10k/NNP_checkpoint)

For example, to evaluate GOLF-10k and additionally calculate psi4 energies/forces along the optimization trajectory, run:

python evaluate_batch_dft.py --checkpoint_path checkpoints/GOLF-10k --agent_path NNP_checkpoint_actor --n_parallel 240 --n_threads 24 --conf_number -1 --host_file_path env/host_names.txt --eval_db_path data/GOLF_test.db --timelimit 100 --terminate_on_negative_reward False --reward dft --minimize_on_every_step False --eval_early_stop_steps 1 2 3 5 8 13 21 30 50 75 100

Make sure that n_threads equals the number of workers on each CPU-rich machine; setting n_threads to a larger number will result in optimization failures. If you wish to evaluate only the last state in each optimization trajectory, set timelimit and eval_early_stop_steps to the same number: --timelimit T --eval_early_stop_steps T.
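For instance, a last-state-only run might look like this (a sketch; the remaining flags follow the full command above):

    python evaluate_batch_dft.py --checkpoint_path checkpoints/GOLF-10k --agent_path NNP_checkpoint_actor --n_parallel 240 --n_threads 24 --conf_number -1 --host_file_path env/host_names.txt --eval_db_path data/GOLF_test.db --timelimit 100 --terminate_on_negative_reward False --reward dft --minimize_on_every_step False --eval_early_stop_steps 100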

After the evaluation is finished, an evaluation_metrics.json file with per-step metrics will be created. Each record in evaluation_metrics.json describes optimization statistics for a single conformation and contains metrics such as forces/energies MSE, percentage of optimized energy, and predicted and ground-truth energies. The final NNP-optimized conformations are stored in the results.db database.
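A record might look roughly like the following (field names are illustrative, not the exact schema):

    {
      "conformation_id": 42,
      "pct_of_minimized_energy": 98.7,
      "predicted_energy": -386.42,
      "ground_truth_energy": -386.45,
      "energy_mse": 0.0009,
      "forces_mse": 0.0012
    }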

Citation

To cite this work, please use:

@inproceedings{tsypin2023gradual,
  title={Gradual Optimization Learning for Conformational Energy Minimization},
  author={Tsypin, Artem and Ugadiarov, Leonid Anatolievich and Khrabrov, Kuzma and Telepov, Alexander and Rumiantsev, Egor and Skrynnik, Alexey and Panov, Aleksandr and Vetrov, Dmitry P and Tutubalina, Elena and Kadurin, Artur},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2023}
}
