The Intersection of Machine Learning and Drug Discovery
A comprehensive guide to leveraging advanced machine learning techniques across the drug discovery process — from molecular property prediction and generative chemistry to protein structure, docking, and ADMET modelling.
1. Introduction
Machine learning is now used at most stages of drug discovery, from target identification and validation to lead optimisation. This article covers four application areas: predicting molecular properties, generating novel molecules, predicting protein structures and docked poses, and estimating ADMET profiles.
2. Molecular Property Prediction
Machine learning algorithms can predict molecular properties such as biological activity, toxicity, and physicochemical characteristics[33]. These predictions are crucial for prioritising compounds in the early stages of drug development.
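As a rough illustration of property prediction from structure, the sketch below implements a similarity-weighted nearest-neighbour baseline over molecular fingerprints. It assumes each molecule has already been converted to a set of "on" fingerprint bits by some cheminformatics toolkit; the bit sets, labels, and function names here are hypothetical and chosen only for the example.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def knn_predict(query: frozenset, train: list, k: int = 3) -> float:
    """Similarity-weighted mean of the labels of the k most similar training molecules."""
    scored = sorted(
        ((tanimoto(query, fp), label) for fp, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0:
        # No structural overlap at all: fall back to an unweighted mean.
        return sum(label for _, label in scored) / len(scored)
    return sum(sim * label for sim, label in scored) / total


# Hypothetical training set: (fingerprint bits, measured activity).
train = [
    (frozenset({1, 2, 3, 4}), 6.0),
    (frozenset({1, 2, 3, 5}), 5.0),
    (frozenset({10, 11, 12}), 2.0),
]
prediction = knn_predict(frozenset({1, 2, 3, 4}), train, k=2)
print(prediction)
```

A baseline like this is mainly useful as a sanity check: a learned model that cannot beat it on held-out compounds is probably not adding much.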
2.1 SMILES-Based Approaches
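REINVENT and similar models generate molecules as SMILES strings: a recurrent neural network (RNN) is first trained on known molecules and then fine-tuned with reinforcement learning so that its samples score well on a user-defined objective. Because SMILES strings are cheap to sample and score, the approach suits the rapid design iterations common in industrial settings.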
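The reinforcement-learning step can be sketched with the "augmented likelihood" objective described in the original REINVENT paper, as far as I understand it: the agent is pushed towards the prior's log-likelihood shifted by a scaled score. The function names and the value of sigma below are assumptions for illustration, and the log-likelihoods would in practice come from the RNNs rather than being passed in as numbers.

```python
def augmented_log_likelihood(prior_logp: float, score: float, sigma: float) -> float:
    """Target log-likelihood: the prior's value shifted up by the scaled score."""
    return prior_logp + sigma * score


def reinvent_loss(agent_logp: float, prior_logp: float, score: float,
                  sigma: float = 10.0) -> float:
    """Squared gap between the augmented target and the agent's log-likelihood.

    sigma = 10.0 is an arbitrary illustrative value, not a recommended setting.
    """
    target = augmented_log_likelihood(prior_logp, score, sigma)
    return (target - agent_logp) ** 2


# One sampled SMILES string: the prior assigns log P = -25, the scoring
# function returns 0.5, and the current agent assigns log P = -30.
loss = reinvent_loss(agent_logp=-30.0, prior_logp=-25.0, score=0.5)
print(loss)
```

Anchoring the target to the prior is what keeps the fine-tuned agent close to chemically sensible SMILES while it is steered towards higher scores.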
2.2 Graph Approaches
Junction Tree Variational Autoencoders (JT-VAE)[34] generate molecular graphs directly by assembling chemically valid substructures, which ensures valence validity by construction and facilitates scaffold hopping. MoFlow uses normalising flows, invertible mappings that give exact likelihoods and allow molecules to be sampled in a single pass.
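To show why an invertible mapping gives exact likelihoods, here is a minimal one-dimensional affine flow using the change-of-variables rule. This is only a sketch of the principle and assumes a standard normal base distribution; it is not MoFlow's architecture, which operates on graph tensors.

```python
import math


def standard_normal_logpdf(z: float) -> float:
    """Log density of N(0, 1) at z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


class AffineFlow:
    """Invertible map x = scale * z + shift with z drawn from N(0, 1)."""

    def __init__(self, scale: float, shift: float) -> None:
        if scale == 0:
            raise ValueError("scale must be non-zero for the map to be invertible")
        self.scale = scale
        self.shift = shift

    def forward(self, z: float) -> float:
        """Sampling direction: latent z to data x."""
        return self.scale * z + self.shift

    def inverse(self, x: float) -> float:
        """Likelihood direction: data x back to latent z."""
        return (x - self.shift) / self.scale

    def log_prob(self, x: float) -> float:
        """Exact log density: log p_z(f^{-1}(x)) + log |d f^{-1} / dx|."""
        z = self.inverse(x)
        return standard_normal_logpdf(z) - math.log(abs(self.scale))


flow = AffineFlow(scale=2.0, shift=1.0)
print(flow.log_prob(1.0))
```

Sampling needs one call to forward and scoring needs one call to inverse, which is the property that makes flows efficient in both directions.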
2.3 Diffusion Models
Models like DiffSBDD[35] condition ligand generation on the 3D geometry of the protein pocket and are reported to produce drug-like molecules with favourable predicted binding affinities. They are also reported to outperform earlier autoregressive generators on measures such as predicted affinity and drug-likeness.
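The noising half of a diffusion model can be sketched as follows: ligand coordinates are progressively corrupted with Gaussian noise, and a network (not shown) is trained to reverse the process given the pocket. The linear schedule and its parameters are assumptions borrowed from generic DDPM-style models, not DiffSBDD's actual settings, and the centring and equivariance handling a real model needs are omitted.

```python
import math
import random


def linear_alpha_bar(t: int, num_steps: int,
                     beta_start: float = 1e-4, beta_end: float = 0.02) -> float:
    """Cumulative product of (1 - beta_s) for s = 1..t under a linear beta schedule."""
    product = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        product *= 1.0 - beta
    return product


def noise_ligand(ligand_xyz: list, t: int, num_steps: int, rng: random.Random) -> list:
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps per coordinate."""
    a_bar = linear_alpha_bar(t, num_steps)
    signal = math.sqrt(a_bar)
    sigma = math.sqrt(1.0 - a_bar)
    return [
        tuple(signal * c + sigma * rng.gauss(0.0, 1.0) for c in atom)
        for atom in ligand_xyz
    ]


# Hypothetical three-atom ligand; pocket atoms would stay fixed as conditioning.
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
rng = random.Random(0)
noisy = noise_ligand(ligand, t=500, num_steps=1000, rng=rng)
print(noisy)
```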
3. Protein Structure and Molecular Docking
Deep learning has substantially improved protein structure prediction and is changing how molecular docking is done, which opens structure-based design to targets that previously lacked usable structures.
3.1 Classical Docking Methods
Classical docking tools such as AutoDock Vina search over ligand poses and conformations and rank them with an approximate, fast-to-evaluate scoring function. The search is computationally intensive, and these methods struggle with protein flexibility and with screening very large ligand libraries.
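The search-plus-scoring loop behind classical docking can be sketched with a toy example. The scoring terms and the random translation search below are deliberately simplistic assumptions: they are not AutoDock Vina's scoring function or search algorithm, and rotations and torsions are ignored.

```python
import math
import random


def pair_score(distance: float, ideal: float = 4.0, width: float = 1.0,
               clash: float = 2.5) -> float:
    """Toy pairwise term: penalise clashes, reward contacts near the ideal distance."""
    if distance < clash:
        return 10.0 * (clash - distance) ** 2
    return -math.exp(-((distance - ideal) / width) ** 2)


def score_pose(ligand: list, protein: list) -> float:
    """Sum the pairwise term over all ligand-protein atom pairs (lower is better)."""
    return sum(pair_score(math.dist(a, b)) for a in ligand for b in protein)


def random_translation_search(ligand: list, protein: list, n_trials: int,
                              step: float, rng: random.Random) -> tuple:
    """Greedy rigid-body search: keep a random shift only if it lowers the score."""
    best_pose, best_score = ligand, score_pose(ligand, protein)
    for _ in range(n_trials):
        shift = [rng.uniform(-step, step) for _ in range(3)]
        candidate = [tuple(c + s for c, s in zip(atom, shift)) for atom in best_pose]
        candidate_score = score_pose(candidate, protein)
        if candidate_score < best_score:
            best_pose, best_score = candidate, candidate_score
    return best_pose, best_score


# Hypothetical coordinates in angstroms.
protein = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
ligand = [(10.0, 0.0, 0.0), (11.5, 0.0, 0.0)]
pose, score = random_translation_search(ligand, protein, 500, 1.0, random.Random(0))
print(score)
```

Every trial costs one pass over all atom pairs, which is why sampling dominates the runtime once ligand flexibility and large libraries are added.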
3.2 Deep Learning Docking
DiffDock[36] treats docking as a generative problem: a diffusion process over ligand translations, rotations, and torsion angles is guided by an SE(3)-equivariant graph neural network, and it is reported to predict poses more accurately than classical methods while remaining fast. EquiBind, an earlier model, predicts a pose directly in a single forward pass, which is faster still but generally less accurate.
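Pose accuracy in docking benchmarks is usually summarised as the fraction of predictions within some RMSD of the crystallographic ligand pose. The sketch below assumes matched atom ordering and a 2 Å cut-off, which is a common convention rather than something stated in this article; symmetry correction is left out.

```python
import math


def pose_rmsd(predicted: list, reference: list) -> float:
    """Heavy-atom RMSD in the protein frame, with no superposition of the ligand."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("poses must be non-empty and have the same number of atoms")
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))


def success_rate(pairs: list, threshold: float = 2.0) -> float:
    """Fraction of (predicted, reference) pose pairs with RMSD below the threshold."""
    hits = sum(1 for predicted, reference in pairs if pose_rmsd(predicted, reference) < threshold)
    return hits / len(pairs)


# Hypothetical poses: one prediction is off by 1 angstrom, the other by 3.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
close_pose = [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0)]
far_pose = [(3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
rate = success_rate([(close_pose, reference), (far_pose, reference)])
print(rate)
```

The ligand is not superimposed before comparison because a docked pose is only correct if it sits in the right place relative to the protein.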
3.3 Joint Structure Prediction with AlphaFold 3
AlphaFold 3 (Abramson et al., 2024)[39] generalises the AlphaFold 2 framework to jointly predict structures of complexes spanning proteins, nucleic acids, small-molecule ligands, ions, and post-translational modifications. Its Pairformer backbone is coupled to a diffusion module that directly denoises atomic coordinates, and on benchmark protein–ligand tasks it reports accuracy well above classical docking while also handling targets that previously required separate specialised tools.
4. ADMET Prediction
Predicting the ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties of drug candidates early in development is critical for reducing late-stage clinical trial failures[37]. Machine learning models trained on large experimental datasets can estimate these endpoints before a compound is synthesised, although accuracy varies by endpoint and with the amount of data available.
4.1 Key ADMET Endpoints
Vital ADMET endpoints include Caco-2 permeability, plasma protein binding, microsomal stability, renal clearance, and hERG channel inhibition. These are commonly predicted using multi-task learning approaches.
4.2 Multi-Task Learning & Uncertainty
Sharing a common molecular encoder across related tasks lets endpoints with sparse data benefit from representations learned on data-rich ones. Uncertainty quantification complements this by flagging low-confidence predictions, which matters most for out-of-distribution compounds where the model would otherwise extrapolate unchecked.
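Two small pieces of this setup can be sketched without a deep learning framework. The first is a loss that skips unmeasured endpoints, so compounds with partial labels still contribute to training. The second flags predictions on which an ensemble of models disagrees. Using an ensemble as the uncertainty estimate is an assumption, since the text does not name a method, and the threshold and function names are hypothetical.

```python
import statistics


def masked_mse(predictions: list, labels: list) -> float:
    """Mean squared error over measured endpoints only; None marks a missing label."""
    errors = [(p - y) ** 2 for p, y in zip(predictions, labels) if y is not None]
    return sum(errors) / len(errors) if errors else 0.0


def ensemble_summary(member_predictions: list, max_std: float) -> tuple:
    """Return (mean, spread, flagged) for one endpoint predicted by several models."""
    mean = statistics.fmean(member_predictions)
    spread = statistics.pstdev(member_predictions)
    return mean, spread, spread > max_std


# One compound, three endpoints, the second of which was never measured.
loss = masked_mse([1.0, 2.0, 3.0], [1.0, None, 5.0])

# Two hypothetical compounds scored by a three-model ensemble.
in_domain = ensemble_summary([0.50, 0.52, 0.48], max_std=0.1)
out_of_domain = ensemble_summary([0.10, 0.90, 0.50], max_std=0.1)
print(loss, in_domain[2], out_of_domain[2])
```

Flagged compounds can then be routed to an experimental assay rather than being ranked on a prediction the models do not agree on.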