Bioinformatics: Multi-Omics Integration and Protein Structure
High-throughput genomic technologies have enabled detailed investigation into the genetic and molecular underpinnings of complex diseases. Machine learning, through multi-omics factor analysis and multimodal protein language models, is now central to extracting mechanistic insight from these datasets.
1. Mechanistic Analysis of Genomic Processes Using Machine Learning: A Case Study on Ulcerative Colitis
1.1 Introduction
The advent of high-throughput genomic technologies has enabled detailed investigation into the genetic and molecular underpinnings of complex diseases such as ulcerative colitis (UC). Despite significant progress in understanding UC, a comprehensive mechanistic analysis integrating diverse genomics data remains challenging. This section presents a machine learning approach to elucidate the regulatory mechanisms contributing to UC by integrating multiple omics layers.
1.2 Methods
Data Collection and Preprocessing
ATAC-seq, RNA-seq, and proteomic data are collected from affected and unaffected colonic tissue of patients diagnosed with ulcerative colitis. Raw sequencing reads from the ATAC-seq and RNA-seq assays are processed with standard pipelines to produce aligned BAM files and quantified gene expression matrices.
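As an illustration of what typically happens between a quantified count matrix and integration, the sketch below converts raw counts to log2 counts-per-million so that samples sequenced at different depths become comparable. The text does not state which normalization was used, so this step, the function name `log_cpm`, and the `pseudocount` parameter are assumptions made for the example.

```python
import math

def log_cpm(counts, pseudocount=1.0):
    """Convert raw counts (genes x samples) to log2(CPM + pseudocount).

    counts: list of rows, one row per gene, one column per sample.
    """
    n_samples = len(counts[0])
    # Library size = total counts assigned to genes in each sample.
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    if any(size == 0 for size in lib_sizes):
        raise ValueError("every sample needs at least one count")
    return [
        [math.log2(row[j] / lib_sizes[j] * 1e6 + pseudocount)
         for j in range(n_samples)]
        for row in counts
    ]

# Toy matrix: the second sample was sequenced twice as deeply as the first.
raw = [[10, 20], [90, 180]]
normalized = log_cpm(raw)
```

In this toy matrix the two columns become identical after normalization, because the only difference between the samples was sequencing depth.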
Data Integration
To integrate the heterogeneous omics datasets, MOFA+ (multi-omics factor analysis) is employed to learn latent factors shared across modalities. Because each factor can draw on transcriptomic, epigenetic (chromatin accessibility), and protein-level measurements at once, this approach can identify regulatory processes common to the layers that underlie UC pathogenesis.
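The idea of a shared latent factor can be sketched without the MOFA+ machinery. The example below is not MOFA+ (which fits a Bayesian factor model with sparsity priors); it is a simplified stand-in that z-scores each view, stacks the views, and extracts one leading factor by power iteration, then reports how much of each view's variance that factor explains. All names (`zscore_rows`, `shared_factor`) and the toy data are hypothetical.

```python
import math

def zscore_rows(view):
    """Center each feature (row) and scale it to unit variance."""
    scaled = []
    for row in view:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((x - mean) ** 2 for x in row) / len(row))
        scaled.append([(x - mean) / sd if sd > 0 else 0.0 for x in row])
    return scaled

def shared_factor(views, n_iter=200):
    """Leading factor shared across views, found by power iteration.

    views: dict mapping view name -> matrix (features x samples); every
    view must list the same samples in the same order.
    Returns (scores, r2): unit-norm per-sample scores and, per view, the
    fraction of that view's variance the factor explains.
    """
    scaled = {name: zscore_rows(view) for name, view in views.items()}
    stacked = [row for view in scaled.values() for row in view]
    n_samples = len(stacked[0])
    scores = [float(j + 1) for j in range(n_samples)]  # deterministic start
    for _ in range(n_iter):
        # Loadings w = X z, then updated scores z = X^T w, renormalized.
        loadings = [sum(x * z for x, z in zip(row, scores)) for row in stacked]
        scores = [sum(w * row[j] for w, row in zip(loadings, stacked))
                  for j in range(n_samples)]
        norm = math.sqrt(sum(z * z for z in scores))
        if norm == 0:
            raise ValueError("no shared variation found")
        scores = [z / norm for z in scores]
    r2 = {}
    for name, view in scaled.items():
        explained = sum(sum(x * z for x, z in zip(row, scores)) ** 2
                        for row in view)
        total = sum(x * x for row in view for x in row)
        r2[name] = explained / total if total > 0 else 0.0
    return scores, r2

# Toy data: six samples (three UC, three control). The RNA and ATAC views
# share one signal; the protein view varies independently of it.
views = {
    "rna": [[5, 5, 5, 1, 1, 1], [8, 8, 8, 2, 2, 2]],
    "atac": [[0, 0, 0, 3, 3, 3]],
    "protein": [[2, 0, 1, 2, 0, 1]],
}
scores, r2 = shared_factor(views)
```

On this toy input the factor separates the first three samples from the last three, and the explained-variance fraction is close to 1 for the RNA and ATAC views and close to 0 for the protein view. A per-view breakdown of this kind is what makes a factor interpretable as shared or modality-specific.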
1.3 Results
The integrated analysis reveals several key regulatory pathways that are dysregulated in ulcerative colitis tissues compared to healthy controls. Specifically, it identifies a set of transcription factors (TFs) and chromatin modifiers whose altered activity is associated with aberrant, inflammation-related gene expression patterns.
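The text does not say how altered activity was assessed. One common check, shown here purely as an assumption, is to compare per-sample factor scores between UC and control tissue with Welch's t statistic. The function name `welch_t` and the score values are hypothetical, not results from this study.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic and approximate degrees of freedom.

    Each group needs at least two values with non-zero spread overall.
    """
    # Squared standard error of each group mean (sample variance / n).
    se_a = variance(group_a) / len(group_a)
    se_b = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(group_a) - 1) + se_b ** 2 / (len(group_b) - 1)
    )
    return t, df

# Hypothetical factor scores for illustration only.
uc_scores = [2.0, 2.5, 3.0]
control_scores = [0.0, 0.5, 1.0]
t_stat, dof = welch_t(uc_scores, control_scores)
```

A large absolute t statistic suggests the factor tracks disease status. With many factors tested, a multiple-testing correction would be needed before drawing conclusions.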
1.4 Discussion
This case study illustrates how machine learning can uncover mechanistic insight into complex diseases by integrating multiple omics layers. The identified regulatory network offers a framework for further investigation of UC pathogenesis and of potential therapeutic targets.
2. Advancing Protein Structure Prediction with ESM3: A Comprehensive Model for Multimodal Protein Design
The development of effective protein structure prediction models has been instrumental in advancing our understanding of biological systems. Building upon the groundbreaking work of AlphaFold 2[17] and ESMFold[18], the recent introduction of ESM3 (Hayes et al., 2025)[40] represents a significant leap forward.
2.1 ESM3 Architecture
ESM3 is designed to condition on sequence, structure, and functional annotations simultaneously. This multimodal conditioning enables fine-grained control over protein design across all three modalities, facilitating the creation of proteins with specific desired properties.
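To make multimodal conditioning concrete, the sketch below models a prompt as three per-residue tracks, any of which may be left partly unspecified. It is a hypothetical data structure for illustration, not the ESM3 API: the class names, the `_` mask character, and the single-coordinate representation of structure are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

MASK = "_"  # assumed placeholder meaning "residue left for the model to choose"

@dataclass
class FunctionAnnotation:
    label: str
    start: int  # first residue covered (0-based, inclusive)
    end: int    # one past the last residue covered (exclusive)

@dataclass
class ProteinPrompt:
    sequence: str
    # One coordinate per residue, or None where structure is unspecified.
    structure: List[Optional[Tuple[float, float, float]]]
    functions: List[FunctionAnnotation] = field(default_factory=list)

    def __post_init__(self):
        if len(self.structure) != len(self.sequence):
            raise ValueError("structure track needs one entry per residue")
        for ann in self.functions:
            if not 0 <= ann.start < ann.end <= len(self.sequence):
                raise ValueError(f"annotation {ann.label!r} is out of range")

    def conditioning(self) -> List[Tuple[str, ...]]:
        """For each residue, name the tracks that constrain it."""
        tracks = []
        for i, residue in enumerate(self.sequence):
            active = []
            if residue != MASK:
                active.append("sequence")
            if self.structure[i] is not None:
                active.append("structure")
            if any(ann.start <= i < ann.end for ann in self.functions):
                active.append("function")
            tracks.append(tuple(active))
        return tracks

# Residues 2 and 3 have no fixed amino acid but are constrained by
# structure and by a functional annotation.
prompt = ProteinPrompt(
    sequence="MK__G",
    structure=[None, None, (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), None],
    functions=[FunctionAnnotation("binding site", 2, 4)],
)
per_residue = prompt.conditioning()
```

Here the masked residues carry no sequence constraint yet are still pinned down by the structure and function tracks, which is the kind of partial, mixed specification that conditioning on all three modalities allows.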
2.2 Applications in Drug Design
The ability to predict atomic-level structures without relying on multiple sequence alignments (MSAs) makes ESM3 valuable for drug discovery. Because inference skips the MSA search step, the model can be used to screen large protein databases quickly and to design novel binders, which can accelerate the identification of potential therapeutic candidates.