Machine Learning for Genomics Explorations (MLGenX)
The main objective of this workshop is to bridge the gap between machine learning (ML) and functional genomics (Gen), focusing on target identification, a pivotal step in drug discovery. Our goal is to explore this challenging aspect of modern drug development: identifying the biological targets that play a critical role in modulating disease. We will delve into the intersection of ML and genomics, with a specific focus on areas where data availability has expanded thanks to emerging technologies (e.g., large-scale genomic screens and single-cell and spatial omics platforms). From a biological perspective, our discussions will encompass sequence design, molecular perturbations, and single-cell and spatial omics, shedding light on key biological questions in target identification. On the ML front, we aim to address topics such as interpretability, foundation models for genomics and biology, generalizability, and causal discovery, emphasizing the significance of ML in advancing target identification.
Overview
The critical bottleneck in drug discovery remains our limited understanding of the biological mechanisms underlying disease. Consequently, we often do not know why patients develop specific diseases, and many drug candidates fail in clinical trials. Recent advances in genomics platforms and the growth of diverse omics datasets have ignited growing interest in this field. In parallel, machine learning has played a pivotal role in driving successes in language processing, image analysis, and molecular design. The boundaries between these two domains are becoming increasingly blurred, particularly with the emergence of modern foundation models that sit at the intersection of data-driven approaches, self-supervised techniques, and genomic exploration. This workshop aims to elucidate the intricate relationship between genomics, target identification, and fundamental machine learning methods. By strengthening the connection between machine learning and target identification via genomics, new possibilities for interdisciplinary research in these areas will emerge.
The goal of this workshop is to bring together communities at the intersection of machine learning and genomics to discuss areas of interaction and explore possibilities for future areas of research.
During this workshop, participants will gain valuable insights into the synergies between ML and genomics-related research, and help refine the next generation of applied and theoretical ML methods for target identification. We look forward to your participation in this exciting discourse on the future of (foundational) genomics and AI.
Call for Papers
We consider a broad range of subject areas including but not limited to the following topics:
Foundation models for genomics
Biological sequence design
Interpretability and generalizability in genomics
Causal representation learning
Perturbation biology
Modeling long-range dependencies in sequences, single-cell and spatial omics
Integrating multimodal perturbation readouts
Active learning in genomics
Generative models in biology
Multimodal representation learning
Uncertainty quantification
Optimal transport
Experimental design for biology
Graph neural networks and knowledge graphs
New datasets and benchmarks for genomics explorations
We welcome both contributions that introduce new ML methods for existing problems and those that highlight and explain open problems.
We also encourage submissions on applications in molecular biology, including but not limited to single-cell RNA analysis, bulk RNA studies, proteomics, and microscopy imaging of cells and tissues.
Important Dates
All deadlines are 11:59 pm UTC-12 ("Anywhere on Earth"). All authors must have an OpenReview profile when submitting.
Submission Deadline: February 8, 2024
Acceptance Notification: March 3, 2024
Camera-Ready Deadline: April 26, 2024
Workshop Date: Saturday, May 11, 2024 (in-person)
Workshop Registration
Everyone is welcome to attend, whether you're a seasoned professional or a curious enthusiast; you do not need an accepted paper to participate.
If you have already registered for ICLR, you can join us at the MLGenX workshop. If you're interested only in the workshop, you can still participate by registering for the "Saturday Workshop 1 Day Pass". Please visit this link to secure your spot.
We look forward to meeting you in Vienna!
Speakers & Panelists
Silvia Chiappa
Google DeepMind
James Zou
Stanford University
Jason Hartford
Recursion
Lindsay Edwards
CTO, Relation Therapeutics
Nicola Richmond
VP, BenevolentAI
Kyunghyun Cho
NYU, Genentech
Michael Bronstein
University of Oxford
Brian Hie
Stanford University
Bianca Dumitrascu
Columbia University
Schedule (CET)
Title: (Invited Speaker) Functional Causal Bayesian Optimization and DiscoGen for Learning Optimal Interventions and Inferring Gene Regulatory Networks
Presenter: Silvia Chiappa
Bio
https://csilviavr.github.io
Abstract
Advances in scientific disciplines such as biology and medicine require solving causal problems such as learning optimal interventions or inferring causal structure. In this talk, I will introduce Functional Causal Bayesian Optimization, a graph-based sequential decision making method that considers contextual interventions, and DiscoGen, a supervised transformer-based method that enables inferring large and cyclic gene regulatory networks.
Title: (Invited Speaker) Leveraging (natural) language models for biology
Presenter: James Y Zou
Bio
https://www.james-zou.com
Abstract
Biological language models, such as ESM2 for proteins, achieve impressive capabilities purely by learning correlation patterns between sequences. However, they are mostly ignorant of the vast biological knowledge in the literature and previous experiments. On the other hand, natural language models like GPT-4 have been trained on vast amounts of scientific papers. In this talk, I will discuss our recent work on how to get the best of both worlds by leveraging natural language models for biology. I will present GenePT, an approach for projecting single-cell data into the semantic space of GPT-4. Then I will discuss ProteinCLIP, which enhances the capabilities of models like ESM2 by integrating text knowledge.
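To make the GenePT idea above a little more concrete, the sketch below embeds short gene descriptions with a generic off-the-shelf text encoder and averages them, weighted by expression, into a cell-level representation. The encoder, gene descriptions, and weighting scheme are illustrative assumptions, not the actual GenePT implementation (which uses GPT-based embeddings).

# Sketch of a GenePT-style cell embedding: embed gene descriptions with a
# generic text encoder, then average them weighted by expression.
# The encoder, descriptions, and weighting are illustrative assumptions,
# not the GenePT implementation.
import numpy as np
from sentence_transformers import SentenceTransformer  # any text encoder works here

gene_descriptions = {
    "TP53": "Tumor suppressor regulating cell cycle arrest and apoptosis.",
    "MYC": "Transcription factor controlling proliferation and metabolism.",
    "GATA1": "Transcription factor required for erythroid development.",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")
genes = list(gene_descriptions)
gene_emb = encoder.encode([gene_descriptions[g] for g in genes])  # (n_genes, d)

# Toy expression vector for one cell (e.g. normalized counts), in gene order.
expression = np.array([0.1, 2.5, 0.7])

# Cell embedding = expression-weighted average of gene-description embeddings.
weights = expression / expression.sum()
cell_embedding = weights @ gene_emb
print(cell_embedding.shape)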
Title: (Oral Paper) DNA-DIFFUSION: Leveraging generative models for controlling chromatin accessibility and gene expression via synthetic regulatory elements
Presenter: Luca Pinello
Authors
Luca Pinello
Abstract
The challenge of systematically modifying and optimizing regulatory elements for precise gene expression control is central to modern genomics and synthetic biology. Advancements in generative AI have paved the way for designing synthetic sequences with the aim of safely and accurately modulating gene expression. We leverage diffusion models to design context-specific DNA regulatory sequences, which hold significant potential toward enabling novel therapeutic applications requiring precise modulation of gene expression. Our framework uses a cell type-specific diffusion model to generate synthetic 200 bp DNA regulatory elements based on chromatin accessibility across different cell types. We evaluate the generated sequences based on key metrics to ensure they retain properties of endogenous sequences: transcription factor binding site composition, potential for cell type-specific chromatin accessibility, and capacity for sequences generated by DNA diffusion to activate gene expression in different cell contexts using state-of-the-art prediction models. Our results demonstrate the ability to robustly generate DNA sequences with cell type-specific regulatory potential. DNA-Diffusion paves the way for revolutionizing a regulatory modulation approach to mammalian synthetic biology and precision gene therapy.
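One of the evaluation criteria mentioned in the abstract, transcription factor binding site composition, can be illustrated with a simple position weight matrix (PWM) scan over generated sequences. The motif, threshold, and scoring below are toy placeholders, not the metrics or tools used in the paper.

# Minimal sketch of one evaluation idea from the abstract: scanning a generated
# 200 bp sequence for matches to a transcription factor motif (PWM).
# The motif and score threshold are toy placeholders, not the paper's metrics.
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """One-hot encode a DNA string into a (len, 4) array."""
    idx = np.array([BASES.index(b) for b in seq])
    out = np.zeros((len(seq), 4))
    out[np.arange(len(seq)), idx] = 1.0
    return out

# Toy 6 bp motif as a log-odds position weight matrix (rows: positions, cols: ACGT).
pwm = np.log2(np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]) / 0.25)

def motif_hits(seq, pwm, threshold=6.0):
    """Count windows whose PWM log-odds score exceeds the threshold."""
    x = one_hot(seq)
    k = pwm.shape[0]
    scores = [float((x[i:i + k] * pwm).sum()) for i in range(len(seq) - k + 1)]
    return sum(s >= threshold for s in scores)

generated = "".join(np.random.choice(list(BASES), size=200))  # stand-in for a generated sequence
print(motif_hits(generated, pwm))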
Title: (Oral Paper) Dirichlet flow matching with applications to DNA sequence design
Presenter: Gabriele Corso
Authors
Hannes Stark, Bowen Jing, Chenyu Wang, Gabriele Corso, Bonnie Berger, Regina Barzilay, Tommi Jaakkola
Abstract
Discrete diffusion or flow models could enable faster and more controllable sequence generation than autoregressive models. We show that naive linear flow matching on the simplex is insufficient toward this goal since it suffers from discontinuities in the training target and further pathologies. To overcome this, we develop Dirichlet flow matching on the simplex based on mixtures of Dirichlet distributions as probability paths. In this framework, we derive a connection between the mixtures' scores and the flow's vector field that allows for classifier and classifier-free guidance. Further, we provide distilled Dirichlet flow matching, which enables one-step sequence generation with minimal performance hits, resulting in O(L) speedups compared to autoregressive models. On complex DNA sequence generation tasks, we demonstrate superior performance compared to all baselines in distributional metrics and in achieving desired design targets for generated sequences. Finally, we show that our classifier-free guidance approach improves unconditional generation and is effective for generating DNA that satisfies design targets.
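For intuition about the Dirichlet probability paths described above, the sketch below samples points on the 4-way DNA simplex from Dirichlet distributions whose concentration on the target base grows with time, so samples move from near-uniform noise toward the target vertex. The parameterization is a simplification for illustration, not the exact construction derived in the paper.

# Illustrative sketch of a Dirichlet probability path on the 4-way DNA simplex:
# at t = 0 samples are near-uniform over {A, C, G, T}; as t grows, probability
# mass concentrates on the target base. This is a simplified parameterization
# for intuition, not the exact construction used in the paper.
import numpy as np

rng = np.random.default_rng(0)

def sample_path(target_idx, t, n_samples=5):
    """Sample simplex points from Dir(1 + t * e_target)."""
    alpha = np.ones(4)
    alpha[target_idx] += t
    return rng.dirichlet(alpha, size=n_samples)

for t in [0.0, 2.0, 20.0]:
    samples = sample_path(target_idx=2, t=t)  # target base: G
    print(f"t={t:5.1f}  mean simplex point: {samples.mean(axis=0).round(3)}")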
Title: (Panel Discussion) Foundation models in biology
Panelists: Kyunghyun Cho, Lindsay Edwards, Nicola Richmond, Bianca Dumitrascu, Michael Bronstein, Aïcha Bentaieb
Abstract
A panel discussion on foundation models in biology.
Title: (Invited Speaker) Efficiently detecting interactions from high dimensional observations of pairwise perturbations
Presenter: Jason Hartford
Bio
Jason Hartford is a Research Unit Lead and Staff Research Scientist at Valence Labs and an incoming Assistant Professor at the University of Waterloo and Vector Affiliate Member (starting July 2024). Previously, he was a postdoc with Prof Yoshua Bengio at Mila where he worked on causal representation learning. Before joining Mila, he completed his Master's and PhD at the University of British Columbia with Prof Kevin Leyton-Brown.
Abstract
In principle, pairwise perturbations—such as pairwise gene knockouts or drug combinations—allow us to observe interactions between perturbants, but experiments of this sort are expensive because experimental costs scale quadratically in the number of perturbants, and when your observations are high dimensional (e.g. microscopy images or expression data), it is not obvious how to measure these interactions. In this talk, I will discuss recent work that shows how we can combine representations of single gene perturbations to detect interactions in pairwise perturbations, and then I will show how this can be used as a reward in an active learning algorithm. By leveraging active learning, we are able to avoid the quadratic costs and efficiently find these interactions. Finally, I will discuss theory that gives a formal characterization of the assumptions that need to hold for the detected interactions to correspond to real biological interactions.
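A generic sketch of the core idea, under illustrative assumptions rather than the presented method: compose two single-perturbation representations (here by simple addition) to predict the pairwise readout, score measured pairs by how far observations deviate from that prediction, and use the scores to prioritize which unmeasured pairs to run next.

# Generic sketch: predict a pairwise-perturbation readout by composing the two
# single-perturbation representations (simple addition here), score each
# measured pair by its deviation from that prediction, and use the scores to
# choose which unmeasured pairs to run next. The additive composition, toy
# data, and selection rule are assumptions for illustration only.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_genes, dim = 8, 16

single = rng.normal(size=(n_genes, dim))          # single-perturbation embeddings
measured = {(0, 1): single[0] + single[1] + 2.0,  # pair (0, 1) deviates: interaction
            (2, 3): single[2] + single[3]}        # pair (2, 3) is purely additive

def interaction_score(i, j, observation):
    """Distance between the observed pair readout and the additive prediction."""
    return float(np.linalg.norm(observation - (single[i] + single[j])))

scores = {pair: interaction_score(*pair, obs) for pair, obs in measured.items()}
print(scores)  # (0, 1) should score much higher than (2, 3)

# Naive acquisition rule: propose unmeasured pairs involving genes that already
# showed strong interactions (a stand-in for a proper active-learning reward).
hot_genes = {g for pair, s in scores.items() if s > 1.0 for g in pair}
candidates = [p for p in itertools.combinations(range(n_genes), 2)
              if p not in measured and (p[0] in hot_genes or p[1] in hot_genes)]
print(candidates[:5])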
Title: (Oral Paper) Season combinatorial intervention predictions with Salt & Peper
Presenter: Thomas Gaudelet
Authors
Thomas Gaudelet, Alice Del Vecchio, Eli M Carrami, Juliana Cudini, Chantriolnt-Andreas Kapourani, Caroline Uhler, Lindsay Edwards
Abstract
In biology, interventions—particularly genetic ones enabled by CRISPR technologies—play a pivotal role in the study of complex systems. These interventions are instrumental in both identifying potential therapeutic targets and understanding the mechanisms of action for existing treatments. With the advancement of CRISPR and the proliferation of genome-scale analyses, the challenge shifts to navigating the vast combinatorial space of genetic interventions. Addressing this, our work concentrates on estimating the effects of pairwise genetic combinations. We introduce two novel contributions: Salt, a biologically-inspired baseline that posits the mostly additive nature of combination effects, and Peper, a deep learning model that extends Salt's additive assumption to achieve unprecedented accuracy. Our comprehensive comparison against existing state-of-the-art methods, grounded in diverse metrics, and our out-of-distribution analysis highlight the limitations of current models in realistic settings. This analysis underscores the necessity for improved modeling techniques and data acquisition strategies, paving the way for more effective exploration of genetic intervention effects.
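The additive assumption behind Salt can be sketched in a few lines: predict the effect of a double perturbation as the sum of the two single-perturbation effects and compare against the observed double effect. The toy data and metrics below are for illustration only and are not the paper's model, data, or benchmarks; a learned model in the spirit of Peper would presumably aim to capture the residual, non-additive component.

# Minimal sketch of an additive baseline in the spirit of Salt: predict the
# transcriptional effect of a double perturbation as the sum of the two single
# perturbation effects, then compare to the observed double effect. The toy
# data and evaluation are illustrative, not the paper's model or benchmarks.
import numpy as np

rng = np.random.default_rng(2)
n_readout_genes = 50

effect_a = rng.normal(size=n_readout_genes)   # expression change under knockout A
effect_b = rng.normal(size=n_readout_genes)   # expression change under knockout B

# Simulated observation: mostly additive plus a small non-additive component.
observed_ab = effect_a + effect_b + 0.2 * rng.normal(size=n_readout_genes)

predicted_ab = effect_a + effect_b            # the additive, Salt-style prediction

pearson = np.corrcoef(predicted_ab, observed_ab)[0, 1]
rmse = np.sqrt(np.mean((predicted_ab - observed_ab) ** 2))
print(f"Pearson r = {pearson:.3f}, RMSE = {rmse:.3f}")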
Title: (Oral Paper) A mechanistically interpretable neural-network architecture for discovery of regulatory genomics
Presenter: Alex M Tseng
Authors
Alex M Tseng, Gökcen Eraslan, Nathaniel Lee Diamant, Tommaso Biancalani, Gabriele Scalia
Abstract
Deep neural networks have shown unparalleled success in mapping genomic DNA sequences to associated readouts such as protein–DNA binding. Beyond prediction, the goal of these networks is to then learn the underlying motifs (and their syntax) which drive genome regulation. Traditionally, this has been done by applying fragile and computationally expensive post-hoc analysis pipelines to trained models. Instead, we propose an entirely alternative method for learning motif biology from neural networks. We designed a mechanistically interpretable neural-network architecture for regulatory genomics, where motifs and their syntax are directly encoded and readable from the learned weights and activations, thus eliminating the need for post-hoc pipelines. Our model is also more robust to variable sequence contexts and against adversarial attacks, while attaining predictive performance comparable to its traditional counterparts.
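For contrast with the proposed architecture, the generic model below shows the conventional setup the abstract argues against relying on: a convolutional network over one-hot DNA whose first-layer filters resemble position weight matrices but normally require fragile post-hoc pipelines to interpret. This is an illustrative baseline, not the paper's architecture.

# Generic illustration (not the paper's architecture): in a conventional
# convolutional model over one-hot DNA, each first-layer filter is a
# (4 x width) weight matrix that can be read as a candidate motif, but post-hoc
# normalization and attribution are usually needed to interpret it. The
# proposed architecture instead makes motifs and their syntax directly
# readable from the learned weights and activations.
import torch
import torch.nn as nn

class TinyDNAModel(nn.Module):
    def __init__(self, n_filters=8, motif_width=10):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel_size=motif_width)  # filters ~ motifs
        self.head = nn.Linear(n_filters, 1)

    def forward(self, x):                       # x: (batch, 4, seq_len), one-hot DNA
        h = torch.relu(self.conv(x))            # motif match scores along the sequence
        pooled = h.max(dim=-1).values           # strongest match per filter
        return self.head(pooled)                # predicted readout, e.g. binding signal

model = TinyDNAModel()
x = torch.zeros(2, 4, 200)
x[:, 0, :] = 1.0                                # dummy all-A sequences
print(model(x).shape)                           # torch.Size([2, 1])

# "Reading" filter 0 as a motif-like weight matrix over A, C, G, T positions:
print(model.conv.weight[0].shape)               # torch.Size([4, 10])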
Title: (Invited Speaker) Evo: Long-context modeling from molecular to genome scale
Presenter: Brian Hie
Bio
https://brianhie.com
Abstract
The genome is a sequence that completely encodes the DNA, RNA, and proteins that orchestrate the function of a whole organism. Advances in machine learning combined with massive datasets of whole genomes could enable a biological foundation model that accelerates the mechanistic understanding and generative design of complex molecular interactions. We report Evo, a genomic foundation model that enables prediction and generation tasks from the molecular to genome scale. Using an architecture based on advances in deep signal processing, we scale Evo to 7 billion parameters with a context length of 131 kilobases (kb) at single-nucleotide, byte resolution. Trained on whole prokaryotic genomes, Evo can generalize across the three fundamental modalities of the central dogma of molecular biology to perform zero-shot function prediction that is competitive with, or outperforms, leading domain-specific language models. Evo also excels at multielement generation tasks, which we demonstrate by generating synthetic CRISPR-Cas molecular complexes and entire transposable systems for the first time. Using information learned over whole genomes, Evo can also predict gene essentiality at nucleotide resolution and can generate coding-rich sequences up to 650 kb in length, orders of magnitude longer than previous methods. Advances in multi-modal and multi-scale learning with Evo provide a promising path toward improving our understanding and control of biology across multiple levels of complexity.
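The phrase "single-nucleotide, byte resolution" means each base is its own token, so a 131 kb context corresponds to roughly 131,000 tokens. A trivial sketch of such a tokenization follows; the mapping is illustrative, not Evo's actual tokenizer.

# Trivial sketch of single-nucleotide (byte-level) tokenization: each base maps
# to one token, so a 131 kb context corresponds to ~131,000 tokens.
# The mapping below is illustrative, not Evo's actual tokenizer.
seq = "ATGCGTACGTTAG"
tokens = list(seq.encode("ascii"))   # one byte (token) per nucleotide
print(tokens[:8], len(tokens))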
Time | Title | Presenter
09:00 - 09:15 | Opening Remarks |
09:15 - 09:50 | (Invited Speaker) Functional Causal Bayesian Optimization and DiscoGen for Learning Optimal Interventions and Inferring Gene Regulatory Networks | Silvia Chiappa
09:50 - 10:00 | Coffee Break |
10:00 - 10:35 | (Invited Speaker) Leveraging (natural) language models for biology | James Y Zou
10:40 - 11:00 | (Oral Paper) DNA-DIFFUSION: Leveraging generative models for controlling chromatin accessibility and gene expression via synthetic regulatory elements | Luca Pinello
11:05 - 11:25 | (Oral Paper) Dirichlet flow matching with applications to DNA sequence design |