Scientists usually study the molecular machinery that controls gene expression from the perspective of a linear, one-dimensional genome, even though DNA and its bound proteins function in three dimensions (3D). To better understand how key components of this machinery, such as super-enhancers, regulate genes in this 3D reality, scientists at St. Jude Children’s Research Hospital developed a new algorithm called BOUQUET. Using machine learning, BOUQUET reveals that sets of genes and their regulatory elements can interact within protein condensates, which are high-density, membraneless droplets in cell nuclei. The findings, which provide new insight into how cells regulate the genes that control their specialized identities, were published today in Nucleic Acids Research.
Cells express certain sets of genes to carry out specific functions; for example, a blood cell and a brain cell express different context-specific genes. The human genome contains about 3 billion base pairs of DNA, and the genes involved in cell identity are scattered throughout. Even more challenging, enhancers, the DNA elements that activate gene expression, can be thousands of DNA bases away from their target genes. Scientists led by Brian Abraham, PhD, St. Jude Department of Computational Biology, recognized how difficult it is to find the full set of enhancers and accompanying proteins relevant to each gene’s expression across these large distances. To address this issue, they created BOUQUET, which considers 3D enhancer architecture in a machine learning-based graph theory framework. Using this approach, researchers can identify which genes may be located inside transcriptional protein condensates.
“With BOUQUET, we can quantify the activating protein apparatus that is associated with each gene,” said Abraham, who is the corresponding author of the study. “This assignment gave us two major advances: predicting gene expression from protein binding maps and finding which genes are likely interacting with transcriptional condensates.”
Mapping controllers of cell identity
Enhancers activate gene expression by binding specific proteins and contacting target genes. Abraham’s previous work showed that sets of enhancers, called “super-enhancers,” lie linearly close to genes encoding proteins with outsized roles in cell identity, such as regulators of differentiation or proteins that enable cells to carry out identity-specific tasks.
“The idea that linear groups of enhancers, super-enhancers, play big roles in controlling cell identity has helped scientists understand many disease processes, but it’s been known for years that enhancers operate in 3D, so we sought to marry these two concepts,” added co-first author Kelsey Maher, PhD, Department of Computational Biology. “The data that measure these 3D interactions are complicated and noisy, so we had to use more sophisticated methods to find groups of enhancers and their target genes; that’s how we ended up using graph theory and machine learning to take in the whole network context and learn enhancer communities.”
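To make the idea of grouping enhancers and genes by their 3D contacts more concrete, here is a minimal Python sketch. It is not BOUQUET's method, which is not detailed here: it swaps in a much simpler technique, dropping weak contacts below a cutoff and then taking the connected components of the remaining graph. The node names, contact scores, `min_score` cutoff, and the `find_communities` function are all hypothetical and invented for illustration.

```python
# Toy illustration only: thresholded connected components, NOT BOUQUET's algorithm.
# All names and contact scores below are invented for this example.
contacts = [
    ("enhancer_1", "gene_A", 0.9),
    ("enhancer_2", "gene_A", 0.8),
    ("enhancer_1", "enhancer_2", 0.7),
    ("enhancer_3", "gene_B", 0.85),
    ("enhancer_3", "gene_A", 0.1),  # weak contact, treated here as noise
]


def find_communities(contacts, min_score=0.5):
    """Group nodes linked by contacts scoring at least min_score (hypothetical cutoff)."""
    graph = {}
    for a, b, score in contacts:
        # Register both nodes even if this particular contact is filtered out.
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= min_score:
            graph[a].add(b)
            graph[b].add(a)

    seen = set()
    communities = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from this node.
        members = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            members.add(node)
            stack.extend(graph[node] - seen)
        communities.append(members)
    return communities


communities = find_communities(contacts)
for members in communities:
    print(sorted(members))
```

With the weak contact filtered out, the toy graph splits into two groups, one around each gene. Real 3D contact data are far noisier, which is why the researchers describe turning to more sophisticated graph theory and machine learning methods.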
While others have successfully grouped enhancers, the Abraham lab went one step further by incorporating protein binding maps. “It’s been presupposed that the amount of activating protein we can link with a certain gene should track with that gene’s expression, but finding a correlation like this is tricky without knowing which genome regions are important for each gene’s expression,” Abraham added. To the team’s knowledge, it is the first to show that enhancer/protein binding patterns do in fact quantitatively correlate with gene expression.
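The kind of comparison described above can be sketched in a few lines of Python, assuming each gene has already been assigned a set of enhancers. The sketch sums a protein binding signal over each gene's enhancers and computes a Pearson correlation with expression. Every name and number here (`gene_enhancers`, `binding_signal`, `expression`, and their values) is made up for illustration, and Pearson correlation is an assumption rather than the statistic used in the study.

```python
import math

# Hypothetical inputs, invented for illustration; not data from the study.
gene_enhancers = {
    "gene_A": ["e1", "e2"],
    "gene_B": ["e3"],
    "gene_C": ["e4", "e5", "e6"],
}
binding_signal = {"e1": 4.0, "e2": 6.0, "e3": 2.0, "e4": 5.0, "e5": 5.0, "e6": 10.0}
expression = {"gene_A": 45.0, "gene_B": 12.0, "gene_C": 95.0}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Total activating-protein signal linked to each gene through its assigned enhancers.
total_signal = {
    gene: sum(binding_signal[e] for e in enhancers)
    for gene, enhancers in gene_enhancers.items()
}

genes = sorted(gene_enhancers)
r = pearson([total_signal[g] for g in genes], [expression[g] for g in genes])
print(f"Pearson r between linked protein signal and expression: {r:.3f}")
```

The sum is only meaningful if the enhancer-to-gene assignment is right, which is the point Abraham makes: without knowing which genome regions matter for each gene, there is nothing reliable to add up.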
