In 2014, Jim Hughes, James Davies and their team at the MRC Weatherall Institute of Molecular Medicine at the University of Oxford first published the Capture-C method, a radically higher-resolution way to map the human genome in 3D. Building on that invention, Oxford Science Enterprises invested in 2019 in Nucleome Therapeutics, a company formed from the University of Oxford to develop the technology. In this blog, Oxford Science Enterprises Life Sciences Principal Lachlan MacKinnon describes the importance of mapping the 3D genome and the context for our investment.
“In another fifteen or twenty years you will see a complete transformation in therapeutic medicine,” remarked Francis Collins, then head of the Human Genome Project (HGP), in 2000. Spanning five nations and thirteen years, from 1990 to 2003, the HGP remains the largest biological research project ever conducted. It would, as President Bill Clinton noted in the same year, “revolutionize the diagnosis, prevention and treatment of most, if not all, human diseases.”
It was challenging: after the DNA double helix structure was discovered by James Watson and Francis Crick in 1953, it took nearly fifty years to sequence the entire human genome. As understanding of the significance of DNA’s role in disease grew, the need to know the full sequence of the three billion bases — the building blocks A, C, T and G — became an imperative. The race was well documented. The promise: to finally understand the true genetic origins of disease, of ageing, of the staggering complexity around us — to crack the code of life. By understanding the underlying code, biology would become tractable, disease systemisable and treatments obvious.
We knew that DNA encoded proteins, and that proteins carry out functions in the body. The logic followed that by understanding the sequence of all the proteins, we’d have a complete playbook for interpreting biology. Disease driven by errors in proteins would show up as errors in the genome.
So it came as an ugly surprise, as the draft human genome was assembled in 2001, that protein coding regions represented just 1.5% of the three billion bases. The rest was swathes of genetic ‘white space’. At the time, it was assumed to be junk — a genetic wasteland, an evolutionary afterthought with little relevance in driving the biology of the body or of disease.
We pressed ahead, spending billions sequencing individuals with particular diseases in order to determine the genetic origin of each. Genome-wide association studies (GWAS) produced long lists of so-called common variants, or ‘single nucleotide polymorphisms’ (SNPs): single-base changes that differ from one individual to another and can be shown to contribute to disease. Name the disease and there are publicly available databases describing the relevant SNPs, often linking hundreds of thousands of patients to their genetic commonalities.
Then came the second surprise: 95% of the SNPs were located outside protein coding regions, buried in the genetic white space. The majority of the disease signal was lurking in this dark region, not in the protein coding regions as we had previously thought. That said, in the 5% of instances where a SNP occurred in a protein coding region, biology became delightfully tractable. A substantial list of successful drugs and the genetics that they hit has been compiled. The GWAS method has worked in the instances where the genetics has aligned with something we already understood.
But 95% remained unsolved.
The human body contains a complete copy of the entire genome inside every cell (apart from red blood cells, which discard their genome once mature in order to make room for more oxygen). Three billion bases are present in every one of the thirty-seven trillion cells in the body. Remarkably, each cell, regardless of its role, contains a full copy of the instructions for making an entire person. So how does the same set of instructions get read out in such an enormous variety of different ways to produce a skin cell, or a white blood cell, or a strand of hair?
Though the underlying code is identical, the way the genome is packaged in three-dimensional space differs between cell types and is consistent within each one. Look at a muscle cell genome and you find one genome shape; look at a kidney cell and you will find another. The 3D structure of the genome essentially determines which proteins are ‘on’ (i.e. produced) or ‘off’ and silent on a cell-type-specific basis. Only a subset of all proteins is used in any given cell type.
What we find is that the non-protein coding regions (the 98.5%, or the “regulatory genome”) regulate the protein coding regions. Imagine the genome as a tangled ball of string in the nucleus of each cell. Contacts between the DNA strings represent regulatory processes. For a protein coding region to be activated it needs to physically touch one or several sections of the regulatory genome in 3D space, but the two points of contact can be millions of bases away from each other in terms of the linear sequence. What lies outside the protein coding regions is actually a vast, combinatorial regulatory network that enables the relatively few proteins (around 20,000 in human biology) to be combined and activated in the right cell types at the right times.
And we now know from GWAS that the regulatory genome is where most disease processes originate. Disease is far more often caused by aberrant regulation of proteins than by errors in the proteins themselves.
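To make the ball-of-string picture concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the gene and enhancer names, their positions, and the contact lists are invented, and the rule "a gene is on if it touches at least one regulatory element" is a deliberate simplification of the necessary condition described above, not a model of real biology.

```python
# Toy model only: all names, positions and contacts are hypothetical.
from typing import Dict, List, Set, Tuple

# Linear positions (in bases) of one gene and two regulatory elements
# on the same chromosome.
POSITIONS: Dict[str, int] = {
    "gene_A": 5_000_000,
    "enhancer_1": 5_050_000,
    "enhancer_2": 6_200_000,
}

# 3D contacts per cell type: (gene, regulatory element) pairs that touch.
# The underlying sequence is the same; only the folding differs.
CONTACTS: Dict[str, Set[Tuple[str, str]]] = {
    "muscle": {("gene_A", "enhancer_2")},
    "kidney": set(),
}


def is_on(gene: str, cell_type: str) -> bool:
    """Simplified rule: a gene is 'on' if it touches any regulatory element."""
    return any(g == gene for g, _ in CONTACTS[cell_type])


def contact_spans(cell_type: str) -> List[int]:
    """Linear distance, in bases, covered by each contact in a cell type."""
    return sorted(
        abs(POSITIONS[g] - POSITIONS[e]) for g, e in CONTACTS[cell_type]
    )


if __name__ == "__main__":
    for cell_type in CONTACTS:
        print(cell_type, is_on("gene_A", cell_type), contact_spans(cell_type))
```

In this toy setup the same gene is active in one cell type and silent in the other, and the one contact that activates it spans more than a million bases of linear sequence, skipping over a much nearer element.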
Seeing the genome
Much like the genome, proteins also comprise a linear sequence, this time of amino acids, wrapped up into a three-dimensional structure that determines the function of each. The shapes of proteins deliver exquisite specificity of function, and single mutations can make or break activity. However, relative to the genome, determining protein structure is a far simpler task. The average protein has around five hundred amino acids in a chain which folds up into a fairly rigid structure; companies like DeepMind believe that just with knowledge of the sequence and their AlphaFold algorithm, it should be possible to compute the resulting 3D structure of a protein.
The human genome on the other hand has three billion independent units. The number of feasible 3D permutations puts computational structure determination outside of the realms of the possible, at least for now, so we’re left examining biophysical techniques that directly measure the structure.
Most protein structures have been determined through crystallography, relying on the property of somewhat rigid proteins to repetitively array themselves under certain conditions to form a crystal. The diffraction of X-rays through the crystal can then be used to back-calculate the position of all the atoms. But the genome could never be crystallised: it is far too flexible to stack up neatly in a regular way. Neither can it be visualised with a light microscope, being far smaller than the wavelength of light itself. So it sits in an awkward space: too complex to be modelled using computational methods, too floppy to be crystallised, too small to be visualised.
Looking at a genetic study as a one-dimensional sequence of letters is like taking a video, lining up all the pixels from every frame in a single row and trying to figure out what the movie was. To truly understand the genome, we need to view it as a three-dimensional object: the nucleome.
Some companies are already developing new technologies for doing just that. Nucleome Therapeutics, for example, are able to measure the 3D structure of the genome in a startling level of detail. Nucleome believes that this process will unlock the true promise of the original Human Genome Project and make actionable the genetic data produced since. The function of the 95% of single nucleotide polymorphisms that sit in the regulatory genome will become tractable, directly as a result of examining the protein coding regions they touch. Projected into three-dimensional space, GWAS results stop being confusing and instead identify the proteins involved in a genetically driven disease process.
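The idea of reading a regulatory SNP through the protein coding regions it touches can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Nucleome's method: the regions, contacts, SNP identifiers and the `genes_for_snps` helper are all hypothetical, and a real 3D genome map would be far larger and noisier.

```python
# Illustrative sketch only: every identifier and coordinate is hypothetical.
from typing import Dict, List, Tuple

# Regulatory regions on one chromosome: name -> (start, end) in bases.
REGIONS: Dict[str, Tuple[int, int]] = {
    "reg_1": (1_000, 2_000),
    "reg_2": (9_000, 9_500),
}

# Contacts taken from an assumed 3D genome map:
# regulatory region -> protein coding genes it physically touches.
REGION_TO_GENES: Dict[str, List[str]] = {
    "reg_1": ["gene_X"],
    "reg_2": ["gene_X", "gene_Y"],
}

# Disease-associated SNPs from an assumed GWAS: id -> position in bases.
SNPS: Dict[str, int] = {"snp_a": 1_500, "snp_b": 9_200, "snp_c": 50_000}


def genes_for_snps(
    snps: Dict[str, int],
    regions: Dict[str, Tuple[int, int]],
    region_to_genes: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Link each SNP to the genes touched by the region it falls in."""
    result: Dict[str, List[str]] = {}
    for snp_id, position in snps.items():
        genes = set()
        for name, (start, end) in regions.items():
            if start <= position <= end:
                genes.update(region_to_genes.get(name, []))
        result[snp_id] = sorted(genes)
    return result


if __name__ == "__main__":
    for snp_id, genes in genes_for_snps(SNPS, REGIONS, REGION_TO_GENES).items():
        print(snp_id, genes or "no contact known")
```

A SNP that falls outside every mapped region, like `snp_c` here, stays unexplained, which is why the resolution and coverage of the underlying 3D map matter so much.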
For over thirty years we’ve been trying to make sense of a treasure trove of data from genetic studies. Scientists at the Human Genome Project and beyond did groundbreaking work in mapping the linear sequence, opening the door for a new generation of discoveries about who we are and how our bodies function. We’re closer to cracking the genetic code than ever before, but one-dimensional thinking will never be enough. That’s because 3D puzzles need 3D solutions.
Lachlan MacKinnon is an Investment Principal at Oxford Science Enterprises, focussing on investing in products built on enabling technologies in the life sciences. He has led investments in OMass Therapeutics, ONI, Spybiotech, Base Genomics and Nucleome Therapeutics.