23 & Who?: Deep Research into our Genetic Code
How can machine learning answer the centuries-old question of who we really are?
Advances in genomics research have ushered in a new era of analysis into our physiological and behavioral traits, uncovering insights about our identity and its determinants. This research links our phenotype (observable characteristics) with our genotype (genetic composition) in hopes of identifying common attributes that allow us to better understand and predict differences in our personality and health. The human genome, however, is remarkably large – it contains three billion base pairs of genetic material, which if written out, would fill over 200 New York City telephone books (averaging 1000 pages each) 1. It is also remarkably complex: there are roughly ~20,000 genes and even more regions that control how these genes are expressed. Recently discovered techniques in machine learning offer a new opportunity to understand genetic differences at scale for the first time, providing companies like 23&Me the chance to drive consumer applications in precision medicine and genetic lifestyle analysis.
The Big Problem:
The Human Genome Project, completed in 2003, offered our first mapping of the human genome, identifying all of the 20,500 genes in human DNA, and determining the sequences of its 3 billion chemical base pairs 2. However, decades after its completion, scientists are still looking to establish more causal relationships between genotype and phenotype traits. Why the lacking developments? Two primary factors hinder our progress: scale and decoding complexity. To the former, small variations in genes and regulatory regions are ultimately what make each of us unique (and, unfortunately, result in disease). These small variations are very hard to detect within such a large data set. To the latter, manual genome sequencing is still very error prone and may lead to misread DNA elements.
The Cluster of Solutions:
To tackle these issues, 23&Me has used statistical techniques such as deep neural nets to understand correlations in genetics data, generate hypotheses for clinical validation, and standardize the genome reconstruction process. This is primarily done in two ways:
- Identify connections between genotypic and phenotypic patterns using unsupervised machine learning, a technique that clusters genes by their expression in cells and tissues.
- Improve sequencing methodologies by transforming the genome reconstruction problem into an image classification problem, which results in greater speed and accuracy 3.
Unsupervised machine learning clusters individuals within an unlabeled data set, indicating key traits that are consistent and relevant to the genetic factors driving our susceptibility to disease. 23&Me applies this method to ancestry data as well, mapping the locations in your genome that carry small amounts of information about your family’s geographic history.
Accordingly, their use of machine learning to improve sequencing methodologies has enabled improved accuracy and innovation. The genome sequencing process begins with high-throughput sequencing (HTS), a method that produces ~1 billion short sequences of bases, but are unfortunately not organized into a human-recognizable genome sequence 4. Deep learning methods facilitate a true genome sequence (guanine, cytosine, adenine and thymine organized into 23 pairs of chromosomes) from HTS sequencer data with significantly greater accuracy than previous classical methods 5.
Our Genetic Future:
Genomic applications in machine learning have enabled a wave of research and consumer innovation towards understanding how our DNA influences who we are. However, genetic makeup is only one piece of our map of human physiology. The next wave of innovation will arise from combining this data with other clinical research to build a true picture of health and disease. These partnerships can uncover additional findings, including how our phenotype expression changes within varying environmental conditions (e.g. lack of sleep, high levels of exercise) through comparison with other clinical measurements.
Machine learning applications in genetics research can produce generative innovations as well. By learning the characteristics of genotypes that produce desired phenotype expressions, we may begin to edit and change our own genomes to reduce our susceptibility to disease.
Genetic Philosophy:
While ripe with promise, machine learning applications in genetics carry deep, philosophical questions about the nature of privacy, data sharing, and the human condition. Critics push back on data sharing and commercial opportunities – should we allow 23&Me to negotiate contractual agreements to share data with 3rd parties? And should they be able to profit from their genetic analysis 6?
Perhaps more broadly, these applications provoke questions about what makes us human. If we are able to sequence and modify our own genome to optimize for desired traits, should we? Are we sacrificing our morality to improve our material wealth? Genomics research can answer many questions about our physiology, but these questions will require a deeper philosophical reflection into the essence of our humanity and values.
(798 words)
References:
[1] “NOVA Online | Cracking The Code Of Life | Genome Facts”. 2018. Pbs.Org. https://www.pbs.org/wgbh/nova/genome/facts.html.
[2] “Human Genome Project Information”. 2018. Web.Ornl.Gov. https://web.ornl.gov/sci/techresources/Human_Genome/index.shtml.
[3] 2018. Biorxiv.Org. https://www.biorxiv.org/content/biorxiv/early/2016/12/14/092890.full.pdf.
[4] Thangaraj, Andrew. 2018. “Evaluating Deepvariant: A New Deep Learning Variant Caller From The Google Brain Team”. Inside Dnanexus. https://blog.dnanexus.com/2017-12-05-evaluating-deepvariant-googles-machine-learning-variant-caller/.
[5] 2018. Permalinks.23Andme.Com. https://permalinks.23andme.com/pdf/23_17-GeneticWeight_Feb2017.pdf.
[6] Brodwin, Erin. 2018. “DNA-Testing Company 23Andme Has Signed A $300 Million Deal With A Drug Giant. Here’s How To Delete Your Data If That Freaks You Out.”. Business Insider. https://www.businessinsider.com/dna-testing-delete-your-data-23andme-ancestry-2018-7.
A great distillation and highlight of key issues in the field. Is it ok for 23&me to commercialize their data? Absolutely, but with the caveat that patients who have submitted their samples consent to this commercialization, and in cases where it is warranted, these patients should be compensated.
There are a number of academic medical institutions that have already generated large “Biobanks” wherein patients can consent to having their blood and tissue samples de-identified, but through which academic researchers can then access medical records and genotype data to build correlations between genes and disease. The real trick for 23&me, in trying to commercialize an opportunity like that, will be to identify ways to access medical records or other phenotypic data that will allow them to start building these correlations. In all of these cases the more data the better, but there is also real risk here. In many instances, just because a patient has a particular mutation does not guarantee that a disease will ever manifest in that patient. If someone then makes drastic changes to their lifestyle or chooses an aggressive treatments for a disease that may never come to pass, it may do more harm than good. Like any other genetic sequencing and analysis services, responsible disclosure is paramount.
“but with the caveat that patients who have submitted their samples consent to this commercialization, and in cases where it is warranted, these patients should be compensated.”
Is it realistic to assume that patients really understand what they are consenting to?
The philosophical questions you introduced about privacy and data sharing are certainly worthy of debate. The fact that 23&Me is able to share data with 3rd parties is incredibly concerning. For example, law enforcement agencies recently used data from an online genealogical site as a source to compare DNA samples during the investigation of a crime. 23&Me and Ancestry.com have both said that they do not work directly with law enforcement, but this could change in the future. I think the potential for a positive impact to solve previously dead-end cases is enormous and should be explored. However, the machine learning innovations in predictive policing in combination with access to genetic data from genealogical sites is worthy of deep concern from a privacy and civil liberties perspective. In either case, I am very interested in following this trend in the future. Thanks for writing about it!
Fantastic synthesis of the technical and philosophical components of genomics research. I believe it’s entirely possible our (great-)grandchildren look at sequencing as we do indoor plumbing: fundamental to basic health yet unbeknownst to our less civilized ancestors. While very optimistic for its application, I am curious about the distribution of this newfound power. Will genomic sequencing be like the internet, a democratizing force that lifts all (if not most) boats? Or does the potential to optimize for certain traits and eliminate diseases drive a greater wedge between those with money to burn vs. everyone else? And in the case of the latter, what role should government play (if any), when there’s too much power concentrated in the hands of a few companies?
Thanks very much for the fascinating contribution, John. I’m curious as to whether you believe it will be possible in the near-to-medium future for an individual to have any reasonable expectation of genetic privacy, beyond even whether they should have such an expectation. With recent developments using genetic databases, and advancements in techniques allowing researchers (or law enforcement) to identify individuals many genetic steps removed from the matched DNA, my intuition is telling me that privacy is a fast-fading dream. This is before we even consider malicious data extraction or inappropriate data usage from these databases. Should we be worried? or should we simply accept that genetic privacy was a relic of the pre-genetic age and attempt to ameliorate the worst impacts of such a new world?
To me the question of 23&Me’s profit over customer data is more intriguing than the question of privacy. While companies’ profits used to be a function of their employees, it is now shifting to be driven by their customers (and the data they provide) — what right does a customer have to this data?
To the question of our humanity, it feels challenging to draw a line in the sand now. Have we not always had an inherent drive to take certain steps to optimize that our offspring will be healthy and successful? Who’s to say that modifying or own genome is really taking that much larger of a step? On top of all of that, who’s to say we aren’t all just in a simulation right now?
I’d argue that there is at least a 1/3rd chance that we are in a simulation now 😉
Agreed on the simulation
Your questions regarding the nature of privacy and data sharing as it relates to genomics research apply to many other sectors across industries leveraging machine learning models as well. In these instances, I do think the regulatory landscape play an increasingly important role here. As data and safety becoming increasingly intertwined, protecting the user (in whatever context that means) is critical and should be a priority of government. Another thought to consider is a decentralized future in which each individual owns their own data and can choose to sell it to different entities for a specific amount under certain conditions. With the rise of emerging technologies such as artificial intelligence and blockchain, other startups have started to focus on this topic. I would be curious to see its application in genetic data-sharing.
I find the concept of gene editing to present a fascinating ethical question. Why stand in the way of progress that could save millions who suffer from genetic diseases? But where would you draw the line in gene editing? One day might we edit genes to live for hundreds of years? To create smarter or better looking children? To create children less prone to fall into the traps of our cognitive biases, or shallow and anachronistic way in which we even think about other humans as being physically attractive?
Would love to understand more about the challenges that unsupervised machine learning must overcome to actually cluster genes by their phenotypical expression – seems like there would be a ton of complexity given the way genes are inter-twined and interact with environment in their ultimate expression.
Finally (and most importantly) (1) Love that you are 1.2% Ashkenazi! (2) Your Italian DNA > Ukranian DNA… where is our Italian section flag?!?!