MLconf Online 2020
November 6, 2020, Online

Video
Developing and Delivering Personalized Polygenic Scores at Scale

About the talk

Nearly two decades have passed since the Human Genome Project was completed, fueling discoveries in genomics that have illuminated the complex story of the human past, and provided clues about the heritable origins of disease. Whereas some diseases have narrow genetic causes, the majority of common conditions – such as coronary artery disease or type 2 diabetes – have been shown to be “polygenic,” involving hundreds if not thousands of genes in addition to lifestyle and other non-genetic factors.

As we advance our mission to help people access, understand, and benefit from the human genome, 23andMe has been a leader in using machine learning to develop polygenic scores (PGS) in a direct-to-consumer model. In this talk, a scientist, an engineer, and a product manager discuss a collaborative project to scale up the infrastructure required to rapidly develop and deliver polygenic models into the consumer product at scale. This end-to-end analytic pipeline leverages 23andMe's unprecedented genetic and phenotypic database, with over twelve million genotyping kits sold and over three billion survey questions answered. In this analysis flow, we first identify links between genetic variation and diseases or traits in a genome-wide association study (GWAS). We then select sets of genetic variants linked to these outcomes as features in dozens of potential models fit in a parallelized machine learning service, promoting the most powerful models for immediate use in the consumer product. This internally developed software has drastically shortened development time (from months to days) and enabled the delivery of personalized results based on much larger models (20,000+ genetic features) to millions of customers in seconds.

The speakers will also highlight company efforts to improve genomic interpretation across a diversity of ancestral backgrounds. Unfortunately, the majority of research in genomics has included only participants of European ancestry, and some of the same genomic features that enable us to trace population migrations (and 23andMe’s popular Ancestry Composition tool) make it more difficult to generalize PGS developed in one population to another. For the first time, 23andMe’s end-to-end PGS pipeline can automatically test the performance of candidate models across multiple populations, optimizing our model selection process to ensure that we can deliver the most equitable product possible.

About speakers

James Ashenhurst
Health Product R&D Team at 23andMe
Shannon Hamilton
Product Manager at 23andMe

I completed a PhD in neuroscience, focusing on the genetics and psychopharmacology of decision-making, impulsivity, and addiction. After my first postdoc, I had the opportunity to switch to industry, where I could apply my skills to interpreting genetics through a direct-to-consumer model. I'm on the Health Product R&D team at 23andMe, where I use machine learning techniques and genomic statistics to turn research participant data into new reports and product features for customers. Because the product is, in part, a medical device, this work happens in close collaboration with regulatory, legal, and medical teams. Examples of my work in the 23andMe product are the Coronary Artery Disease, LDL Cholesterol, Atrial Fibrillation, Wake-Up Time, Motion Sickness, and Mosquito Bite Frequency polygenic reports. Most recently, I have been collaborating with the Machine Learning Engineering and Data Collections teams to help build and scale up largely automated analytic pipelines to power future product features.


Shannon Hamilton is a Product Manager at 23andMe, where she is focused on building the machine learning infrastructure that powers genetic discoveries used in in-app product features, research, and drug discovery at the company. Shannon has worked as both a product manager and a data scientist in the healthcare space for the last 8 years. Prior to 23andMe, she worked as a data scientist at a healthcare-focused NLP startup (Roam Analytics) and at an asthma digital health startup (Propeller Health). She received her Bachelor's from UC Berkeley in Public Health and her Master's from the UC Berkeley School of Information in Information Management, with a focus on machine learning.

Transcript

Moderator: Hello. Good morning, good afternoon, good evening. I'm Paul McLaughlin, and I'll be moderating the session. I'm at Ericsson, and I'm really happy to introduce James Ashenhurst and Shannon Hamilton from 23andMe. They will be giving a presentation titled "Discover, Predict, Prevent: Developing and Delivering Personalized Polygenic Scores at Scale." Just a reminder, the presentation will last until 12:05, when we take a lunch break, so I encourage the speakers to save about 5 minutes for questions. If you have any questions, please put them in the chat and I will read them out loud at the end. Thanks so much; I'll hand it over.

James Ashenhurst: Thank you. I'm going to tell you about a collaborative effort between scientists and engineers to build machine learning infrastructure that supports rapid development and enhancement of the 23andMe product. A little bit more about 23andMe: if you haven't heard of us before, our mission is to help people access, understand, and benefit from the human genome. We do this through a direct-to-consumer genotyping kit: you order the kit online, it's delivered through the mail, and you provide a saliva sample to be processed. We analyze the DNA, interpret the results, and return them to you through a web portal.

The reports we offer span from more serious health predisposition reports, which include FDA-authorized genetic health risk reports on serious topics like Alzheimer's disease and Parkinson's disease, and FDA-authorized carrier status reports, which are particularly relevant for family planning, to more educational trait reports, where we can tell you things like the likelihood, based on your genetics, that you have a unibrow, and lighter health topics in the wellness category, related to things like alcohol flush reaction or lactose intolerance. And of course there is our flagship Ancestry Composition report, where we estimate the origins of your DNA.

Here is an example of a health predisposition report. This particular one is about type 2 diabetes. What we're telling you here, based on your genetic profile, is the chance that you might develop type 2 diabetes between your current age and some future date. Importantly, this report is powered by a polygenic risk score (in this talk you'll learn where that comes from), and it's also powered by 23andMe research.

The algorithms that power these results are developed internally and based on our research platform. 23andMe actually has one of the largest genomic research platforms in the world: our customers can actively opt in to participate and answer surveys, and we can combine the survey responses with their genotypes to make discoveries of association, to try to understand some of the heritable origins of these differences. Very importantly, 23andMe research is an opt-in system where customers actively consent to participate, all of our research protocols are independently governed by an IRB with oversight and approval over our protocols and processes, and data are de-identified.

What we see this platform doing is being part of a virtuous cycle, where we create educated, engaged consumers who learn more about what their DNA can tell them about lighter topics or serious topics, want to participate in research, and provide survey answers; we make discoveries, which we then turn around and put back into the product, providing more and more value to our customers in a cycle of discovery. So far we have over 4 million data points and a very large cohort for both common and rare diseases. For example, we have well over a million people who report a history of diagnosis of major depression, and we also have more rare conditions. We are really quite well powered to make discoveries of which genes may underlie vulnerability to these conditions, and we can turn that around and, in time, provide insight back to our customers.

But in order to really understand the technical challenge we need to solve, it's important to understand the science behind some of the genetic causes of heritable disease. What we're looking at here is a commonly observed relationship between how common a genetic variant is in the population and how big an effect it has on the chances of developing a disease or outcome. On one axis is how frequently a variant is observed in the population; that might be a change from a G to a T at one specific location on a chromosome, and some of these are rare while others are common. On the other axis is the effect size: basically, how big an impact having that variant has on the likelihood of the outcome.

In the top left corner is a cluster of disease variants called "Mendelian." You might recognize the name; it refers to Gregor Mendel, the monk you probably remember from high-school biology. These are cases where single genes have a deterministic effect on the outcome, and variation in those genes causes changes in the outcome. There can be dominant or recessive patterns of inheritance, but having one or two copies of the variant means you definitely will have the outcome. An example would be variation in the CFTR gene and cystic fibrosis: having the relevant variants in one or two copies will definitely result in cystic fibrosis.

Toward the middle of the spectrum, there are variants that are more common in the population, with intermediate effects on the likelihood of the outcome. An example would be variants in the F5 gene and how they relate to hereditary thrombophilia, a blood-clotting disorder. Some of them are common (around 1% of the population has them), and they increase the chances that you might develop harmful blood clots in your lifetime, though they are not a guarantee.

In the lower category, we have common diseases that are explained not by single variants of large effect but by many, many genes of tiny effect. Common conditions like type 2 diabetes or heart disease are explained by hundreds, perhaps thousands, of genes, each of which has a tiny effect but which in aggregate give us a picture of your risk. So how can we assess the contributions of all of those individual effects and aggregate them into one number? That is what's called a polygenic risk score; the name implies multiple genes.
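To make the aggregation concrete: a polygenic score is, at its core, a weighted sum of genotype dosages. A minimal sketch in Python, with made-up weights and dosages (not values from any 23andMe model):

```python
import numpy as np

# Made-up per-variant effect sizes (betas) and one person's genotype
# dosages (0, 1, or 2 copies of each effect allele).
betas = np.array([0.12, -0.05, 0.30, 0.08])
dosages = np.array([2, 0, 1, 1])

# The polygenic score is the dot product: sum over variants of beta * dosage.
prs = float(betas @ dosages)
print(prs)  # 0.62
```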

The process of developing these at 23andMe starts with data collection and a genome-wide association study (GWAS), followed by modeling (developing candidate models) and implementing those into the product, where we can return personalized risk estimates to our customers.

We start with data collection. We have research participants who have opted in; they provide DNA and they respond to surveys on various topics. We review and quality-control the surveys to make sure each is capturing what we intended to capture, that it is capturing the disease concept that we want, and we spend a lot of time crafting survey language to make sure it's really a legitimate representation of self-reported disease. We then take those data and do a genome-wide association study, which identifies genetic variants that distinguish people who have the condition, say type 2 diabetes, from controls. The training set that we have at this point can be around 1.8 million people who answered the question and for whom we have genome-wide genotype data.

To go into a little more detail about how a GWAS is conducted: it measures the statistical evidence for the association between every single tested location on the genome and the outcome. But that doesn't capture the aggregate effects all together, so in effect the resulting summary statistics are what we use for feature selection to build the polygenic model; the GWAS itself is not the feature selection step. What goes into a GWAS? These are the bread and butter of what we do, very routine at 23andMe. The ingredients are genome-wide genotyping data for all the participants involved. The chip that we directly test participants on has around five hundred thousand to six hundred fifty thousand genotypes spanning the genome, so we have that many directly assayed data points per person. We also have the option of taking those assayed genotypes and imputing out to around 45 million genotypes across the genome based on a reference panel. That means we have between 500,000 and 45 million genotypes per person across that training set of 1.8 million individuals.

We then run a series of parallel linear models to regress the outcome as a function of each genotype, potentially controlling for other demographic factors like age or sex where those are important. We also include genome-wide principal components, which can account for population structure. In effect, this means running massively parallel computing, with up to 45 million regressions over more than a million people, so this is one step that requires a lot of computing power. Then all of the per-variant statistics are put together and are ripe for figuring out which genes might be associated with the outcome.
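As a rough illustration of the per-variant association test, here is a minimal sketch using statsmodels. It assumes a binary outcome and a handful of covariates; a real GWAS at this scale would use specialized, massively parallel tooling rather than a Python loop:

```python
import numpy as np
import statsmodels.api as sm

def test_one_variant(dosage, outcome, covariates):
    """Regress the outcome on a single variant's dosage plus covariates
    (age, sex, genome-wide principal components); return the variant's
    estimated effect (beta) and its p-value."""
    X = sm.add_constant(np.column_stack([dosage, covariates]))
    fit = sm.Logit(outcome, X).fit(disp=0)
    return fit.params[1], fit.pvalues[1]  # column 1 is the dosage term

# Toy data: 1,000 people, one variant, three covariates.
rng = np.random.default_rng(0)
dosage = rng.integers(0, 3, size=1000)
covariates = rng.normal(size=(1000, 3))
outcome = rng.integers(0, 2, size=1000)
beta, p = test_one_variant(dosage, outcome, covariates)
```

A GWAS repeats this test independently at every tested location, which is what makes the step so parallelizable.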

What we're looking at here, on the left, is one way to display the results of a GWAS, called a Manhattan plot because it can resemble a city skyline. We're looking at chromosome position, from the first chromosome up to the X chromosome (the 23rd), and on the y-axis is the negative log10 of the p-value of the beta of the association between the variation at each location across the genome and the outcome. In this example, one peak stands out as majorly driving the heritable individual differences, and it turns out that this region actually includes an enzyme that is part of caffeine metabolism, so there's a biological story there that makes a lot of sense. Other GWAS results produce Manhattan plots that really do resemble a city skyline. Here, for hypothyroidism, we're seeing hundreds, maybe thousands, of genes that have statistically significant associations with the likelihood of the outcome. This sort of result is very ripe for building a polygenic score, where we can identify all these variants, aggregate all their effects, and estimate likelihoods given an individual's genetic profile.
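A Manhattan plot is just variant position against -log10(p). A toy sketch with fabricated p-values (the 5e-8 line is the conventional genome-wide significance threshold):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
position = np.arange(10_000)            # stand-in for genomic coordinates
pvalues = rng.uniform(size=10_000)      # null variants
pvalues[5_000:5_010] = 1e-30            # one fabricated "skyscraper"

plt.scatter(position, -np.log10(pvalues), s=2)
plt.axhline(-np.log10(5e-8), color="red", linestyle="--")
plt.xlabel("position along the genome")
plt.ylabel("-log10(p)")
plt.show()
```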

We do this by aggregating, as I said, the effects of the individual variants into the PRS. Plainly, this is jointly modeling all of the different variants that we think might be involved, with logistic regression if it's a binary outcome. We might also have age or sex terms, or maybe an age-squared term if there seems to be a nonlinear relationship with age, and all of these individual effects are summed together into a single score. Typically the models we're developing these days include around 20,000 to 25,000 genetic features.
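A minimal sketch of that joint model, using scikit-learn's logistic regression on toy data (a real model of this size would typically be fit with regularization in a parallelized service, as described above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_people, n_variants = 5_000, 200
X = np.column_stack([
    rng.integers(0, 3, size=(n_people, n_variants)),  # variant dosages
    rng.normal(50, 15, size=(n_people, 1)),           # age term
    rng.integers(0, 2, size=(n_people, 1)),           # sex term
])
y = rng.integers(0, 2, size=n_people)                 # toy case/control labels

# All variants and covariates are modeled jointly; the fitted linear
# predictor is the raw polygenic score for each person.
model = LogisticRegression(max_iter=1000).fit(X, y)
raw_scores = model.decision_function(X)
```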

But an important question is: how do we choose which of these variants from the GWAS should be included in a model? This is the feature selection problem, and the way we think about it is as balancing the signal-to-noise ratio. We want to maximally capture all the predictive power available from everything we've seen in the GWAS, but there's also a correlational structure across the genome called linkage disequilibrium, because variants that are physically closer to each other on the genome tend to be inherited together. That means that variants next to each other tend to be highly correlated, so if one is associated with the outcome, its neighbors are likely to be as well; we balance the signal-to-noise and reduce our feature set so that we're really maximizing predictive power. There are a couple of parameters that we can permute when nominating SNPs (genotyped or imputed) to be included as genetic features in the model: the maximum p-value that we might consider for inclusion, and the minimum distance that we might allow between two variants for both of them to potentially be included.
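A sketch of that nomination step as a greedy distance-based pruning, with both knobs exposed (the talk names the two hyperparameters but not 23andMe's exact procedure, so this is illustrative):

```python
def nominate_features(variants, max_p=5e-8, min_dist=250_000):
    """Keep the most significant variants first, skipping any that fail the
    p-value cap or sit within `min_dist` bases of one already kept (a crude
    stand-in for linkage-disequilibrium pruning). `variants` holds
    (variant_id, position, p_value) tuples for one chromosome."""
    kept = []
    for vid, pos, p in sorted(variants, key=lambda v: v[2]):
        if p > max_p:
            break  # sorted by p-value, so all remaining variants fail too
        if all(abs(pos - kept_pos) >= min_dist for _, kept_pos, _ in kept):
            kept.append((vid, pos, p))
    return kept

# Toy summary statistics: (variant_id, position, p_value).
hits = [("rs1", 1_000, 1e-9), ("rs2", 50_000, 1e-8),
        ("rs3", 900_000, 1e-7), ("rs4", 2_000_000, 1e-3)]

# Sweeping the two knobs produces the dozens of candidate feature sets.
feature_sets = {(mp, d): nominate_features(hits, mp, d)
                for mp in (5e-8, 1e-6, 1e-4)
                for d in (100_000, 250_000, 500_000)}
```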

We vary a bunch of those hyperparameters, which might give us a dozen to maybe a hundred different feature sets. We then fit a model for each using the training set from the GWAS, and we test these candidate models to see which performs best in a validation set, typically around a thousand people at this point, in terms of metrics like the area under the receiver operating characteristic curve (AUC) if it's a binary outcome. That way we can pick the specific feature set that works best and promote it for use within the product. In the final step, to do the final assessment of how well the model works, we have a held-out test set, usually about equivalent in size to the validation set, that was not part of any of the other stages: not part of the GWAS, not part of the model training, not part of the model selection. That way the assessment is less biased and less prone to overfitting.
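The selection step then reduces to fitting each candidate and keeping the one with the best validation AUC. A hedged sketch, where the splits and feature sets are placeholders for the ones described above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def select_best(feature_sets, X_train, y_train, X_valid, y_valid):
    """Fit one model per candidate feature set (column indices) and promote
    whichever achieves the highest validation AUC."""
    best = (None, None, -1.0)  # (model, columns, auc)
    for cols in feature_sets:
        model = LogisticRegression(max_iter=1000).fit(X_train[:, cols], y_train)
        auc = roc_auc_score(y_valid, model.decision_function(X_valid[:, cols]))
        if auc > best[2]:
            best = (model, cols, auc)
    return best

# The number reported publicly should come from the held-out test set, which
# played no part in the GWAS, training, or selection:
# roc_auc_score(y_test, best_model.decision_function(X_test[:, best_cols]))
```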

The statistics derived from our assessment in the test set are what we report in any white papers about the performance of these models and in any customer report information we provide in the product. Finally, these polygenic scores are interpreted into the consumer-facing report, so we can present epidemiological statistics such as: people at a given percentile of the polygenic score distribution have an X percent likelihood of developing type 2 diabetes by a given age. That is the result you see.
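A toy version of that interpretation step, mapping a raw score to a percentile in a reference distribution and then to an absolute-risk estimate (the risk numbers here are invented; real values are derived epidemiologically):

```python
import numpy as np

reference_scores = np.random.default_rng(3).normal(size=100_000)  # toy reference
risk_by_quintile = [0.05, 0.08, 0.11, 0.16, 0.25]  # invented lookup table

def interpret(raw_score):
    percentile = (reference_scores < raw_score).mean() * 100
    quintile = min(int(percentile // 20), 4)
    return percentile, risk_by_quintile[quintile]

percentile, risk = interpret(1.3)
print(f"score above {percentile:.0f}% of people; estimated risk {risk:.0%}")
```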

But another important point I'd like to make is that 23andMe believes it's important that everyone benefit from genomic research, and we strive to make polygenic scores that work for everyone. Unfortunately, genomic research has had a diversity problem. As of 2016, when a study of this came out, the majority of the data that go into genomic research studies are from people of European ancestry; in 2016, about 80% of study participants were of European descent. We want to ensure that people have the opportunity to participate in research, that they reap the benefits, and that the models all work well for them.

This is particularly a problem in the context of polygenic scores, because models that are trained in one population sometimes do not generalize well to others. Say you go through that entire flow of identifying a cohort, doing the GWAS, training multiple candidate models in parallel, and promoting a model, and only European individuals were involved in that entire process. If we then test how that model works in, say, an African American population, it might not perform as well. And why is that the case? Well, that correlational structure I was talking about across the genome means that the models we are producing likely do not contain only causal variants. A substitution of an A to a G at a particular locus might not actually change the function of the protein that results from that gene; it might not actually change the expression of anything. What it is doing is tagging a nearby marker that is actually the causal variant. Given this correlational structure, a variant can tag a couple of others, and these correlational structures tend to vary between populations. This very rich diversity that we see across populations is what we use for things like Ancestry Composition, making estimates of where in the world different segments of your DNA originated, but that same kind of variation confounds and can obscure our ability to translate polygenic risk estimation from one group to another.

So what we're testing now at 23andMe is automated population optimization at every step possible. This means considering, if there is sufficient sample size, performing the GWAS in multiple populations (say, if there are about five thousand cases, that tends to be enough) and combining the results of those multiple GWAS; summary statistics derived from a meta-analysis tend to nominate more robust variants that work better in various populations. Additionally, we can use trans-population cohorts in training, and we can optimize our feature sets in population-specific validation sets, to make sure that any candidate model we might try will be tested in every population that might receive it. We can optimize which specific feature set goes to which particular population, and recalibrate per population to increase accuracy.
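One standard way to combine per-population GWAS results is fixed-effect inverse-variance-weighted meta-analysis; the talk doesn't specify 23andMe's exact method, so this is a generic sketch:

```python
import numpy as np

def ivw_meta(betas, standard_errors):
    """Combine one variant's effect estimates from GWAS run in separate
    populations, weighting each by the inverse of its variance."""
    betas = np.asarray(betas, dtype=float)
    weights = 1.0 / np.asarray(standard_errors, dtype=float) ** 2
    combined_beta = (weights * betas).sum() / weights.sum()
    combined_se = np.sqrt(1.0 / weights.sum())
    return combined_beta, combined_se

# One variant estimated in three populations (made-up numbers):
beta, se = ivw_meta([0.21, 0.18, 0.25], [0.03, 0.05, 0.08])
```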

So now that I've told you about the scientific challenges, I'm going to hand off to my colleague Shannon Hamilton, who will tell you about the engineering challenges we faced in developing at scale.

Shannon Hamilton: Thanks, James. Can you hear me okay? I'm Shannon Hamilton, a product manager at 23andMe, and I'm excited to talk to you about PRS Machine, the infrastructure we built to be able to deliver these models that James has been describing to our customers. Next slide. With PRS models as a real cornerstone of our product, our scientists and engineers needed a better, faster, and more scalable way to easily train, validate, and deploy our models to customers, so we built a new machine learning service that we've lovingly named PRS Machine.

There were three main goals for this work. First, from the company and product perspective: 23andMe had just launched a new subscription product, and we wanted to be able to launch new PRS-powered health reports more frequently for our subscription customers. Second, from an infrastructure point of view, we wanted to be able to build, train, and ship models in the same environment. The main goal there is to ensure that there are no complex handoffs between scientists and engineers, and to remove the need for the time-intensive model validation that's required when training a model in research and then rebuilding it in production. We wanted to be sure that the exact model object generated in training is used for actual production of customer results; being able to do this would effectively shorten the development window for a PRS model down from months to hours, which would be huge. Third, we wanted to be able to compute our customer results on the fly. This is important to us so that we could avoid unnecessary storage costs for large amounts of data, remove the need to precompute any of our results, and ensure GDPR and PCI compliance as well.

So now that I've talked a little bit about the why, I want to jump into the how, explaining three ways that we believe PRS Machine improves our development workflow. The first point is that we believe PRS Machine centralizes and standardizes model tracking for our scientists and engineers.

The way we do this is via an open-source tool, MLflow, from Databricks. We all know (I'm sure there are many data scientists in the room) that experimentation is key to the data science workflow: we want to try different feature sets, model parameters, packages, and covariates, with the end goal of achieving the best model result. MLflow allows our scientists to track and store their models, artifacts, and metrics in a transparent and organized way. Scientists like James organize their modeling work into experiments and runs, and the model objects that are generated are stored and referenceable, both for promotion and for serving results to customers. This avoids the headache of keeping track of Jupyter notebooks, and it promotes transparency across the organization.
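The experiment/run/artifact pattern she describes looks roughly like this in MLflow (the experiment and parameter names here are hypothetical, not 23andMe's):

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
X_train, y_train = rng.normal(size=(500, 10)), rng.integers(0, 2, size=500)
X_valid, y_valid = rng.normal(size=(200, 10)), rng.integers(0, 2, size=200)

mlflow.set_experiment("prs-type2diabetes")        # hypothetical experiment name
with mlflow.start_run(run_name="clump-5e-8-250kb"):
    mlflow.log_param("max_p", 5e-8)               # feature-selection knobs
    mlflow.log_param("min_dist", 250_000)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_valid, model.decision_function(X_valid))
    mlflow.log_metric("valid_auc", auc)
    # Store the exact fitted object so the same artifact can be promoted
    # and served, rather than re-trained in production.
    mlflow.sklearn.log_model(model, "model")
```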

Secondly, we believe PRS Machine really improves our model development workflow by enabling co-development by our scientists and engineers, and we do this via Metaflow. At 23andMe, we believe it's really important that our product development comes from scientists and engineers building together, often in a single repository, committing production-grade code, and in a way that still enables the scientific experimentation that's inherent to data science; Metaflow enables that. It's a Python library from Netflix, also open-source, that allows scientists and engineers to string modeling steps together into workflows. Because the programming model is Python-based, it's super intuitive: scientists don't need to learn a workflow language or custom operators, which removes friction from adopting the framework, so it's something we've been excited to integrate into our workflow.

Moderator: We are actually at time, so I just wanted to note that we do have a lunch break coming up for anyone who wants to stay. I know that you probably have a bit more to present, so I don't want to hurry you, Shannon, and I'll be happy to stay with you and read out questions. We have a lunch break until 12:45, and then we will be back with another presentation. So please, do continue.

Shannon Hamilton: Great. So, to move through this quickly: this is just an example of how we break up our pipeline, all the steps James described, into individual step functions within Metaflow.

Lastly, and most importantly, PRS Machine really improves our model development because it enables the delivery of more robust and meaningful insights to our customers. There are two important pieces of infrastructure that allow us to do this: what we call an interpreter, and then our prediction endpoint. Next slide.

We'll start with the interpreter. Every model trained through our pipeline using PRS Machine is paired with an interpreter. The interpreter contains result thresholds and the statistics needed to produce the customer-facing result from the raw score. Here's an example of what a customer might see on our website or in our app. In this example, we've just released a new health report to communicate an individual's risk for coronary heart disease. The PRS model that powers this looks first at an individual's genetic data and other data like age and ethnicity, then predicts and outputs a risk score for that particular phenotype. That raw score is a number on a fairly arbitrary scale, and the interpreter takes that number, interprets it, and transforms it into something meaningful for our customers, which is what you see on the page: the actual risk of acquiring a disease like coronary heart disease by age 70, communicated to our customers.

In addition to the interpreter, the endpoint design is another way that we believe allows us to serve more robust and meaningful insights to our customers. The way PRS Machine interacts with our production repository is via the prediction endpoint. Again, as I said earlier, in order to reduce storage costs and to optimize for GDPR regulations, we wanted to be sure that PRS Machine is truly performant enough to compute a customer's result on the fly. This means that any time a customer visits a given report on the website or in the app, the result is computed in real time. The ability to do that is dramatically affected by how big the model is, that is, how many different locations on the genome we use for prediction. This is why I chose a latency plot that shows how long it takes to get a result for a customer given the size of the model; the pink arrow points to where our largest model currently sits.
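The reason latency tracks model size is that on-the-fly scoring is essentially a dot product over the model's variant weights plus an interpretation lookup. A hypothetical sketch of such an endpoint's core, not 23andMe's service code:

```python
def score_on_the_fly(genotypes, weights, interpret):
    """Compute one customer's result at request time. `genotypes` maps
    variant id -> dosage and `weights` maps variant id -> beta, so the cost
    grows linearly with the number of locations in the model, which is why
    a 25,000-variant model demands a performant endpoint."""
    raw = sum(beta * genotypes.get(vid, 0.0) for vid, beta in weights.items())
    return interpret(raw)

weights = {"rs1": 0.12, "rs2": -0.05}   # made-up two-variant "model"
customer = {"rs1": 2, "rs2": 1}
result = score_on_the_fly(customer, weights, interpret=lambda s: s)
```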

One benefit of building PRS Machine is that we're now able to serve results for much bigger models. Some of our early reports in the app used only three hundred different locations on the genome for prediction, and our most recent reports are using up to 25,000 different locations. So this is a huge win, and we're really excited about our increased ability to generate more accurate models for our customers and hopefully provide more meaningful results.

Moderator: Perfect. James, Shannon, thank you so much for the presentation. Are there any questions in the chat? Feel free to put in some questions about the presentation. Unfortunately, you all may be competing with lunch, so I'll close by saying thank you so much for joining us, and have a great rest of your conference. We will be back after lunch at 12:45.
