About the talk
With the advent of massive deep learning systems a world of applications lies ahead. However, how do we configure these systems for tasks that have never been investigated before? We are just scratching the surface when applying these deep learning models to healthcare and there is a world of potential. This talk details the journey of how we utilized deep learning to achieve state-of-the-art results on important protein classification and sorting tasks that were recently published in a peer-reviewed journal. The talk will also include discussion of techniques for adjusting the modeling workflow to meet the constraints of a healthcare use case and the importance of experimentation to achieving these results. Finally, the talk will conclude with a discussion around how this Stanford lab expects to utilize these techniques in future health science research and the impact this could have on the field.
Alexander is a PhD student in Computer Science at Stanford, supervised by Michael P. Snyder. He leads a research lab exploring machine learning in bioinformatics and MedTech at stanford-health.github.io. Alexander has been featured in venues such as ICLR and Nature Biotechnology, relying on SigOpt technologies.
Welcome. Today I'll be presenting on the use of deep learning for proteomics, what we intend to do for the future of medicine, and why we believe optimization is critical for that. The story I'd like to share with you is about the research that we've been conducting. I'm a computer science PhD student at Stanford University, and we've been looking into how deep learning can solve fundamental protein classification challenges. So, how is this related to proteins? Well, let me take you through language modeling, the technology that we've been using, how we've been able to transfer it to protein research, and how we've been able to get state-of-the-art performance. This is a classic example of taking something that is well established in one field and applying it to a field like proteomics. This is the first technology that allowed us to make massive gains on proteomics, but before that, let me take you through a
quick tutorial. If you have a sentence, say "I eat lunch in the...", how do you guess the next word? That is the key task in language modeling. You could say "morning", as in taking a lunch break in the morning, but you could also use other words, and the whole point of language modeling is calculating the distribution over what the next word might be. Classically, this has been done with n-grams; it has been researched for decades using n-gram models and statistical learning algorithms. With an n-gram model, you just look at the last word, or the two last words, and ask: given those, what would be the chance of seeing "morning" after them? The trouble you run into, though, is that by taking multiple words into consideration, the search space explodes. You end up having to account for all possible combinations of words before "I eat lunch in the", and that is a lot of combinations, so you end up with models that might have enormous numbers of statistical parameters, which scales poorly, and you have to do a lot of optimization for that. Now, this is where deep learning comes into the picture. Deep learning is the process of using function approximation algorithms, in particular a type of recurrent neural network that keeps a state it updates, and that in linear time is able to go over a sentence, calculate a representation of it, and then use that to guess the next word.
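The bigram case described here can be sketched in a few lines of Python. The toy corpus is hypothetical; a real model would count over billions of words:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real n-gram model counts over billions of words.
corpus = ("i eat lunch in the morning . "
          "i take a lunch break in the afternoon . "
          "i eat dinner in the evening .").split()

# Count bigrams: how often does each word follow the previous one?
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(prev, nxt):
    """P(next word | previous word), the distribution a bigram model estimates."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total

print(p_next("the", "morning"))  # 1/3: morning, afternoon, evening each follow "the" once

# Generate text the way a language model does: repeatedly guess the next word.
word, out = "i", ["i"]
for _ in range(5):
    word = counts[word].most_common(1)[0][0]  # greedy: pick the most likely next word
    out.append(word)
print(" ".join(out))
```

Scaling this beyond word pairs is exactly where the explosion hits: a vocabulary of 50,000 words already has 2.5 billion possible pairs, and the count tables grow exponentially with context length, which is the problem neural language models sidestep.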
About ten years ago, people realized that deep learning could be useful for language modeling, and today it is the state-of-the-art technology for it. It works surprisingly well, especially once you scale the neural networks up, to the point where they can write entire fake blog posts. There is a lot of worry that this might be used for disinformation, making Twitter bots, Reddit bots and whatnot, because at this point in time it is actually difficult to tell them apart from real humans. A really cool example of this was done by OpenAI. OpenAI said: what if we take the biggest model we can imagine, build it by training on all the data we can find on the internet, give it a prompt, and just make it guess the next word? They were able to make entire blog posts this way. What you see at the top is the prompt they used; what you see below is the text that this language model generated. Starting from just the prompt, it was able to guess the next word, and then guess the next word after that, and so on. By iteratively guessing the next word, it generated this entire passage. And notice how it picked up on the context: given the prompt that, in a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley in the Andes Mountains, it reached the conclusion that they were discovered by a Dr. Jorge Pérez of the University of La Paz, which is indeed located in South America. So it is surprisingly good at picking up the content and context of a sentence and using that to generate an entire document. And it is exactly that technology that we are after when we want to understand a much more complex language: the language of proteins. What you see here is what is state-of-the-art in almost all of natural language processing today; by using a model's understanding of the fundamental concepts of text and how we communicate, people are able to build applications on top of it. And what we want to do is bring this into proteins,
because just as we have massive collections of text, we have the UniProt database with hundreds of millions of protein sequences, and proteins themselves can be seen as paragraphs. Proteins are described by an alphabet of about 20 amino acids strung together, typically around a hundred to 300 characters, which is the length of a protein, and that can be seen as about the same size as a paragraph in a document. Because of this similarity in the input, you start to think: hey, could we use this exact same technology to find the underlying syntax and semantics of proteins? In 2019, UniRep came out, which took some of these early language models and, with little to no modification to the original natural language processing code, trained a language model that could extract the semantics and syntax of proteins. And beyond that, researchers have started to use more advanced, state-of-the-art natural language modeling tools, such as the ones you might have heard
of, for example Facebook's ESM-1b, to generate contextual representations of proteins, and by probing these different representations they were able to find a whole array of protein syntax: for example, contact maps and secondary structures. We use this in a very similar fashion to what we see in natural language processing, where you take your high-level understanding of a sentence of text and use it to predict a specific natural language processing task. That might be the sentiment of a sentence, it could be question answering, it could be summarization of a document. And we thought: why don't we use that exact same type of technology to predict, in our case, gene expression? It turns out that we can significantly improve performance on gene expression prediction by using these language models. This is actually in production at companies today, and it just got published: we were able to take an arbitrary protein task, find the semantics of the proteins, the underlying structure hidden in these representations,
and then predict certain phenotypes of interest, in this case whether or not the proteins were expressed. We also looked into which features we actually find. What you see here is that the purple dots are features from UniRep, and the green and other colored dots are known protein features, such as codons, amino acid symbols, and solubility parameters. We find a strong correlation between specific protein features and these generated features, meaning that the model actually captures the underlying physics of the protein just by predicting the next character of the protein. And I think that is very interesting: by predicting the next character of a protein, we can actually recover underlying physical information about the protein. And not only that, but we also looked into the specifics of the protein language model. Take, for example, secondary structure.
We find that in regions with secondary structure, the model had much higher accuracy and much lower variability in its predictions, so there are specific regions where it has a lot of confidence about what the next amino acid should be. So that covers some of these early investigations, where we looked into what types of features we can extract from protein language models. Next, let's take some tasks and see whether this can do well on them. Expression prediction is not the most established benchmark within the field of proteomics, so we thought: okay, why don't we take one of the most popular tools in the field, SignalP, with some 20,000 citations, and see whether we can use a language model to improve performance on this very important task in proteomics, signal peptide prediction? And what you see here is that it did extremely well. The orange boxes at the top left show the performance of the new tool, SignalP 6.0, which is based on language models in exactly this fine-tuning paradigm, shown alongside all of the different baselines: you take a language model, in this case trained on proteins, and then you fine-tune it for a specific task, in our case predicting signal peptides. Now, what we see here is that for about half the tasks the tool was worse or about the same, maybe a little bit better. But on which tasks was the tool significantly better? What we find is that these are the archaea tasks, some of the task partitions where we don't have a lot of protein information. So what you see here are all of these different
partitions of tasks within signal peptide prediction, and for the ones where we don't have a lot of data available, this tool works surprisingly well. And not just that: what you see in the bottom left corner is the identity to the training set, that is, how similar the sequences we are predicting on in the test and validation partitions are to the training set. As that overlap gets smaller and smaller, we see that using the language model still works very well, which means that it generalizes a lot better than the previous tool, which was built on a biLSTM model. We were able to get much better generalization, and the whole process was also a lot simpler, because we just had to fine-tune the language models. In this endeavor, for SignalP 6.0, which was just accepted to Nature Biotechnology, adding SigOpt on top of the language models made fine-tuning very convenient and very easy, because we simply take these pre-packaged deep learning networks, and whenever we have hyperparameters to optimize, we use SigOpt rather than running grid search or random search ourselves. What we find is that this simplifies the process and also gives us a one to three percent performance boost, because it simply goes into a lot more depth with the hyperparameter optimization than we would be able to. So now we have seen how we can use computer science and artificial
intelligence to solve some key challenges in basic science and proteomics. And I think our next question was, naturally: okay, what other major challenges can we make even better with the use of computer science, optimization, and artificial intelligence? A key question here is the future of medicine. What we've been seeing over the last century is an increase in diabetes, chronic illnesses, and medical bills. And we believe that a lot of this might be related to changes in the lifestyle we have. The issue is that finding controls, that is, samples of humans who do not live a modernized lifestyle, so that we can measure the differences, is very difficult today. And a key question follows: how do we measure those differences? How do we measure differences between humans? How are we able to measure something as complex as the human body? So this is the motivation for the future of medicine. We would like to collaborate with people who live pre-industrialized lifestyles, and we would like to understand what a natural human body looks like. What is the shape of the kidney of a human who is not modernized? What is the skull shape of a human who is not modernized? What we know from anthropological research dating over a hundred years back is that the bodies and skull shapes of our ancestors were significantly different from what they are today. And there is a hypothesis that not a lot of this has to do with genetics, but that most of it is
epigenetics. So, in order to test that hypothesis, we would like to go out and actually do the measurements. And this is where computer science and optimization come into the picture for the future of medicine. We have a whole set of different technologies: full-body scans, where you can even scan your body with the LiDAR scanner on your iPad, and microbiome samples, which we know are very important, since most of the DNA in your body is microbial. We know that certain things, such as mouthwash, actually significantly alter that distribution. We also have microsampling technology. A lot of these are technologies that you can somewhat easily distribute to large numbers of people, and those are the technologies we are interested in: taking a backpack and going out to visit some of these humans. We are also interested in fitness trackers for the future of medicine: how can you measure your heart rate, your blood sugar, your sleep? Sleep is a core component of what it means to be a modern human. Most Americans today get too little sleep, while as we know the optimal sleep window is between eight and ten hours. And on top of this there are new technologies such as RNA sequencing. So the key challenge that exists in being able to measure a human body, and being able to measure bodily differences, is that for all of these different devices, access to the data is actually a key challenge. How are
you going to get access to the data? Your Apple Watch data? What about your MRI scans, or your LiDAR scan from your iPad? How do you analyze it? A lot of these technologies are not really available to the broad consumer. A lot of them are only available to clinical researchers, and for many of these technologies, especially wearable devices, you need to strike special deals with Fitbit, Withings, or whichever manufacturer might be out there. Also, many of these devices have a business model built around taking your data and selling it to third parties such as insurance companies, and it turns out that there is a lot of valuable information they can extract from you. We see the same challenges over and over again with big tech companies that use your data in ways you might not want it used, and it is going to be even more so the case with these devices, because there might be a future where it is casual and normal
to walk around with ten or fifteen devices on you; today we already have multiple. And what happens with your data then? What we are very interested in is being able to build a privacy-ensuring database: a database where you can save all your data from these different devices and assays, store it in one place, and toggle the privacy options yourself. Now, when computer science students today get taught machine learning, they get taught natural language processing or computer vision, because we have easy access to those data sets. However, we have so much health data out there today, most likely far more than we actually have for those other topics. The issue is that you can't just point a web scraper at health data, because all of this data is stored behind company walls or in patient journals that only some people have access to. Most likely it is the people behind the companies and third-party vendors who have access to that data, and maybe a few PhD students at institutions. So what we would like is to have a privacy-ensured database where any researcher can go in and access that data in a privacy-ensured manner, come to clinical conclusions, and make it easy and normal to run research studies. Part of the reason why nutrition research and understanding bodily differences is so difficult today is that we have a wide array of population biases across different biobanks, and the challenge is that studies usually run on only a small population of people, because it is very expensive.
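At its core, a privacy-ensuring database of this kind could look something like the toy design below. Everything here (the Participant class, the opt-in flags, the research_query helper) is hypothetical: a minimal sketch of per-user, per-data-type sharing toggles, not any real system.

```python
from dataclasses import dataclass, field

@dataclass
class Participant:
    """One person's data, with a sharing toggle per data type (hypothetical)."""
    records: dict = field(default_factory=dict)  # data type -> value
    sharing: dict = field(default_factory=dict)  # data type -> opted in?

    def add(self, kind, value, share=False):
        self.records[kind] = value
        self.sharing[kind] = share

    def toggle(self, kind, share):
        """The participant, not the researcher, controls this switch."""
        self.sharing[kind] = share

def research_query(participants, kind):
    """Return only values whose owners opted in for this data type."""
    return [p.records[kind] for p in participants
            if p.sharing.get(kind) and kind in p.records]

alice = Participant(); alice.add("sleep_hours", 7.5, share=True)
bob = Participant();   bob.add("sleep_hours", 6.0, share=False)
print(research_query([alice, bob], "sleep_hours"))  # [7.5]: Bob opted out
```

A real system would of course need authentication, auditing, and techniques such as aggregation or differential privacy on top; the point of the sketch is only that access control sits with the participant rather than with the device manufacturer.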
But even though we have the data today, it is just not usable; researchers are not able to use it. So our hope is that by creating a privacy-ensured database where you can include all of this data, we can come to better conclusions about what optimal health is. There are also some key challenges with respect to this: ensuring privacy, being able to extract the data, and optimizing when you should do what. This last one is one of the optimization challenges we are very interested in, because while some of these assays might be inexpensive, others are not. RNA sequencing, one of the core, most exciting technologies today, lets you measure some 15,000 different types of RNA in your blood, but one of these samples might cost up to $4,000 to analyze. So, if we are to make this available to the public, and not just to clinical studies, and make it the normal thing that you can analyze your data and get clinical feedback on it, then we need to understand, in at least an optimized way, when to run an assay. When do you need to get an MRI scan? Today, a lot of insurers will give you a yearly health check. Is that necessary? What should be done in this yearly health check? Should you get some microbiome samples? Could a LiDAR scan of your body show you something that might change the way we do medical treatment? This is where we believe tools such as SigOpt are pivotal, because we need tools that can help us optimize when to make these decisions. How do we balance exploration versus exploitation? If one sample is $4,000, that is a lot of money we could save, and many more people whose health we could improve through our studies, if we can properly balance this. Thank you for stopping by my presentation today. If you'd like to know more about our research and the future of medicine, please visit our website at stanford-health.github.io. Thank you.