Reproducible computation at scale with {drake} (Will Landau)

Will Landau
Research Scientist at Eli Lilly and Company
R/Medicine 2020
August 28, 2020, Online, USA

Okay, so our next speaker is going to be Will Landau, and he's going to be speaking about reproducible computation at scale with {drake}. Will has pre-recorded this talk for us, so he will be available in the chat to answer questions during the talk, and there will probably be a few minutes left after the talk to finish any questions.

Thank you all for coming, and thank you to R/Medicine for the opportunity to speak today.

In data science, we develop ambitious computational workflows for statistics: a lot of Bayesian analysis, machine learning, simulation, and prediction. We need to think about efficiency and reproducibility right from the start. Many of these projects require long runtimes. Methods like Markov chain Monte Carlo and deep neural nets are computationally expensive, and it could take hours or even days just to fit a single model. That's fine if you're only going to run the project once, or at regularly scheduled, predictable times.

But if the code is still under development and you're making a constant stream of changes, several a minute, in real time, it's easy to get trapped in a vicious Sisyphean cycle. A large workflow usually has a large number of moving parts: datasets we want to preprocess or simulate, analyses of those datasets, and summaries of those analyses. If you change any one of these parts, whether it's a bug fix, a tweak to a model, or some new data, then everything that depends on it is no longer valid, and you need to rerun the computation to bring the results back up to date.

This is seriously frustrating when you're in development and changes are coming fast. You're making, like I said, several updates a minute to the code, rendering new artifacts, all in real time. And if every one of those changes means you need to rerun the project, there's no way the results can keep up, unless you use a pipeline tool. There are pipeline tools for production, which resemble Apache Airflow, and there are pipeline tools for development, which resemble GNU Make. Today I'm going to talk about the Make-like tools, because those are the ones I think are designed for this part of the process.

It's an action-packed space with a lot of great options, but unfortunately there's not a whole lot for R, and that's where drake comes in. Drake is a Make-like pipeline tool that is fundamentally designed for R. You can call it from an R session, it supports a clean, idiomatic, function-oriented style of programming, and it helps you store and retrieve your results. Most importantly, it gets you out of the Sisyphean loop of long computation, it enhances reproducibility, and it takes a lot of the frustration out of data science. Let's go to an example.

So I'm part of the capabilities team at Lilly, and much of our work revolves around the design and simulation of clinical trials. In the first several months of 2020, we helped design several trials for potential new treatments of COVID-19. We used simulation to assess the operating characteristics of these trials and to help determine features like sample size, primary endpoint, and even when a trial should stop. This was a cross-functional, multidisciplinary effort with strong statistics leadership in the mix.

This slide has a mock example of a clinical trial simulation study. It's not the actual simulation study for any one real-life trial in particular, and it's oversimplified for pedagogical purposes, but it does represent how my team and I set up the computation for this general kind of problem. We use drake a lot, and we use it a lot in this way. So this is a mock phase 2 trial, and the goal of the simulation is to understand the trial's operating characteristics: when is this trial going to claim the therapy works, when is it going to claim the therapy doesn't work, and under what situations is it going to make each decision most of the time?

We want to design a trial that makes the correct decision without an unnecessarily large sample size. So one of the things we might pay attention to is whether a group of 200 patients is a large enough sample size. Suppose we want to enroll newly hospitalized COVID-19 patients and measure the number of days until they're cleared to leave. In the simulation, we randomize half the patients to treatment and half to placebo, and we measure the drug's ability to shorten the hospital stay.

At the end of the trial, there are multiple prespecified criteria to determine whether the therapy moves on to phase 3 studies, including patient safety, efficacy, cost-effectiveness, and more. But suppose we meet the efficacy criterion if the posterior probability that the hazard ratio of hospital discharge exceeds 1.5 is greater than 60%. We assess the design of the trial with the simulation at the bottom of the slide. First, we draw time-to-event data for each simulated trial from the model distribution.

Then we analyze the simulated data and evaluate the efficacy rule using a Bayesian proportional hazards model, and we repeat for many simulations. We aggregate the results to figure out what the efficacy decision of the trial is likely going to be under different effect size scenarios. So that's the background. How do we implement this? Let's have a look at the file system of this project. We have R scripts to load our packages, to define our custom functions, and to define something called a drake plan, which I'll get to later.

We also have an _drake.R script to configure and set up the workflow at the top level, and some other top-level run scripts just for convenience. And we have an SGE template file; SGE stands for Sun Grid Engine, and this file helps us distribute the workload across multiple nodes of a grid engine cluster. Most of the code we write is going to be in the form of custom functions. Now, this may be unfamiliar to a lot of folks who are used to writing imperative code in numbered scripts, or even just putting everything in a bunch of R Markdown reports.

Functions scale much better for big stuff. A function is just a reusable set of instructions with multiple inputs and a single return value. Usually those inputs are explicitly defined and easy to create, and usually the function has an informative name. Functions are a fundamental built-in feature of almost every programming language, and they are particularly well suited to R, which was designed around formal functional programming principles. The most obvious use for a function is as a way to avoid repeating code scattered throughout the project: instead of copying and pasting the same code everywhere, you just call the function.

But functions are not just for code you want to reuse, they're also for code you just want to understand. Functions are custom shorthand. They make your work easier to read, understand, and break down into manageable pieces to document, test, and validate, and that really helps bolster the reproducibility and reliability of clinical research. Most of our functions revolve around three kinds of tasks: preparing datasets, analyzing those datasets, and summarizing those analyses.

This is one of the top-level functions. It accepts design parameters as arguments, and it returns a tidy data frame of simulated patient-level data. Inside the body, it calls another custom function to simulate the data, which we define elsewhere in the functions file. Another custom function, called model_hazard(), actually fits the model, and it uses further custom functions to run the MCMC chains and summarize the samples, generating a one-row tidy data frame of the results for a single simulated trial.
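
To make that concrete, here is a minimal sketch of what such a simulation function might look like. The names, arguments, and the simple exponential model are illustrative stand-ins, not the actual code from the slides:

    library(tibble)

    # Simulate patient-level time-to-event data for one mock trial.
    # n is the total sample size; mean_control and mean_treatment are
    # hypothetical design parameters: mean days to discharge per arm.
    simulate_patients <- function(n = 200, mean_control = 20,
                                  mean_treatment = 10, seed = 1L) {
      set.seed(seed)
      tibble(
        patient_id = seq_len(n),
        arm = rep(c("control", "treatment"), each = n / 2),
        days_to_discharge = c(
          rexp(n / 2, rate = 1 / mean_control),
          rexp(n / 2, rate = 1 / mean_treatment)
        )
      )
    }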

At this point, you already have something to take away and apply: even if you decide not to use drake, this function-oriented style still has a lot of value. And if you are thinking about using drake, converting to functions is almost all the work that's involved. Once you've done that, you're already almost there, and all you need now is to outline the specific steps of the computation, how those functions fit together, in an object called a drake plan. This is how you define that plan: there's this drake_plan() function, and inside a call to it, you list out steps called targets.

Each target is an individual step of the workflow. It has an informative name, like patients, and it has an R command that invokes the custom functions we wrote. Drake has syntax to define entire groups of targets, so I'm jumping right ahead with this patients step: because of this map in the definition, we define a patient-level dataset for every simulation repetition. Later on in the plan, we have targets to analyze the datasets, summarize each effect size scenario, and combine the summaries, and at the end we combine the results into a single readable data frame.
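
As a rough sketch, a plan along these lines might look as follows; the target names, the number of repetitions, and the helper functions model_hazard() and summarize_models() are illustrative, not the exact slide code:

    library(drake)

    plan <- drake_plan(
      sims = seq_len(1000),                  # simulation repetitions
      patients = target(
        simulate_patients(seed = sims),      # one dataset per repetition
        dynamic = map(sims)
      ),
      models = target(
        model_hazard(patients),              # one model fit per dataset
        dynamic = map(patients)
      ),
      results = summarize_models(models)     # combine into one data frame
    )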

The drake_plan() function doesn't actually run any of this work just yet; it simply returns a tidy data frame of the work we have planned. We break the work down target by target because we want drake to be able to skip targets that are already up to date and run only the ones that need to refresh, and this is going to save us loads of runtime. It's always good practice to visualize the dependency graph of the plan before you start. Drake has functions to do this for you, and it really demystifies how drake works.
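
For example (recent versions of drake accept the plan directly):

    library(drake)

    # Render an interactive dependency graph of the plan in the Viewer.
    # Outdated targets appear in a different color than up-to-date ones.
    vis_drake_graph(plan)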

In the graph, you can see the flow of the project from left to right: decide how many simulations we're going to run, run those simulations to generate the patient datasets, fit the models, and then summarize them. But how does drake know that the models depend on the patients? The order of the targets you write in the plan doesn't actually matter, because drake resolves this dependency graph itself: it notices that the symbol patients is mentioned in the command of the models target. That's why, in fact, we get one models target for each patient-level dataset.

That's because of the dynamic branching in this plan. Drake scans your commands and functions without actually running them, both to look for changes and to understand which parts of the code and which targets depend on one another; this is called static code analysis. To put it all together, we use a script called _drake.R: we load our packages, functions, and plan, we set options to farm the work out to the cluster, and we end with a call to drake_config(). To actually run the workflow, we use a function called r_make().
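
A minimal sketch of such an _drake.R script, assuming the clustermq backend and an SGE template file like the one mentioned earlier (the file names are illustrative):

    # _drake.R: read by r_make() in a fresh R process.
    source("R/packages.R")   # library(drake), etc.
    source("R/functions.R")  # the custom simulation/analysis functions
    source("R/plan.R")       # defines the `plan` object

    # Farm targets out to a Sun Grid Engine cluster via clustermq.
    options(
      clustermq.scheduler = "sge",
      clustermq.template = "sge.tmpl"
    )

    # _drake.R must end with a call to drake_config().
    drake_config(plan, parallelism = "clustermq", jobs = 100)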

r_make() creates a new, clean, reproducible R process, runs the _drake.R file to populate the session, and then runs the correct targets in the correct order according to that dependency graph, writing the return values to storage. The process distributes targets across the cluster, whether that's grid engine, SLURM, or TORQUE, or just across the cores of your local laptop. Drake automatically knows from the graph which targets can run in parallel and which need to wait for their dependencies, so you don't need to think about how to parallelize your code; you can just focus on the content and the methodology.

Afterwards, all the targets are in storage. There's a special key-value store in the hidden .drake/ folder at the project root, and drake has special functions to retrieve these artifacts as ordinary R objects. Drake tracks them for you, so you don't need to worry about how to store files; drake takes care of the file management for you.
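
For instance, to pull a finished target out of the cache (using the hypothetical results target from the plan sketch above):

    library(drake)

    readd(results)   # return the target's value from the .drake/ cache
    loadd(results)   # or load it into the current environment by name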

Here we have the first round of the operating characteristics. We have a strong scenario at the top, which assumes the drug cuts hospitalization time in half, and a scenario with no effect at all, and the trial declares efficacy under the former but not the latter, which aligns with our prior expectations. It's a sign that the code is working, but it's not useful yet, because it only states the obvious. So we need to add more scenarios to understand the behavior of this trial. In practice, we reach out cross-functionally and comb the literature on the disease state to come up with these; the effect size of interest depends on the situation. In any case, we add a new scenario,

in this case by going to the plan and proposing a new effect size. Right away, drake understands that we've added more targets and that the previous ones are still up to date; that's what this graph shows you. So when we run r_make() again, only the new scenarios actually get computed. Drake skips the rest and saves us a whole lot of runtime on models we don't need to refit.
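
A sketch of that workflow from the console:

    library(drake)

    # List which targets the new scenario made outdated; like r_make(),
    # r_outdated() reads the _drake.R configuration script.
    r_outdated()

    # Rebuild: drake runs only the outdated targets and skips the rest.
    r_make()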

This behavior of skipping steps that are already up to date has really helped my team and me, especially during the high-pressure, exciting, fast-paced COVID-19 work. We were iterating on simulation studies with runtimes of an hour or two, and we just didn't have time to rerun all the previous analyses. But we need a reproducible end product, because this is serious research and it's going to affect the lives of patients. We need to move fast, and we need to make sure we're doing things correctly, and drake allows us to do both. The final results automatically update with the results we have so far, and the new scenario is now in the middle.

Now, I didn't show this, but drake also takes the functions into account. If I were to change a function, that would automatically invalidate the targets downstream and the results at the end. At the end of the day, drake can tell you if all your targets are up to date, and this is tangible evidence that your output matches the code and data it's supposed to come from, evidence that rerunning the same code would give the same results. That's a huge piece of reproducibility.

You can learn more about drake in the online manual, the reference website, the public examples, and the online workshop. I've taught the workshop at other conferences, but it's also something you can run on your own in the cloud: just sign up for RStudio Cloud in a web browser, and you have everything you need to get started. I owe many thanks to the R community, especially rOpenSci, for the vigorous discussion and widespread adoption that made drake the package it is today. So many people have contributed. They surfaced problems I didn't know existed, and the active participation was incredible fuel for development over the past four years.

Drake is a peer-reviewed rOpenSci package, and if you would like to share your use case, consider reaching out at ropensci.org/usecases.

So we have a few minutes for questions. Will has been answering a few of them in the chat, but there are a few others that we will take here. Will, are there plans to make migration to drake easier? Migrating from a large R script to drake's format of one expression per target can be a source of friction.

That's a great question, and I get asked it quite a bit, actually.

A contributor helped me out with a feature a while back; I believe the function is code_to_function() in drake. It helps convert a single script into a function in a way that is compatible with drake, in the sense that you can insert the script into a drake plan as a target. There is a chapter in the online manual about that. If everything is in one script and it doesn't make sense to include it as an entire target on its own, it may take some manual disentangling.
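
A minimal sketch of that migration path, with a hypothetical legacy script name:

    library(drake)

    # Wrap an existing script in a function so it can be a plan target.
    run_old_analysis <- code_to_function("old_analysis.R")

    plan <- drake_plan(
      legacy_results = run_old_analysis()
    )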

There's actually a package that was recently reviewed and onboarded at rOpenSci, called Rclean, that helps inspect and disentangle the interconnected parts of a script in order to break it apart into different scripts or different functions.

And we have time for one more question, and then we're going to move to the next session. How does drake get along with Shiny?

It depends on what you want to do, on what purpose drake has in that interaction.

Most commonly, I would say, it's a situation where you have a target at the end of a drake plan that deploys some precomputed work as a Shiny app. So maybe you have a drake plan that does a long computation to build a final dataset, and you ship that dataset, along with the Shiny app, to shinyapps.io or RStudio Connect. My team and I do this with slides quite a bit, and there are some of these cases with Shiny. I think that's the most common pattern.
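
A sketch of that pattern, where build_final_data() and the app/ directory are hypothetical:

    library(drake)

    plan <- drake_plan(
      final_data = build_final_data(),  # the long precomputation
      deployment = {
        # Ship the precomputed dataset alongside the app, then deploy.
        saveRDS(final_data, "app/final_data.rds")
        rsconnect::deployApp("app")
      }
    )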

There are other kinds of interaction, but it really depends on what you want to do with Shiny in each case.

Great. Thank you, Will, and we're going to move on to our next speaker. Very exciting.
