Build Your Own Universe: Scale High-quality Research Data Provisioning with R Packages

Travis Gerke
Data Science Leader at Moffitt Cancer Center
R/Medicine 2020
August 28, 2020, Online, USA

About speakers

Travis Gerke
Data Science Leader at Moffitt Cancer Center
Garrick Aden-Buie
Ph.D. Candidate at University of South Florida

As academic faculty, I led cancer-focused research teams in the application and development of tools for applied machine learning, causal inference, and biostatistics. In the post-academic setting, I direct data science efforts in cloud-based informatics and advanced analytics with a focus on the healthcare sector. A cross-cutting theme across these efforts is the use of R for data science, particularly towards transforming real-world data into real-world evidence.


Garrick Aden-Buie is a doctoral candidate in the Department of Industrial and Management Systems Engineering at the University of South Florida. He received a Bachelor of Science in Applied Mathematics and a Bachelor of Arts in Spanish from Lehigh University in 2007. His research interests include data mining and predictive analytics for healthcare decision support systems, with a focus on patient-specific predictive models.


About the talk

This video is part of the R/Medicine 2020 Virtual Conference. (Travis Gerke, Garrick Aden-Buie)


Transcript

Great. Hi, everybody. First, thanks to the amazing organizers of R/Medicine. This has been quite an experience, with few, if any, technical glitches. I'm Travis Gerke. I am health informatics director and scientific director of Collaborative Data Services at Moffitt Cancer Center in Tampa, Florida. I'll explain what those groups are in a couple of moments. I'm giving this talk with Garrick Aden-Buie. The way this will go is that I'll talk through my slides, and when I finish I'll hand off to Garrick to wind things down.

Since I don't own the last slide at the end, I'll say this up front: you're about to see some really cool slides, and I can personally take credit for effectively none of them. Many of you will know that Garrick is really a community wizard and visionary in making and sharing slides that are elegant, interactive, and awesome. So for all the cool things you're about to see, I gave him a little bit and he took it and ran. My part is about the personnel and resources [unclear] and the value of developing and maintaining packages at institutions that are data giants.

Once upon a time, our organization conducted all data-related business in an amorphous cloud known as the IT department. This is a common paradigm for many healthcare organizations in the early stages of data maturity. IT performed many roles. Hospital operations needed dashboards for planning purposes; IT had someone who could do that. Researchers needed patient assessment data for their IRB-approved protocols; IT also had somebody for that. Somebody, of course, needed to know about data lineage, as well as coding and metadata standards: how the data got here and why it looks the way that it does. That was also IT. Organizing databases into a warehouse and granting access is, of course, also important to mention.

Some of these teams were operating at a scale at which they would be better situated as independent entities outside IT's gravity field. One of these is business intelligence: that person who was making dashboards for operational, that is, non-research, stakeholders is now part of a larger team that creates such products at scale. Next, research-focused stakeholders have many of the same needs as the operational end users, such as reporting, dashboarding, and, importantly, data provisioning. The twist, of course, in the research space is that such activities must be conducted in accordance with IRB and ethical approval, and study design feasibility, as it relates to data availability and structure, requires specialized training in data science, biostatistics, and epidemiology. A department of data services can operate at scale here. Backing both is a critical and complementary team, data quality and standards, formed from IT's data stewards, which ensures that data dictionaries are robust and that data lineage is understood by the business intelligence and data services teams for appropriate downstream usage.

As institutional data assets grew, warehousing and access rules became a necessarily complex challenge. Now, with so many teams conducting data-related operations at a rapid pace, we needed a shuttlecraft to coordinate technology strategy, inform general data governance, and mine valuable software from the asteroid belt. That joke... I know you're laughing. Tools from the asteroid belt make their way to the appropriate groups and are shared across the institution. When a tool is ready for deployment, maintenance, and institutional support in a production environment, the new applications development landmass can help out. For example, they maintain software such as RStudio Server or GitHub Enterprise.

This whole story, with some shortcuts for clarity, mirrors the rise of the chief data officer role across the healthcare industry, and indeed all these groups tend to roll up to that vertical. This is our first hint that scaling data provisioning isn't just about scaling data. It's about scaling the people who are doing the provisioning. In part two, Garrick is going to tell you more about the how.

So, as Travis talked about scaling provisioning by scaling systems of people, I'm going to talk about how we scale those people and their access to data systems through R packages. I'll start with an entirely hypothetical, but probably somewhat familiar, story. It starts with a question.

I want to connect a tissue sample inventory to patients' clinical data. It's not something I've done before, so I'm not quite sure how to access the samples table or how to map a sample to a patient, but obviously this has to be possible, right? So how do I get started? If you believe the big data stock photos, I go to the self-service data wall and pick the numbers that I want. In reality, it probably starts with an email, or many emails. I start by reaching out to someone I know in data engineering who manages that particular data resource, and I see what they can tell me. "Dear friendly database person, how can I connect a sample inventory to patient-level clinical data? I've heard that you know the secret. Thanks, Garrick." I fire off the email, and a little while later I get a reply: "Hey Garrick, we use the population table. Good luck."

The email came with an attachment that I can open up, and I'm immediately hit with a wall of SQL code. It doesn't look pretty, but in a couple of hours I'll probably get the gist of it. If I go looking, somewhere in here there are some tables referenced: there's a samples table, a patients table, and a sample indicators table. After a while of puzzling, I realize that some of it is about turning coded values into text labels. It's at least code, right?

Well, since we're emailing files around, sometimes you're given a query like this in a slightly different form, like a Word document where the query doesn't really fit on the screen or the page and, let's just say, formatting choices are fluid. Emailing and Word documents aside, SQL queries are not a great vehicle for knowledge transfer. They're good for precisely communicating data specifications in the robot language, but we have other ways of working with data that have been specifically designed with humans in mind. For example, dplyr, whose API is very intentionally designed in line with the philosophy that code is written for people to read and only incidentally for machines to execute. This reminds me of a great quote from Jenny Bryan: "Of course someone has to write for loops... I mean SQL code. It doesn't have to be you."

So let's take a look at what this query might look like in an alternate universe. Here's the same query, rewritten using a blend of dplyr and custom functions that support our particular setup. Let's walk through the code step by step and see what it represents.

First of all, we call our universe the moffittverse, very much inspired by the tidyverse. A single library(moffittverse) call loads a common set of packages that we use for nearly every data request. Most of these packages come from the tidyverse, but we also include our own supporting package specifically tailored to my team's workflow. This creates a common starting point for everyone on the team and also gives us a formal on-ramp to install and set up database dependencies that we can leverage in specific packages that interface with our many back-end systems.

Connecting to a specific database is straightforward. You call use_backend() with the name of the database server that you need to connect to, in this case the fictional ABC database. Behind the scenes, this will load database-specific packages, including a specific package for this resource, the Moffitt ABC package. Each of these back-end packages has two primary goals. The first is to simplify access. By default, the ABC package will not only remember the incantations required to connect to the ABC database, but it'll actually manage the connection for users internally. It also provides easy access to tables with functions like abc_table(). This hides a bunch of other assorted, less inviting dbplyr code: it manages the connection for the user, and it connects to the tables directly. Here we connected to the three tables we need: samples, patients, and a sample indicators table.

Okay, so we've set up our workspace and our environment, and we've connected to the tables that we need. We can now focus on how these tables relate to each other: how we can get from samples to patients through a series of left joins. Finally, the last lines speak to the second goal of the back-end-specific packages, which is to wrap common, tedious, or error-prone database moves into standard functions. We have a lot more flexibility to write functions, use tidyselect, tidy evaluation, and more, and do things that would otherwise be very hard to do in SQL, like applying a not-deleted filter to all of the tables used, or automatically looking up text labels of coded values.
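The slide code itself is not part of this transcript, so the walkthrough above can only be approximated. What follows is a minimal sketch in base R, not the speakers' implementation: the real helpers wrap dplyr and dbplyr against a live database, whereas this version uses small in-memory data frames so that it runs anywhere. The table contents, the column names, and the exact signatures of `abc_table()` and `abc_choice_replace()` are assumptions made for illustration.

```r
# Toy stand-ins for three database tables (all names and values are made up).
fake_db <- list(
  samples = data.frame(
    sample_id   = c(1, 2, 3),
    patient_id  = c(10, 11, 10),
    sample_type = c(1, 2, 1),          # coded value, looked up below
    deleted     = c(FALSE, FALSE, TRUE),
    stringsAsFactors = FALSE
  ),
  patients = data.frame(
    patient_id = c(10, 11),
    birth_year = c(1950, 1962),
    deleted    = c(FALSE, FALSE),
    stringsAsFactors = FALSE
  ),
  sample_indicators = data.frame(
    code  = c(1, 2),
    label = c("tumor", "blood"),
    stringsAsFactors = FALSE
  )
)

# Hypothetical table accessor: hides where the data lives and applies the
# "not deleted" filter that every query would otherwise have to repeat.
abc_table <- function(name, db = fake_db) {
  tbl <- db[[name]]
  if (is.null(tbl)) stop("Unknown table: ", name)
  if ("deleted" %in% names(tbl)) {
    tbl <- tbl[!tbl$deleted, setdiff(names(tbl), "deleted"), drop = FALSE]
  }
  tbl
}

# Hypothetical helper: swap coded values in one column for their text labels.
abc_choice_replace <- function(data, column, choices) {
  data[[column]] <- as.character(choices$label)[match(data[[column]], choices$code)]
  data
}

samples  <- abc_table("samples")
patients <- abc_table("patients")

# Left join: keep every sample, attach patient data where it exists.
result <- merge(samples, patients, by = "patient_id", all.x = TRUE)
result <- abc_choice_replace(result, "sample_type", abc_table("sample_indicators"))
print(result)
```

The point of the design is that the person writing the analysis only sees table names, joins, and one labelled cleanup step, while the soft-delete rule and the code-to-label lookup live in one place.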

Okay, let's take a step back. Apart from less repetition, an advantage of R over SQL is that this code does a much better job explaining to a human how data is being collected and transformed. There's still more here, though, because these functions live in R packages, and they bring a lot of context with them. So let's take a look at the story of the abc_choice_replace() function. We've already seen this function, and we've already seen that our naming conventions communicate the function's intent, right? We can read this like "and then replace the choices."

On top of that, the function name is chosen to aid discoverability. In other words, a user can easily find other functions that operate on choice columns by exploring autocomplete: typing abc_choice and seeing what other options are available. Because this function lives in an R package, we can document what the function does and why right next to the code, and the documentation is comfortably available right inside the data analysis environment. The body of the function can be considered technical documentation. Recording how the function works is more precise than just a description of what best practices are. We've learned that when interfacing with more technical teams, the function itself becomes a specification for how we accomplish a task, which makes it easy to say to engineering: this is what we do, and this is what the new platform needs to support.

Taking another step back, this function isn't just about making life easier for someone working with the data. We now see that it's a self-contained unit of knowledge. In this view, an R package isn't just a place to keep code. It's where we store best practices and lessons learned. It's how you share that knowledge with others on your team, and it comes with fancy websites. Seriously, the tooling for package development is amazing. pkgdown sites don't just make your code pretty, browsable, shareable, and discoverable; they make your package documentation a viable knowledge repository and a place to turn when you need to learn something new. On top of this, if you're using version control front ends like GitHub or GitLab, you also have a public place for sharing knowledge, asking questions, or getting help when things break down. Rather than sending emails that are only seen by the people copied on the email, you can open an issue where your question is seen by everybody, the answer is publicly available for future reference, and maybe it becomes the basis of new functions and new functionality.

So, I'd like to close with a few practical tips about how to make this happen in your organization and teams.

The first one is start small. Start with one team and make their lives better. I guarantee you that if you look for it, you will find a painful manual process just waiting for a hero like you.

My second tip is to stay small. Rather than throwing everything into one big monolithic package that everybody uses, I've had success creating smaller, more focused packages. It gives me a little bit of freedom to experiment and also lets me make sure that I'm providing targeted solutions to the problem at hand.

My next tip is to use vignettes. Vignettes are a great way to document and share processes that aren't easily captured in a single function or even in R code. You can write vignettes to document database driver setup and configuration, or to show how you would accomplish a whole-game analysis from start to finish.

And finally, be opinionated. Provide a happy path through a range of workflows. Help your users fall into a pit of success by making sure the happy path is as smooth and bump-free as possible.

None of this would be possible without a slew of packages and resources. Key among these are usethis and devtools, which are great for package building, and roxygen2 and pkgdown for package documentation. If you're new to package building, Hadley Wickham and Jenny Bryan's R Packages book is a great place to start learning about R packages, and it's an invaluable resource to turn to when you get stuck. We also used drat by Dirk Eddelbuettel to create an internal CRAN-like package repository, and it made package installation so much nicer and easier for our users. Another option there is RStudio Package Manager. And finally, a big shout-out to Mike Kearney's pkgverse template, which made it really easy to pull all of our packages into a cohesive unit and to create something a fraction as cool as the tidyverse ourselves.
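The idea of a function as a self-contained unit of knowledge, with an intent-revealing name, a shared prefix for autocomplete, and documentation next to the code, might look roughly like the sketch below. It is an assumption-laden illustration, not the package's actual source: the `#'` lines follow roxygen2 comment conventions, but the argument names, the `code` and `label` columns, and the sibling function `abc_choice_labels()` are hypothetical.

```r
#' Replace coded choice values with text labels
#'
#' Many columns in the (fictional) ABC database store numeric codes.
#' This helper looks each code up in a choices table and swaps in the label,
#' so analysts do not have to rewrite the lookup for every request.
#'
#' @param data A data frame containing the coded column.
#' @param column Name of the coded column, as a string.
#' @param choices A data frame with columns `code` and `label`.
#' @return `data`, with `column` replaced by character labels. Codes without
#'   a matching label become `NA`.
#' @export
abc_choice_replace <- function(data, column, choices) {
  stopifnot(
    is.data.frame(data),
    column %in% names(data),
    all(c("code", "label") %in% names(choices))
  )
  data[[column]] <- as.character(choices$label)[match(data[[column]], choices$code)]
  data
}

#' List the labels available in a choices table
#'
#' Shares the `abc_choice_` prefix so that autocomplete surfaces it next to
#' `abc_choice_replace()`.
#'
#' @param choices A data frame with columns `code` and `label`.
#' @return A character vector of labels.
#' @export
abc_choice_labels <- function(choices) {
  as.character(choices$label)
}

# Usage with made-up data
choices <- data.frame(code = 1:2, label = c("tumor", "blood"),
                      stringsAsFactors = FALSE)
df  <- data.frame(sample_type = c(2, 1, 9))
out <- abc_choice_replace(df, "sample_type", choices)
print(out)
```

In a real package, running `devtools::document()` would turn the `#'` comments into help pages, which is how the documentation ends up available inside the analysis environment.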

So with that, I'd like to say thank you for giving us the opportunity to talk about our experience building packages, and I'll leave it here. I could talk about this for a long time. You can find Travis and me online if you'd like to talk more about this. Otherwise, I wish you the best of luck building your own universe of R packages, and happy R-ing.

I'm seeing questions; I could answer some. I thought Madhuri would jump back in. There are lots of questions about the slides: xaringan is what I used to make the slides, and then a lot of extra HTML and CSS. A lot of slide crafting, I call it.

Thank you. While I'm browsing questions, if I could pick one more, from Peter: "Data engineers like to randomly change the names of tables and fields when the mood strikes." That's a great question, and that is why we have two teams that are complementary in this regard: the data engineering teams I mentioned and the health informatics team, which has a governance function, where we make sure that we approve any changes to table names or field names.

Hey Travis, another question: how do you get over the learning curve to introduce people to functions?

I'll take that. I think, definitely, as a package developer you have two, maybe two or three, kinds of people in mind. First you have the very new users who are going to use your functions. In my role, I do a lot of watching over other people's shoulders and seeing how they approach a problem and how they tend to code. Often I start to see patterns emerge between how one person is doing one thing and how another person is doing the same thing. By having those conversations, I start to think, okay, we could build this into a function. So in a way I'm a curator of these processes, but this is also something you could eventually train your users to write themselves.

There is another question here, well, a question or comment: it would be interesting to see how you handle huge datasets in the packages. Any thoughts on that?

Yes. We often write packages that connect to databases rather than putting large datasets in those packages. There's a lot of value in that, though, because we find that basically every database has its own quirks, its own weird way of storing the data, the one or two things that you need to know about that database, and packages are an awesome way to start to document that knowledge.

Okay, great, thank you. That was great. Those were really great. I'll end the session and send people over to the next one.
