Stephanie Kirmer is a Senior Data Scientist at Saturn Cloud, a platform enabling easy-to-use parallelization and scaling for Python with Dask. Previously she worked as a DS Tech Lead at Journera, a travel data startup, and as a Senior Data Scientist at Uptake, where she developed predictive models for diagnosing and preventing mechanical failure. Before joining Uptake, she worked on data science for social policy research at the University of Chicago and taught sociology and health policy at DePaul University.
About the talk
To apply AI effectively in a business setting, and to get the optimal benefit for business decision-making, data needs to be ready to use and easy to access for data science teams. Journera (a travel industry data startup) has built our data warehouse using Airflow and AWS Redshift, and we're using it to access and analyze hundreds of millions of records on the fly. This talk will share an introduction to each tool, walk through the pipeline that can be built from any data store to the Redshift platform, discuss the architecture of a relational data warehouse in Redshift, and give tips on how to avoid mistakes we made in our own process.
Thank you so much for the introduction, and thank you for having me — I'm really excited to be here. To add to the introduction a little bit: at Saturn Cloud, we work on making cluster computing available for data scientists. If anyone is interested in learning more about that, we have a free trial of our platform that you can use to test out GPU or CPU clusters to run your machine learning and data science code in Python. It's very exciting, and it really gives you the opportunity to do things you might not otherwise be able to do on a local or single-node machine.

But today I'm here to talk about something quite a bit different, because in a couple of different contexts in my working career, I've had the opportunity to use Airflow and relational databases, including Amazon Redshift, to create data pipelines for modeling and for machine learning workflows. I found this combination useful and practical, but also a bit challenging for new users to get into, and that's why I wanted to produce this talk: so that this option — which is one of many ways a data-processing pipeline can be handled — is a little clearer and easier for folks to consider, and so you can see whether it might be the right fit for your business. You can find the slides I'm showing, and other materials about this, on GitHub — the link is at the bottom — and you can also follow me on Twitter, where I post links to materials like this and to other talks.

In bringing this talk to you today, I want to make clear that I don't expect anybody in this audience to know anything about Airflow or Redshift. We're going to cover what they both are, and what their functions are in the broader workflow I'm describing. I'm really coming at this from a data warehousing context, because building a good data pipeline that generates clean, useful data for warehousing is key to effective scaling of machine learning workflows. If your data is orderly and easy to access for data scientists and machine learning engineers, you're laying the groundwork for high-value machine learning in your workplace. We'll also talk a little about what different workflows might look like for model training as well as for prediction, or inference. And at various points during the talk, I'll give you a little insight into mistakes I made when I was working with these tools and building out infrastructure with them, because there's no reason you should have to repeat the mistakes I've already made.

Just to give you a quick overview of the things to be conscious of as we go on: Redshift — and we'll talk about this more in a moment — is an AWS tool that is a relational database, and the link to installation and pricing information is right there on the slide. It is not free, of course, but it does have — and we'll come back to this — a common ancestor with Postgres, so if you're wondering what kind of relational database it most resembles, the obvious answer is Postgres. Then we'll talk about Airflow, which is an Apache product and open source: you can see the entire code base, and installation instructions are at that link. I find it really helpful, if you're working with these tools in an enterprise context, to have someone on the DevOps or Ops side of the business to help with some of the scaling and infrastructure needs for this kind of project. But if you don't have a person like that, I still think you can get a lot out of this and use these tools effectively.

Let's start with Redshift. As I mentioned, it has a common ancestor with Postgres, so if you're familiar with that kind of SQL — there are dialects and flavors of SQL, of course — a lot of the syntax and idiosyncrasies you've met in Postgres will be similar in Redshift. Not entirely the same: you'll find odd little quirks for a variety of reasons, but it gives you a framework to start from. One of the really interesting advantages of Redshift — and one that makes for some of those quirks — is that it uses columnar storage, meaning the storage strategy organizes data by columns, which allows for really powerful compression algorithms. There are links at the end of my deck that will tell you more about that if you're interested, but it does mean there are performance differences you might not have come across in previous work with relational databases, which are more often row-based; we'll talk about that a little later. There's also parallelization: because Redshift is distributed — though you can still have clashes, especially when manipulating data, because it is one relational database — when we look at the architecture diagram, the parallelization is a really big advantage. And, as AWS does with lots of other tools (SageMaker, for example), you can programmatically create and destroy instances and clusters in Redshift, so you can scale resources up and down depending on the size of the data you're working with, and not keep infrastructure around that you spent a lot of resources creating but no longer need.

So here is an overview of the architecture for Redshift. It's a distributed data storage solution — you may be familiar with other distributed storage options; Elasticsearch is one example, though that of course is not relational. The diagram shows, at the top, the client applications: the end user, which is where you sit. Your interaction as a user of Redshift will be with the leader node. Queries are sent to the leader node, and the leader node is responsible for interpreting the query that's been sent — not for retrieving the data — determining the best way to divide up the tasks contained within that query, and passing them to the compute nodes. Those receive instructions from the leader that have already been interpreted: what they see won't look exactly like the query you sent, but like a revision of it that hands a chunk of the work to each compute node, which then runs the request for data on the slices of data split across each node. You use distribution keys — we'll talk about those in just a second — to decide how your node slices are divided. You might use a distribution key that is date-based, so that data with the same or nearby timestamps is stored together, or you might use, say, a customer ID.
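As a sketch of what a distribution key and sort key look like in DDL — the table and column names here are hypothetical, not from the talk — the statement might be:

```python
# Hypothetical Redshift DDL illustrating a customer-ID distribution key and
# a timestamp sort key. Names are made up for this sketch; with a live
# cluster you would run the statement through any Postgres-compatible
# client (psycopg2, for example).
CREATE_EVENTS = """
CREATE TABLE events (
    event_id    BIGINT      NOT NULL,
    customer_id INTEGER     NOT NULL,
    event_ts    TIMESTAMP   NOT NULL,
    event_type  VARCHAR(64)
)
DISTKEY (customer_id)   -- rows for one customer land on the same slice
SORTKEY (event_ts);     -- rows kept in timestamp order within each slice
"""
```

The choice between a date-based key and an entity key like `customer_id` depends entirely on how you expect to join and filter the data, which is the point made below.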
So customer IDs that are close together will be stored together, and so on. There are a few things that are valuable for making sure you're using Redshift to its full potential. Remember, we're talking about a relational database, so there are keys and relationships between data — but it is also distributed. Use the distribution key to make sure that data you think you'll want to pull together is stored together, and that the data is organized in a way that makes sense for the relationships within it. You're the only one who can really know that, because this is your business data, and knowledge of the underlying meaning of the data is key to deciding the right way to store it.

One really interesting thing Redshift allows you to do is ask the database itself for the most appropriate compression algorithms and encodings. There are different ways you can apply compression to different fields in your data — compression choices are set column-wise — and if you run the function ANALYZE COMPRESSION on a table you're thinking about, or that you've created, it will tell you the appropriate compression, which you can then apply after actually creating the table. That's really nice. In general, when working with relational databases, I think planning and strategic concepts are important, because you want to make sure the data storage you're creating will scale with you, grow with you, and adapt to the data you're going to pass it. We don't always know what's going to change about our data or our business in the future — that's absolutely understandable — but what we can do is try to accommodate expected or reasonably foreseeable changes.

There's another kind of key you can use in Redshift in addition to the distribution key: the sort key, with which you organize your data within the nodes and slices it's distributed to. This can be really useful if you choose a sort key on a field you often select by. Your distribution key might be the one that makes sense for storage — often the timestamp or something like that — while your sort key might be something different that you use to arrange the data in your queries, and you can get a lot out of it.
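To make the compression step from a moment ago concrete, here is a hedged sketch; the table and column names are hypothetical, and with a live cluster you would execute these through a Postgres-compatible client rather than just composing strings:

```python
# Step 1: ask Redshift to sample a (hypothetical) table and suggest an
# encoding for each column.
ANALYZE_SQL = "ANALYZE COMPRESSION events;"

# Step 2: apply the suggestions. One common pattern is to create a new
# table whose columns carry the suggested ENCODE settings, then reload it.
def encoded_table_ddl(table, column_defs):
    """Compose CREATE TABLE DDL whose column specs include ENCODE clauses."""
    cols = ",\n    ".join(f"{name} {spec}" for name, spec in column_defs.items())
    return f"CREATE TABLE {table} (\n    {cols}\n);"

# az64 and zstd are real Redshift encodings; which one ANALYZE COMPRESSION
# actually suggests depends on your data.
ddl = encoded_table_ddl("events_encoded", {
    "customer_id": "INTEGER ENCODE az64",
    "event_type":  "VARCHAR(64) ENCODE zstd",
})
```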
EXPLAIN is another nice little function along the same lines as ANALYZE COMPRESSION — a helper that makes it easier for you to use this storage effectively. When you run a query, you send it to your leader node, and the leader node decides how to interpret it and pass it out to the cluster you're working with. If the leader node receives a query and determines that the only way to fulfill it is to send the entire request to one node, and maybe one slice, that's going to be very inefficient, because you're not distributing the work — you're leaning expensively on one particular little slice. You can run EXPLAIN on the query you're planning to run before you send it to your leader node, and what it does is produce the plan it thinks will be used to execute the query. It just gives you the plan and says, hey, does this look sensible? If you see that it's going to be highly cost-inefficient, you can revise your query accordingly.

One thing to keep in mind is that Redshift is very eager to help — it has a lot of nice functions built in to make your use of it more effective — but it will also allow you to do some things that might be, I guess, improved upon. For example, it doesn't require keys; it doesn't enforce unique keys on fields. You can create tables that have no keys and no embedded relationships identifying their relation to other tables in your database. If you know what you're doing, that can be okay, but if you're building tables that don't relate to any other tables, and you're just creating extra content without a plan for how it's supposed to operate, you can create technical debt: tables that only one person understands, that only one person knows about. A lot of database systems have safeguards built in to try to avoid those sorts of situations; Redshift is not one of them. That's just one thing I wanted to bring up: you need to be thinking about the choices and architecture decisions you're making in your database, because Redshift is not going to force you into the better habits.
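Going back to EXPLAIN for a second, here is a minimal sketch of that check; the query and the client connection details are hypothetical, not from the talk:

```python
# Hypothetical query whose plan we want to inspect before running it.
QUERY = "SELECT customer_id, COUNT(*) FROM events GROUP BY customer_id;"

def explain_sql(query):
    """Wrap a query in EXPLAIN so the leader node returns its plan
    instead of executing the query."""
    return f"EXPLAIN {query}"

# With a live cluster you might then do something like (assumed details):
#   import psycopg2
#   conn = psycopg2.connect(host=..., port=5439, dbname=..., user=..., password=...)
#   with conn.cursor() as cur:
#       cur.execute(explain_sql(QUERY))
#       for row in cur.fetchall():
#           print(row[0])   # one line of the query plan per row
```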
Here's an example of creating a table in Redshift. If anyone is familiar with using Postgres, or SQL in general, this will look very familiar. I'm using a few different features here. You'll see that the first row uses a sort key — I'm sorting by the customer identifier — and it's also using IDENTITY(1,1), which is an auto-incrementing identity key field for this customer, starting at 1 and incrementing by 1. So I'm creating an integer that will be unique for every customer added to this table, and Redshift is going to do that for me: I never have to assign it, I never have to set it again. That gives me a known, unique identifier for customers that I can reuse later on, and I'm setting it as the primary key — even though Redshift is not enforcing uniqueness on that primary key. Just a little example.

I also think there's wisdom in using the COPY command — which you can use to copy data from, say, S3 into Redshift — and in being explicit when you use it. This is a very common way to take unstructured data from S3 and put it into your Redshift database, where it becomes orderly and easy to use for machine learning purposes. You don't strictly need to specify the column names, but I really recommend that you do, because if data comes into S3 with, say, customer_home_base and customer_home_base_key reversed, you'll never see an error from the COPY command — unless you set up the columns and told it what to look for. Just a handy tip that I recommend.

Next, there are a few things I can tell you not to do. The EXPLAIN function, and the other advice about figuring out how to organize your data across the cluster, might seem like it's not really that important, or not something to think a lot about. Let me tell you: I have accidentally written a query that was very, very poorly optimized, set it running on a Friday, and come back on Monday to find it still running — because Redshift will try to do what you ask. It will try very, very hard to complete the request you've made, but if you make a dumb request, as I did, it will just run and run and run and never get there. So use that EXPLAIN function, really do the optimization, and you will find a much more pleasant user experience.

I've already talked a little about the technical debt of excessive one-time-use tables. You can use views and other SQL tools in Redshift to avoid having that sort of thing happen. Keep your database tidy, and that will make the modeling process, and the machine learning tasks at the end, easier. Use those keys; use those relationships, even though Redshift isn't going to force you to. And use the compression, because you really are going to see a tremendous amount of space savings in your data storage — especially these days, when organizations, companies, and businesses are collecting data in massive volumes. Finding an easy, convenient way to reduce the quantity of data you're storing can really save a lot of money for your organization.

We've talked about the data storage option we're using in this workflow; now I'm going to talk about Airflow, which is a job-scheduling tool for programmatically building, scheduling, and maintaining the tasks that you want to run at some interval — any interval is usable, even ad hoc. It's Python-based; that's the first thing to keep in mind, so bring that to your thinking about all of the tools we're going to discuss — the programmatic syntax we'll talk about is Python. The core concept is the directed acyclic graph, or DAG, and the diagram at the bottom of this slide is exactly that. The phrase sounds a little gnarly, but it's actually easy to work out. "Directed" means it goes in one direction — it's chronological, working in a temporal fashion. "Acyclic" means it doesn't go back on itself: there's a start and there's an end, and there has to be exactly one of each. And it's a graph: a diagram of tasks and their relationships to each other. Airflow as a tool allows you to do testing, versioning, and monitoring; it has great logging and a great UI for visualizing the processes your tasks are undergoing; and it's open source, so you can view the entire code base and submit bug fixes or additions you think will be valuable.
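Before we dig into the pieces one by one, here is a hedged sketch of the overall shape of a DAG definition file. All the names are made up — this is not Journera's actual code — and the Airflow-specific lines are shown in comments so the snippet stands alone without apache-airflow installed:

```python
from datetime import datetime, timedelta

# Settings shared by every task in the DAG; the values are illustrative.
default_args = {
    "owner": "data-team",
    "start_date": datetime(2021, 1, 1),
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    # "on_failure_callback": page_on_call,     # e.g. a custom PagerDuty hook
    # "on_success_callback": resolve_on_call,  # resolve the page on recovery
}

# Cron syntax for the schedule: run every day at 06:00.
schedule = "0 6 * * *"

# With apache-airflow installed, the DAG shell and a first task would follow:
#
#   from airflow import DAG
#   from airflow.operators.bash import BashOperator
#
#   with DAG("warehouse_load", default_args=default_args,
#            schedule_interval=schedule) as dag:
#       load = BashOperator(task_id="load_s3_to_redshift",
#                           bash_command="python load.py")
```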
You'll notice that we're talking again about a distributed toolkit — and that, I think, may be an indication of where data science is going in the modern tech world; at Saturn we use distributed tools all the time as well. You, the user, are on the left side of the screen, and you'll be interacting with Airflow through its UI. It can be connected to your GitHub account: for example, you could create your Airflow DAGs, push them to a GitHub repo, and the web server will pull them every five or ten minutes, or however often you decide. You can view what's going on, and how your cluster is operating, at any time through the UI. The scheduler and the web server are the two tools that act as intermediaries between you and the actual cluster workers. The web server is serving you the UI — that makes sense. The scheduler is taking all of the requests you make — the jobs you've put onto the Airflow cluster — and deciding where they go, which worker is going to handle each one, and how to organize them in the most time- and processor-efficient way. This is another place where your Ops person might be helpful — someone a little more experienced in setting up server-based infrastructure could be someone you want to get in touch with — but it is possible to install this yourself, I promise, and there are great tools and resources online for learning more about it.

A directed acyclic graph is created using what's called a DAG definition file, and I'm going to talk a little about exactly what that is. It can look a little scary when you're starting out — you look at a DAG definition file and think, oh my goodness, what is all this stuff? So let's talk about what you're doing: you're giving your instructions to Airflow so that it knows what you want to do and how you want it done. That's really all it is. You start off with the tools you're going to need. In this example, you import the Airflow tools, of course, because that's what you're actually building with; you import whatever other Python libraries you'll be using, which is totally fine; and then you might have custom-written tools — Airflow tools you've customized for your own use cases; we'll talk about those in a minute. And if you're working with Redshift, you may need to write SQL, since that's how you'll interact with your database. Custom SQL can be written as strings and staged in a Python script, and then you can import it just like a Python library.

Having told Airflow which libraries and modules we'll be working with, we have a good start — but think about this: we're setting up a job. It can have multiple tasks doing different little steps, but we need them all to understand they're part of a single unit. So we create what are called default args: who the owner is, when the start date of the job is, and so on — information about the whole package that needs to be available to every subtask. This one includes a couple of things you'll notice: on-failure and on-success functions. These are some of those custom functions I mentioned earlier, which I've written specifically so that our own on-call system — PagerDuty, in this case — is notified when jobs fail, or when they succeed after having previously failed. That way we can trigger an alert to whichever technical person is on call, or resolve it if the job recovers.

Now we're into the real meat of the thing: what is this DAG actually going to do? We start by creating the shell of a DAG — the shell of our work for today — and we don't need any tasks in it yet. We give it the interval of the schedule, using cron syntax, which is maybe not the easiest thing to read but is certainly very compact, and we pass in the default args from earlier, so that everything that goes inside this DAG — all of our little jobs — has access to those pieces of information. Finally, we use what are called operators to define tasks.

Let's talk about what an operator is, because this can get a little confusing if you're not familiar. A task in Airflow is just setting up an action to happen — scheduling, so to speak, that some job should be completed at a certain time. The operator is the class you call to initialize that action. It's not going to run when you initialize it — we're talking about a class here — so once you've initialized an operator, it's ready to run when the scheduled time arrives. It's being put in place, ready to go, waiting for its time to come. Almost anything can be an operator, and the operators people use a lot are already built into Airflow — there's no point in rewriting them every time. Run a bash script? That operator already exists; you don't have to write it yourself. Run a Python script? Same thing: you write your Python, and then you just say, Airflow, go to this location, find this script, and run it when the schedule says so. It really won't take nearly as much mental energy or extra effort as you might think. Getting data from S3 into Redshift, or running a Redshift command — those are all built in too. An Airflow operator can really be anything you want; it's just a class you call to say: here's the one we're going to run, here's the script — get it ready, because when the schedule says go, we're going to go.

There are also, of course, ways to write your own custom operators. I find that starting with one that already exists, tweaking it, and saving it under a new name is really my preferred way to do it, because I try as much as I can to avoid rewriting things that don't need to be written again from scratch. But you've got lots of options, and there's a whole open source community of people who will be happy to give you advice, input, and help with writing an operator. You don't need to submit it to the open source project on GitHub, either — you can just run it in your particular installation of Airflow and have it work for what you need. So this might look a little scary, but this is how we would actually initialize an instance of an operator.
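As a hedged sketch of such an initialization — the parameter values are made up, not taken from the talk's slide, and the Airflow-specific import is kept in comments so the snippet stands alone:

```python
# Hypothetical SQL for an Athena task; in a real project this string would
# usually live in its own Python module and be imported.
FIND_RECORDS_SQL = """
SELECT record_id, customer_id
FROM raw_events
WHERE event_date = '{{ ds }}';   -- built-in Airflow macro: the run date
"""

# Hypothetical keyword arguments for a task like the slide's "find_records".
athena_task_kwargs = {
    "task_id": "find_records",
    "query": FIND_RECORDS_SQL,
    "database": "travel_data",                        # made-up database name
    "output_location": "s3://my-bucket/athena-out/",  # where results land
    "aws_conn_id": "aws_default",   # credentials held in Airflow's own store
}

# With apache-airflow-providers-amazon installed (the class was called
# AWSAthenaOperator in older releases, AthenaOperator in newer ones):
#   from airflow.providers.amazon.aws.operators.athena import AthenaOperator
#   find_records = AthenaOperator(dag=dag, **athena_task_kwargs)
```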
This particular operator is the AWSAthenaOperator, which means we're going to run a query on AWS Athena. Athena is the interface tool used to run SQL queries on data in S3 storage — that's the basic way of describing it. In this case we've told the AWSAthenaOperator its name, naturally, and which DAG it belongs to — which matters because, remember, earlier we created a new DAG instance and gave it a name. We're giving it some query text: SQL we've written in a Python script as a string, with a built-in Airflow macro passed in — I've put some links at the very end of my deck that talk more about what those macros are all about. Then we're giving it connection credentials for AWS, which are conveniently stored in credentials management in Airflow, so you don't need to handle that yourself, and we're telling it where to save the results. That's basically the whole of it. This is how we set up the task find_records — but remember, it's not going to run yet, and it won't run until we tell Airflow to run the job, the DAG, that it lives in.

To rehash a little: the thing that really makes Airflow easier to work with is understanding the directed acyclic graph — understanding what needs to run, and when, and using the parallelization opportunities that exist, because you can run a lot of things in parallel with Airflow if you think carefully about organizing the tasks that need to be run. User credentials management is easy, because you can just pass in a keyword and Airflow will know exactly which credentials you're using for that work — you don't have to set a whole bunch of environment variables manually. And the UI for Airflow is really quite nice, especially for an open source project — there are a lot of, you know, slapped-together tools out there — for actually figuring out what happened and seeing how your jobs are running.

There are a few things to keep in mind, though, when you're working in Airflow. If your worker is holding a lot of data in memory — say you loaded in a whole CSV or something like that as a pandas DataFrame — and you don't need all of that data to be accessible to the worker at any given time, I recommend not keeping it all around. You can implement streaming, or load certain amounts and not others — there are a lot of options — but I have found, accidentally, that I overloaded workers by putting too much in: just loading data, thinking it wasn't going to be a big deal, and leaving stuff in memory without actually releasing it. DAG code can also get a little complicated, and you can end up accidentally duplicating things a lot; there are opportunities for efficiency in all of this, and I recommend taking a look at the ways your DAG code can be a little more streamlined. And keep track: look at the logs, look at the errors, and make sure you know what manipulations you're doing on your data in an Airflow job, because that will make it easier to debug later if something should go wrong.

Now I'm going to give you a couple of quick practical use cases, since we're running close to the end of our time — I'll make sure we get a couple of these in, and then you'll be able to ask me questions afterward on Twitter or online, if you're so inclined. This is just a view of the things we're going to touch on: data warehouse population, scheduling tasks for data analysis and modeling, and managing and updating your data warehouse — including making sure that end users can get hold of it. A data ingestion workflow might look like this: you start with data in S3 and pass it to Redshift. This is the COPY command, more or less, but it includes a lot of really robust error handling and management of retries, because you're using Airflow — Airflow gives you really big advantages in that regard. You can schedule it at the interval that makes sense for you, and you'll have the tools to figure out what went wrong if something fails. Updating data is a little more complicated, because when we're working in this kind of infrastructure, we want to use a single transaction to avoid data loss. So we do a single job of deleting the pre-existing data and inserting new data from an intermediate table. I've got another slide that explains a little more about how that works, but I'll leave that part for you to read, because I want to talk about the machine learning workflows.

Here's an example. You can have your training data stored in Redshift — we may have ingested it from S3 in an earlier workflow, for example — and then an Airflow DAG can actually do your model training, and that trained model object can be pickled and stored in S3, or wherever is convenient for you. Then your prediction on new data can use the same format of data in Redshift: presumably you pull that, and those two sources — the stored model and the new data — can work together to create new predictions. This whole thing can run as two Airflow jobs: as long as your data is stored appropriately in Redshift, you can train the model monthly, or whatever is appropriate, and then run predictions daily, or even more often, using Airflow. There's also the prediction evaluation workflow, where you would just take the additional step of creating a job for evaluating predictions against your ground truth data and outputting the results for safekeeping. I have a note here about updating, and about the columnar-storage and compression reasons why INSERT is much easier to work with in Redshift than UPDATE, but I don't think I have time to talk much about that. If you have questions about it, please let me know and I will be happy to tell you. You can find me on Twitter, and I think there are a couple of links here that will give you more information, available on GitHub. Otherwise, I am very grateful that you had me, and I look forward to hearing from folks online to talk about these things.
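(A postscript for readers: the single-transaction delete-and-insert update pattern mentioned above might look roughly like this sketch. Table names are hypothetical, and the SQL would be executed through a Postgres-compatible client against the cluster.)

```python
# Hypothetical sketch of the update pattern: one transaction replaces stale
# rows in the target table with fresh rows already loaded into a staging
# table, so readers never observe a half-updated table.
UPDATE_TXN = """
BEGIN;
DELETE FROM events
 USING events_staging
 WHERE events.event_id = events_staging.event_id;
INSERT INTO events
SELECT * FROM events_staging;
COMMIT;
-- TRUNCATE commits implicitly in Redshift, so clear the staging table
-- only after the transaction above has committed:
TRUNCATE events_staging;
"""
```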