Duration 39:43

Functional Data Engineering - A Set of Best Practices | Lyft

Maxime Beauchemin
Software Engineer at Lyft
DataEngConf SF '18
April 17, 2018, San Francisco, USA

About speaker

Maxime Beauchemin
Software Engineer at Lyft

Data engineer extraordinaire and open source enthusiast. Creator and lead committer on Apache Airflow and Apache Superset.


About the talk

WANT TO EXPERIENCE A TALK LIKE THIS LIVE?

Barcelona: https://www.datacouncil.ai/barcelona

New York City: https://www.datacouncil.ai/new-york-city

San Francisco: https://www.datacouncil.ai/san-francisco

Singapore: https://www.datacouncil.ai/singapore

Download slides: https://www.datacouncil.ai/talks/functional-data-engineering-a-set-of-best-practices?utm_source=youtube&utm_medium=social&utm_campaign=%20-%20DEC-SF-18%20Slides%20Download

Read more about the talk in this blog: https://dataeng.co/2s7hEGV

ABOUT THE TALK:

Batch data processing (also known as ETL)  is time-consuming, brittle, and often unrewarding. Not only that, it’s hard to operate, evolve, and troubleshoot.

In this talk, we’ll discuss the functional programming paradigm and explore how applying it to data engineering can bring a lot of clarity to the process. It helps solve some of the inherent problems of ETL, leads to more manageable and maintainable workloads, and helps implement reproducible and scalable practices. It empowers data teams to tackle larger problems and push the boundaries of what’s possible.

ABOUT THE SPEAKER:

Maxime Beauchemin works as a Senior Software Engineer at Lyft where he develops open source products that reduce friction and help generate insights from data. He is the creator and a lead maintainer of Apache Airflow [incubating], a data pipeline workflow engine; and Apache Superset [incubating], a data visualization platform; and is recognized as a thought leader in the data engineering field.

Before Lyft, Maxime worked at Airbnb on the "Analytics & Experimentation Products team". Previously, he worked at Facebook on computation frameworks powering engagement and growth analytics, on clickstream analytics at Yahoo!, and as a data warehouse architect at Ubisoft.

FOLLOW DATA COUNCIL:

Twitter: https://twitter.com/DataCouncilAI

LinkedIn: https://www.linkedin.com/company/datacouncil-ai

Facebook: https://www.facebook.com/datacouncilai

Transcript

My name is Maxime, and I'm going to be talking about functional data engineering and a set of best practices related to this functional approach, functional programming. I'll get much more into that, but before I do, I want to set a little bit of context for this talk. First, a little bit about me, though pretty much everything has already been said: before Lyft I was at Airbnb, Facebook, Yahoo!, and Ubisoft, and while I was at Airbnb I started two open source projects

that are now incubating at the Apache Software Foundation. The first one is Apache Airflow. It sounds like a lot of you in the room are already pretty familiar with it: Airflow is a workflow engine, a tool to help you author, schedule, and monitor your data pipelines. It's largely inspired by a tool at Facebook called Dataswarm, and it has gotten super popular over the past few years. Since then, I've been writing more JavaScript lately, so it's rare that you see someone who writes a lot of JavaScript speak at a data engineering conference, but I guess

it's happening now. Superset is a data visualization platform: dashboarding, data exploration, and all that good stuff. If you haven't heard about it, I urge you to take a look, it's getting really good. Superset has engineers from Airbnb, Lyft, Twitter, and Apple collaborating, and the project is really taking off at the moment.

This talk today is based on a blog post I wrote recently called "Functional Data Engineering — a modern paradigm for batch data processing." It takes place in the context of a few blog posts I wrote over the past few years. The first blog post I wrote about data engineering was called "The Rise of the Data Engineer," and it was about trying to really understand what the data engineering role is: how it fits in relation to data science and data platform engineering, and how it fits in relation to historical jobs and titles like business

intelligence engineer and data warehouse architect. That's what that post is about. If you're a data engineer and you haven't read it, I'd encourage you to take a look; I really tried to make it a manifesto for what data engineering is. Then, over the years, I realized it's sometimes ungrateful work to be a data engineer: tooling is really hard, the space moves really fast, and the role is really underappreciated. Hopefully other people are going to talk at the

conference about that and about solving these problems. So yes, I wrote a follow-up all about the hardships of being a data engineer nowadays. Then, more recently, as I said, I started writing a lot more JavaScript for Apache Superset, and there's this thing in the JavaScript community where people are getting really into the functional programming paradigm. So I drank the Kool-Aid, got interested in functional programming, and drew a lot of parallels between the best practices that we used at places like Facebook and Airbnb and the functional programming

paradigm, and this is what I'm about to talk about today. Some of that stuff is old news. When I showed it to my colleagues from Airbnb and Facebook, some of it sounded like total old news to them, with a sense of "of course, you don't have to tell people this, everyone knows that stuff." But that's not always true. I realized that a lot of the best practices I took for granted at places like Facebook and Airbnb were not necessarily widespread; they're not common knowledge. I've also been doing kind of a roadshow for Airflow,

visiting a lot of companies, talking about open source and about Airflow, and I realized that in those places these best practices were not common knowledge either. So I decided to write a blog post about this and use it as some sort of manifesto, a way to set out an approach to doing things, and then, you know, maybe talk at conferences about it. And here I am. Before I get into functional data engineering, I want to get into functional programming. I'm sure a lot of people in the room are

fairly familiar with object-oriented programming; that's what is taught in schools, and I'm sure pretty much all of the engineers in the room would be able to describe it. Functional programming is just an alternate approach: it's a different paradigm, a different way to think about structuring and organizing your code. I'm not going to read the whole definition from the Wikipedia page, because I didn't come here to read Wikipedia articles, but I am going to raise a few points that I think are most relevant, most interesting, and most relatable to data engineering.

I'll start by reading just one line from the Wikipedia article, which says that in functional code, the output value of a function depends only on the arguments that are passed to the function. So calling the function twice with the same input will result in the same output. That brings me to the concept of a pure function. Pure functions in functional programming are functions that are limited to their own scope, which means that given the same input, they will produce the same output. That's a nice guarantee: it's easy to

reason about, there are no side effects, and you can easily unit test these functions. Just knowing that a function is limited in scope brings clarity to the process. The next concept in functional programming is immutability. Immutability is foundational to functional programming, and the idea is that in pure functional languages you won't be able to reassign a variable; a variable is a constant. Once a variable has been assigned, you cannot

reassign it. Once you create an object, that object is immutable; you have to somehow create a new object and assign it to a new constant. Because of that, it completely changes the way you code and the way you think about your code: if you see a variable that has been assigned, you know it will stay the same within that scope. I'll get into how this translates into data engineering concepts in just a moment. Another key concept in functional programming is idempotency. Idempotency is the property of certain operations in

mathematics and computer science that they can be applied multiple times without changing the result beyond the initial application. That means if you have a function, you can run it one time, you can run it two times, you can run it ten times, and you'll get to the same desired state with confidence. A simple, kind of stupid example: if I write a function that says "add a little bit of water to this glass," that is not an idempotent function. If I keep running this function, at some point the glass of water is going to spill. If instead I write a

function that says "fill up the glass with water and fill it no more," that is an idempotent function, and I can run it at any point in time and get to the desired state without spillage.
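
To make that concrete in code, here is a tiny Python sketch of the same glass-of-water example (a toy illustration, not from the talk's slides):

```python
def add_water(glass: dict, ml: int = 50) -> None:
    """NOT idempotent: every call changes the state a bit more,
    and running it enough times makes the glass overflow."""
    glass["ml"] += ml

def fill_glass(glass: dict, capacity_ml: int = 250) -> None:
    """Idempotent: one run or ten runs leave the glass in the same
    desired state -- full, with no spillage."""
    glass["ml"] = capacity_ml

glass = {"ml": 0}
for _ in range(10):
    fill_glass(glass)
assert glass == {"ml": 250}  # same result no matter how many times it ran
```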

So now let me talk about functional data engineering and try to apply the concepts I just talked about, and see how they map into the data engineering world. When I say data engineering here, I'm mostly thinking about batch computation — you know, I'm the batch guy, because I wrote Airflow, or at least the first version of Airflow — and I'm also thinking mostly from a data warehousing perspective. I know data engineering is not just batch, it's also streams, and it's not just data warehousing, it's all sorts of computation. The concepts I'm talking about today certainly apply to data warehousing and batch processing, and in many cases they will apply to streams and to other types of computation; they may or they may not.

The first thing I want to mention is one of the motives for this methodology I'm calling functional data engineering: reproducibility. As you know, reproducibility is foundational to

the scientific method. If you're a scientist and you make some great claims and you write an article with outstanding results, your peers need to be able to reproduce your results, or science has not advanced by an inch. Reproducibility can also be critical from a legal standpoint. And it might be critical from a sanity standpoint: if you're a data engineer and you run the same job today as you ran yesterday and you get different results, you'll probably go a little bit insane over time. You're probably

already a little bit insane if you're a data engineer anyway, but you don't want to make it worse. So this functional approach that I'm talking about today is all about reproducibility, and about sanity too.

Now, on to immutable partitions. I talked a little bit about immutability being a key concept in functional programming. In the data warehouse, you can think of immutable partitions as the building blocks of your data warehouse; partitions become kind of the foundation of your warehouse, the equivalent of immutable objects. Because you don't want to mutate anything, you add these partitions into your warehouse one at a time, and you probably have to partition all of your tables, because you don't want to mutate your tables; you don't want to change them every day. You want to systematically partition everything, and you want your ETL schedule to align with your partitioning scheme: if you're running daily jobs or hourly jobs, you should have daily or hourly partitions in pretty much all of your tables.

And what do you think about when you think of data lineage? If I say the words "data lineage," you close your eyes and you probably see a graph of tables: my data is this graph of tables. But when you use partitioning, you can start thinking of data lineage as a graph of partitions, so the mental model gets extended here. If you think of any given row, say in that sales aggregation table, you can easily attach it to the partition it's in,

and from any partition you can kind of infer its data lineage, and therefore get provenance, traceability, reproducibility, and all that good stuff. Now, pure ETL tasks. Earlier I mentioned pure functions in functional programming; pure ETL tasks are simply tasks that are idempotent: if you re-run them, you're safe, you know you're going to get to the desired state. This is great in distributed systems. Say you have an Airflow cluster, or any form of large distributed computation

taking place, and something fails halfway: you don't know if it succeeded, if it ran, if it failed, but you know you can always get back to the state where you want to be by re-running this idempotent task. These tasks are deterministic: given the same source partitions, they will produce the same target partition. They have no side effects, they use immutable sources, and they usually target a single partition, so they're easy to reason about. It also means that you're never doing UPDATE, UPSERT, APPEND, or DELETE; you're essentially inserting new partitions all the time, or you are insert-overwriting

partitions. Of course, that doesn't mean you should never do those types of operations — update, upsert, append, delete — but if you want a pure, kind of functional data warehouse, you probably want to always insert-overwrite a new object, a new partition. You generally also want to limit the number of source partitions that you're scanning, just for the sake of simplicity and to keep your computation units fairly simple; I'll talk a little bit more about that later.
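
As a rough sketch of what such a pure, idempotent task can look like in practice (the `run_query` helper and the table names are illustrative assumptions, not from the talk): the task reads only its source partitions for the schedule instant and insert-overwrites exactly one target partition, so re-running it for any date, in any order, converges to the same state.

```python
from datetime import date, timedelta

def run_query(sql: str) -> None:
    """Hypothetical stand-in for a Hive/Presto/Spark SQL client."""
    print(sql)

def build_agg_sales_daily(ds: str) -> None:
    """A 'pure' ETL task: fixed source partitions in, one target partition out,
    no other side effects, safe to re-run any number of times."""
    run_query(f"""
        INSERT OVERWRITE TABLE agg_sales_daily PARTITION (ds = '{ds}')
        SELECT supplier_id, SUM(amount) AS total_amount
        FROM fact_sales
        WHERE ds = '{ds}'
        GROUP BY supplier_id
    """)

# A backfill is just re-invoking the same function over a date range; order
# does not matter because every run targets only its own partition.
start = date(2018, 1, 1)
for i in range(31):
    build_agg_sales_daily((start + timedelta(days=i)).isoformat())
```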

One thing I want to talk about is this idea of a persistent staging area. The term "staging area" — if you read any of the books about data warehousing from 15 or 20 years ago, you'll realize it's not a new term. The idea of a staging area is that all of the data that makes it into your data warehouse would at some point pass through your staging area. Usually your staging area is not transformed at all, so you bring the raw ingredients from your external sources into your warehouse pretty much untransformed. The old books about data warehousing

would debate whether you should have a persistent or a transient staging area. Nowadays, with pretty much infinite cheap storage, I would argue that in almost all cases you want a persistent staging area. That means any time you bring data from external systems, you load it into your warehouse untransformed and essentially leave it there forever. There might be some exceptions: you might have some very high-volume, low-value data that you might want to compact or compress in some way, but in general I would argue you should just bring it into your

warehouse, make it immutable, put it into a read-optimized file format like Parquet or ORC, and leave it there forever. Knowing that the data in your staging area is the raw ingredients, and that your transformations are the recipes, you know you can rebuild the whole warehouse from your raw data and your compute at any point in time, and knowing you'll land back on your feet is a really nice assumption to have.
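
A minimal sketch of what landing data into such a persistent, immutable staging area can look like, assuming a PySpark session and made-up S3 paths (none of this is from the talk itself):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage_raw_suppliers").getOrCreate()

def stage_raw_suppliers(ds: str) -> None:
    """Land the day's raw extract, untransformed, into a columnar,
    date-partitioned location -- and never mutate that partition again."""
    raw = spark.read.json(f"s3://example-bucket/exports/suppliers/{ds}/")
    (raw.write
        .mode("overwrite")  # overwrite makes re-runs idempotent
        .parquet(f"s3://example-warehouse/staging/suppliers/ds={ds}/"))
```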

Now let me get into a set of common challenges and solutions. These are common challenges around batch data processing and data warehousing in general, and solutions that are offered by this functional approach to data engineering. Bear with me here, I'm about to get into something a little bit complicated: this idea of slowly changing dimensions. Who here would say they know pretty well what slowly changing dimensions are all about?

This comes from the old books about data warehousing: Ralph Kimball, in the Lifecycle Toolkit or The Data Warehouse Toolkit, I believe, wrote about it originally. A slowly changing dimension is this idea that in dimensional modeling — if you're going to use fact tables and dimension tables to model your business data — your business entities are modeled as dimensions, and the attributes of your dimension members typically change slowly over time. The question is how you model that in the data warehouse. Here are some images of antiquated ETL

drag-and-drop tools, where people would drag and drop their logic as opposed to writing it in code — I don't know how they would check that into Git — but those are the early days of ETL and data warehousing, drag-and-drop stuff, and they tried to handle slowly changing dimension type 2, which I'm going to get into. So today is not only about functional data engineering, it's also a little bit of a primer on the antiquated approaches to slowly changing dimensions and how to model change in your dimensions.

The Wikipedia page about slowly changing dimensions sets out a very simple example that illustrates the different types of approaches. I know this slide is really small and maybe you can't read it very well, so I'll try to read it for you. The first approach, slowly changing dimension type 1, is where you simply overwrite the data. Here we're looking at a supplier dimension: picture a big business, you have suppliers, and you want to structure all your suppliers into a specific

dim_supplier dimension. Then supplier ABC, Acme Supply, moves from Illinois to California. The first approach is just to act as if this supplier always lived in California: you update the dimension table, you update that cell, and now this supplier was always in California. Of course, if you're doing your taxes, that's probably not going to fly very well, so someone is going to say "we need to keep track of that history." The second approach, the one the books rave about,

is type 2, which is to add a new row. Given the supplier code ABC, which is the natural key, you go and create a new surrogate key, say 124, and you start managing effective dates in your dimension table, saying "for this time range, this is the state of my supplier." Now you need to do extra management of this dimension, and it's slightly complicated and full of mutations. Also, when you load your fact tables, you need to do what they call the surrogate key lookup, which is complex

and expensive, and your fact table ends up filled with surrogate keys that are slightly unreadable. (Let me make sure this computer doesn't go into sleep mode again.) The type 3 approach is kind of a half approach, where you create a new column and keep some limited version of history for whatever change you think is important over time. So these approaches have a lot of shortcomings. The type 1 approach is full of mutations and you lose history: if you run the same query today and

yesterday, you're going to get different results, and there's no way to know where that supplier was in the past. In a lot of cases that's not what you need, and even if you think it's what you need now, it might not be what you need in the future, and then you're going to have to go back, remodel stuff, reload your data, and that's no good. Type 2 is effectively super hard to manage: it makes loading your dimensions harder, it makes loading your facts harder, and it kind of forces you to do a lot of upserts and complex ETL.

It's also not a given that you're going to fall back on your feet: if you were to wipe that dimension table and try to get back to the state you were in, you might not get exactly there. Type 3 is kind of a bad compromise, and over time people came up with slowly changing dimension types 4 through 6 — like people didn't know what to invent anymore; these are all kind of the same nonsense, or composite versions of that nonsense.

So what is the functional approach to dealing with changes in your dimensions? It's super simple: just snapshot all the data. That means that for each day, for each one of your ETL schedule intervals, you create a full snapshot of the dimension table as it was that day. So if you have, say, three or ten years' worth of data, you'd have thousands or tens of thousands of duplicated rows in your dimensions. That sounds awful, right? You're like, "Max, you're duplicating all this data that is slowly changing, yet you store the whole thing every day? That's insane."

There are a few things about this approach that mitigate that. The first is that storage is cheap and compute is cheap; there's virtually no limit there. Then there's the fact that dimensional data, in relation to facts, is usually very small and simple. Say your company has a hundred suppliers, or a thousand suppliers, or even a million suppliers: taking snapshots of a million rows nowadays, in BigQuery, Presto, Impala, or Hive, is

a drop in the bucket, really. There's also the fact that storage is cheap and engineering time is expensive. This mental model is a lot easier to reason about, and it gives you good reproducibility, which is invaluable. These trade-offs are trade-offs you might decide to make or not make, but this is the best we have, and it happens to work very well in places like Facebook and Airbnb.
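
A minimal sketch of the snapshot pattern (table and column names are assumptions, and `run_query` is the same hypothetical stand-in as in the earlier sketches): every run of the schedule insert-overwrites a complete copy of the dimension into that day's partition, and history is preserved simply by never touching old partitions.

```python
def run_query(sql: str) -> None:  # hypothetical stand-in for a SQL client
    print(sql)

def snapshot_dim_supplier(ds: str) -> None:
    """Write a full snapshot of the supplier dimension into today's partition."""
    run_query(f"""
        INSERT OVERWRITE TABLE dim_supplier PARTITION (ds = '{ds}')
        SELECT supplier_id, supplier_name, city, state
        FROM staging_suppliers
        WHERE ds = '{ds}'
    """)
```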

To prove my point that the mental model is easy: if you want to run a query joining a fact to a dimension, all you need to do is join on the key and force the latest partition. It's pretty easy nowadays to have macros in your workflow tools that allow you to do things like that. Another approach is to maintain a view — say one called dim_supplier_current — that always points to the latest partition of your supplier dimension. Problem solved. And if you're interested in the attributes of the dimension member at the time of the

transaction, what you need to do is simply join on the key and join on the partitioning date key. Problem solved. There's also a byproduct of snapshotting your dimensions, which is that now you can start doing time series against your dimensional data. For example, if you want to count how many suppliers you've had over time, you can just do a count from dim_supplier grouped by date, and you get a whole new family of queries that perhaps were not that easy before, with slowly changing dimension type 2, but are fairly easy to do now.
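
Hedged sketches of the three query patterns just described, written as Hive/Presto-style SQL (the schema and column names are assumptions):

```python
def run_query(sql: str) -> None:  # hypothetical stand-in for a SQL client
    print(sql)

# 1) "Current" attributes: join on the key and force the latest snapshot
#    partition (often hidden behind a macro or a dim_supplier_current view).
run_query("""
    SELECT f.order_id, d.state
    FROM fact_orders f
    JOIN dim_supplier d ON d.supplier_id = f.supplier_id
    WHERE d.ds = (SELECT MAX(ds) FROM dim_supplier)
""")

# 2) Point-in-time attributes: join on the key AND the partition date key,
#    so each fact sees the dimension as it was on that day.
run_query("""
    SELECT f.order_id, d.state
    FROM fact_orders f
    JOIN dim_supplier d ON d.supplier_id = f.supplier_id
                       AND d.ds = f.ds
""")

# 3) Time series over the dimension itself -- awkward with SCD type 2,
#    trivial with snapshots.
run_query("""
    SELECT ds, COUNT(*) AS supplier_count
    FROM dim_supplier
    GROUP BY ds
    ORDER BY ds
""")
```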

So if you ever hear about slowly changing dimensions again, you can just say that stuff is obsolete: "I heard about this at a conference, let me point you to an article," and move on.

Now I'm going to talk about late-arriving facts, and keep an eye on my watch too. Late-arriving facts are common; the people behind Apache Beam have been beating that drum for a while. One thing that's important about late-arriving facts is that you essentially need two time dimensions in your data

warehouse: event time and event processing time. Given that you care about immutability — landing your data in your staging area, never touching it again, and compressing it into ORC or Parquet files — you need to be able to close the loop on your windows, so you essentially need to partition on event processing time. If you were to partition based on event time, you would have to wait for the window to close; it would take longer and you'd have to keep more data in memory. So, as much as you can, partition on

your event processing time so you can close the loop. Partition pruning might be lost, though. Partition pruning is this thing that database optimizers do where, if you apply a predicate like a date filter on a partition field, the database will only scan a subset of the partitions. Knowing that most queries that analysts, data scientists, and people in general will fire are predicated on event time, and you are now partitioned on event processing time, you might not be able to do partition pruning anymore. So that's what

you lose, and there are ways to mitigate it. First, you can somewhat rely on execution-engine optimizations: if you're using Parquet or ORC files, the execution engine, as it hits a given Parquet block, is going to read the footer and see that there's nothing in that block for the event time you're looking for. So the damage is not that bad: you're scanning more partitions, but effectively you're just reading more Parquet footers and skipping the blocks in a lot of cases. You can also instruct people to apply predicates

on partition fields; knowing your analysts and data scientists, that probably won't happen, but you can always hope that it might. You can also sub-partition by event time: partition by event processing time, because you know it's important to close the loop, and then sub-partition each of those by event time, and you get maybe the best of both worlds, but more files in your warehouse — it's a trade-off. And if you have a lot of time-predicated queries,

you could also pivot: maybe your staging area is partitioned by event processing time, but later on in the ETL you rebuild a certain window of data that you know might change, and you re-partition the data by event time in those fact tables.
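
Two of the mitigations above, sketched out (all table, column, and partition names are assumptions): sub-partitioning by both processing date and event date, and a downstream step that rebuilds a fact table keyed by event date.

```python
def run_query(sql: str) -> None:  # hypothetical stand-in for a SQL client
    print(sql)

# Mitigation: sub-partition by both dates, so the optimizer can prune on
# event date while loads still close on processing date (at the cost of
# more files in the warehouse).
run_query("""
    CREATE TABLE IF NOT EXISTS fact_events (
        event_id BIGINT,
        payload  STRING
    )
    PARTITIONED BY (ds_processed STRING, ds_event STRING)
    STORED AS PARQUET
""")

# Mitigation: later in the pipeline, rebuild a trailing window of the facts
# keyed by event date, accepting that late-arriving rows rewrite recent partitions.
def repartition_by_event_date(event_ds: str) -> None:
    run_query(f"""
        INSERT OVERWRITE TABLE fact_events_by_event_ds PARTITION (ds = '{event_ds}')
        SELECT event_id, payload
        FROM fact_events
        WHERE ds_event = '{event_ds}'
    """)
```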

All right, next we're going to talk about self or past dependencies, and the fact that they should be avoided — if not at all times, then whenever you can. So what is a self or past dependency? Let's say we agree that you should snapshot your dimension data; I spoke earlier about how you should just create a new partition every day for your dimensions. One thing that might seem really natural would be to take yesterday's dim_supplier and apply the set of changes or operations that happened today, and use that to load the next partition. The point I want to make here is that there's this thing I'm just going to make up, called a complexity score: a proxy for the complexity of your ETL might be how many partitions were necessary to compute this partition. If you're loading dim_user from

a database scrape, you might be scanning just a handful of partitions, and that's really easy to recreate, reason about, and parallelize. But if you use dim_user to create dim_user, then if you ever want to backfill it, you're going to have to do it sequentially and go very far into the past, and your complexity score for that partition goes through the roof. One of the reasons people might want to do things like self or past dependencies in their ETL is cumulative metrics or time-window metrics. For instance, it might

be really great to have the total number of rides for a customer right there in the customer table; that's something that's super useful. But while that metric is useful, maybe it shouldn't live in the dimension. Should metrics live in dimensions? Perhaps, maybe, sometimes. But if you're going to compute it, I would in general say please make sure you don't do it with a self or past dependency, and instead rely on a specialized framework that is optimized and efficient at computing cumulative or windowed metrics.
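
For instance, rather than carrying a running total forward from yesterday's dimension partition, a cumulative metric can be recomputed from the immutable fact partitions; a rough sketch (names are illustrative, and a dedicated cumulative-metrics framework would do this far more efficiently than re-scanning all history):

```python
def run_query(sql: str) -> None:  # hypothetical stand-in for a SQL client
    print(sql)

def build_user_lifetime_rides(ds: str) -> None:
    """Compute lifetime ride counts from the facts rather than from
    yesterday's copy of the table: no self-dependency, so any partition
    can be backfilled independently and in parallel."""
    run_query(f"""
        INSERT OVERWRITE TABLE agg_user_rides_lifetime PARTITION (ds = '{ds}')
        SELECT user_id, COUNT(*) AS lifetime_rides
        FROM fact_rides
        WHERE ds <= '{ds}'
        GROUP BY user_id
    """)
```

The trade-off is that each run scans many source partitions, which is exactly the cost a specialized incremental framework is designed to avoid.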

Unfortunately, it's out of the scope of this talk for me to describe a specialized framework that would achieve this. I know those things are fairly common; I should probably blog about it at some point, and I'll be at the office hours in the next hour if you want to talk about this topic specifically.

Now I want to talk about file explosion. At Airbnb we were partitioning everything, and sometimes sub-partitioning things, and that leads to a lot of files in HDFS or in S3 or whatever storage you use for your data warehouse. Each partition

is at least one file, and if you sub-partition everything, your namenode, or HDFS, or S3, will somehow suffer at some point, because you have this explosion of files. There are some ways to mitigate this. One is to be careful around sub-partitioning. Another is being careful about having very short schedule intervals: hourly is probably okay, but you start to get a lot of files if you do things on five-minute windows. And there's also this idea that you

can compact older partitions: if you have multiple years' worth of data and you partition by day, perhaps 2010 through 2016 can be consolidated into fewer, larger partitions. I'm not going to get much deeper into this, but it's something to keep in mind, because it's a byproduct of this approach that is somewhat undesirable and needs to be handled or mitigated.
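
A rough sketch of that compaction idea (the monthly grain and table names are assumptions): cold daily partitions get folded into coarser partitions of a compacted table, cutting the number of files the namenode or object store has to track.

```python
def run_query(sql: str) -> None:  # hypothetical stand-in for a SQL client
    print(sql)

def compact_month(year: int, month: int) -> None:
    """Fold one month of daily partitions into a single monthly partition."""
    ym = f"{year:04d}-{month:02d}"
    run_query(f"""
        INSERT OVERWRITE TABLE fact_sales_monthly PARTITION (ym = '{ym}')
        SELECT supplier_id, amount, ds
        FROM fact_sales
        WHERE ds LIKE '{ym}-%'
    """)
```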

I'm already getting into the conclusion portion of this, and I've got a couple of points I want to make. The first is that times have changed a lot since the original books on data warehousing were written. The data landscape and the technology landscape have completely changed, and we need someone to go and rewrite those books. I like to write code, I don't like to write books, so I'm not going to rewrite them, but I want to briefly talk about some of the core things that have changed and that really change the methodology of how a data warehouse should be built. One is that we have cheap, limitless storage, and distributed databases whose compute

is virtually infinite. We have also seen the rise of read-optimized stores and immutable file formats. At the time these books were written, people had Oracle databases as their stores, which are effectively super mutable: you can go and change any cell in an ACID transaction and all that stuff. Nowadays, if you're using Presto, Hive, Impala, or whatever you use, your data is stored in these immutable segments that are indexed, encoded, compressed, and columnar,

and the practice of going in and updating data is not as easy anymore; but in return you get these read-optimized stores that are really good at data warehousing. Another thing I'd like to point out that has changed: at the time these books were written, you had small teams of highly specialized data professionals building the warehouse for the company. I think that's not true anymore. I think everyone is taking part in the computation and warehousing processing — at least at the companies I've worked at over the past five or six

years, everyone is welcome to use, create, mutate, change, and shape the future of the data warehouse and the data in the company.

The last point I want to make is: first learn the rules, and then break them. That goes for any methodology, whether it's object-oriented programming, functional programming, or this functional data engineering thing I'm talking about. It's good to know the rules, but I'm sure in your environment you have all sorts of reasons to go and make up your own rules and break those rules, and as long as you know why you're

diverging, it's usually a good thing to do. So, thank you, everyone. That's all; I'll be at the office hours right next door. So, thank you. We have time for questions, I think.

Question: Really nice talk. I really like the perspective of taking a full snapshot of your dimensions every time, but the question I have about that is: if you're doing that, don't you run the risk of turning what could be a small data problem that fits into a Postgres warehouse into a big data problem?

Answer: I think it's a no-brainer for all of the very small dimensions. If dim_supplier is a hundred or a thousand rows a day, just don't bother with any complex modeling, just snapshot your dimensions. If you work at Facebook and you're dealing with a user dimension, you might want to rethink that a little bit, because you have billions of rows that you'd be duplicating every day — though I've seen it done: when it becomes the practice, people are so used

to that mental model and to the uniformity, used to working with Hive and partitions in the metastore, that they will just move forward with it. That doesn't mean that for larger dimensions you cannot do a mix of techniques. What you could do is say: for my larger dimensions, I want to apply some of the concepts of the type 2 dimension, or I might want to lower retention for those tables, or I might want to do what we call vertical partitioning, which would be trying to move the fields that mutate a lot out of the dimension and into

some sort of fact table, some kind of user-metrics fact table. So I would say the way to mitigate this is to keep your dimension table with a little bit fewer fields, model your data differently, and maybe in some cases, if it makes sense, go with a type 2 approach where it makes the most sense.

Question: I really like your approach to the persistent staging area. But let's say, hypothetically, a regulator in Europe decides that people can request to be forgotten. How do you deal with that situation?

Answer: Well, you can come and work at Lyft, where we don't really have a European branch just yet, so we don't really have to deal with that right now — or maybe we do, I don't know; I'm not really good at legal things. I think at some point you probably need to anonymize your data, and I'm not sure if anonymizing is enough. I know it's common for companies to have a retention and anonymization framework. That means you would add metadata to your table properties saying

"this table contains non-anonymized data and needs to be anonymized," you'd have a framework that knows about the fields it needs to encrypt or hash, and then you'd have a background process, like a daemon, that comes in, looks at the tables, and moves each immutable partition into an anonymized equivalent of that partition; it works in the background and tries to respect all the rules. That's the best I can explain here — I don't know what the legal constraints are exactly, but I know it's common to have these anonymization frameworks.
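
A very rough sketch of what such a framework can look like (the table properties, column lists, and hashing policy here are all assumptions, not Lyft's actual implementation): a background job reads per-table metadata about sensitive fields and rewrites each immutable partition into an anonymized equivalent.

```python
def run_query(sql: str) -> None:  # hypothetical stand-in for a SQL client
    print(sql)

# Hypothetical per-table metadata, e.g. sourced from table properties.
PII_COLUMNS = {"staging_users": ["email", "phone_number"]}
ALL_COLUMNS = {"staging_users": ["user_id", "email", "phone_number", "signup_ds"]}

def anonymize_partition(table: str, ds: str) -> None:
    """Rewrite one immutable partition into an anonymized equivalent,
    hashing the columns flagged as PII and passing the rest through."""
    pii = set(PII_COLUMNS.get(table, []))
    select_list = ", ".join(
        f"sha2({col}, 256) AS {col}" if col in pii else col
        for col in ALL_COLUMNS[table]
    )
    run_query(f"""
        INSERT OVERWRITE TABLE {table}_anonymized PARTITION (ds = '{ds}')
        SELECT {select_list}
        FROM {table}
        WHERE ds = '{ds}'
    """)
```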

Question: I just want to ask a question about idempotency. It's really important in pipelines to do things once. Do you have any comments on idempotency and email pipelines? Sometimes you build an aggregation, you're sending data out to build a report, and then you need to hit a mailing list, and you want to make sure somehow that you only send it once. Real-life story: we were sending out an email to a few thousand people a day. We had a problem with our OpenTSDB client, which we were accessing right after

the email was sent out; there was a failure with the TSDB metrics, and as a result the job was failing in our version of Airflow, and so we kept re-sending it.

Answer: Right. So one thing is that sending an email to someone is not, in itself, an idempotent operation. Maybe the function that sends the email needs to have some sort of memory and a UUID, a unique identifier, so that the email-sending

function would say something like: "Did I ever send this email before, with this UUID? If so, do not send it again." That gives you an idempotent function that sends the email only if it was never sent before, and somehow you need to keep track, in memory or in storage, of what you have and haven't sent in the past.
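
A minimal sketch of that idea (the in-memory set stands in for whatever durable store you'd actually use, and `deliver` is a placeholder for the real email client): give every logical email a deterministic identifier and record it once sent, so retries become no-ops.

```python
import hashlib

sent_ids: set = set()  # in practice, a durable store (a table, Redis, ...)

def email_uid(recipient: str, report_date: str) -> str:
    """Deterministic identifier for 'this report, to this person, for this date'."""
    return hashlib.sha256(f"{recipient}|{report_date}".encode()).hexdigest()

def send_report_once(recipient: str, report_date: str, body: str) -> None:
    """Safe to retry: if this exact email was already sent, do nothing."""
    uid = email_uid(recipient, report_date)
    if uid in sent_ids:
        return
    # deliver(recipient, body)  # placeholder for the real send call
    sent_ids.add(uid)           # record only after a successful send
```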
