About the talk
In this session, learn how to build a secure and automated data lake using AWS Lake Formation. Also learn how to set up periodic sales data and ingest into the data lake, build automated transformations, and generate sales forecasts from the transformed data using AI. If you're a developer, DBA, or a data engineer who works with data, this session is for you.
Learn more about AWS at - https://amzn.to/2C6q0qN
More AWS videos http://bit.ly/2O3zS75
More AWS events videos http://bit.ly/316g9t4
#AWS #AWSSummit #AWSEvents
Over the course of my career, I’ve coded software, designed high-performance and mission-critical systems, and liaised with business stakeholders to deliver value through technology. I’m a strong believer that technology is always an enabler of some real-world value, so the passion to adopt the latest technology must come with a pragmatic desire to deliver value with it. I also believe that cloud and open source have changed the world forever, and that adoption of both is not only inevitable but necessary for the survival of organisations.
Hi folks, my name is Sarah Jeffery, and I'm a Solutions Architect at AWS. Today we're going to dive deep into AWS Lake Formation. Here's what we're going to look at. We'll begin with why AWS Lake Formation exists as a service and what problems it solves. We'll look at the ingest and transform capabilities within AWS Lake Formation. We'll look at security and access control. We'll look at data discovery and collaboration. And finally, we'll look at auditing and monitoring capabilities within AWS Lake Formation. So, to really understand why AWS Lake Formation
exists as a service and the problem that it solves, let's look at it through the lens of a real business use case. Let's say that you're a data engineer in a retail organization, and the general manager of your retail business has asked you to build a modern sales forecasting capability. When you think about this problem, the first thing you need to do is identify the data sources that have the data you need for that particular use case. In this particular case, you've identified a relational database that has the sales history that you need. Okay,
next, you need to start building transformation logic that will take the raw data and transform it in a way that can be used for the purposes of forecasting. And finally, because you're operating on the AWS platform, you realize that you have access to a range of AI services that may be able to deliver a much better outcome. In particular, you've identified Amazon Forecast, which can give you a more accurate and better outcome for this use case. Let's break this problem down further into technical components. The first thing you need to do is set up your data lake, in
particular, setting up your S3 buckets. Once you've done that, you need to start ingesting data from your source database into your data lake, and you need to build an ingest process that can be run repeatedly, robustly, and resiliently. Once you've done that, you need to start to cleanse, prep, and transform the data so that it can be used by Amazon Forecast. Once you've done that, you need to start thinking about security and access control: who has access to what data, and what level of access do they have? And how do you manage that access going forward?
That needs to be easy and scalable. And finally, you need to make that data available for collaboration for all of your users, so that they can discover each other's data sets and deliver valuable insights for your business. All of these things are core capabilities that you need in order to deliver any data lake project, but at the same time they can take months to deliver, taking valuable time away from what your business really cares about, which in this case is an end-to-end forecasting capability. And this is where AWS Lake Formation really comes in and gives you these
capabilities out of the box so that you can focus on delivering the outcome for your business. Let's look at each capability in a bit more detail, starting with ingest and register. Let's go into a demo. I'm logged in to the AWS console here, and the first thing I want to show you is the S3 console. You can see that I've created three buckets here: the landing bucket, a processed bucket, and a published bucket. The landing bucket is where the raw data is going to be ingested into and is going to land.
The processed bucket is where the transformed data is going to be placed from landing. And then finally, the published bucket is where the results of Amazon Forecast are going to be exported and published, and that's where the end users will draw their reporting and visualizations from. Now, if I open up the landing bucket, I notice that there are several subfolders within it, and each of these folders represents data sets for the different business units that you have in your organization. The project that you're dealing with is scoped only to the retail domain, and its data
will be within that particular folder, and you want to make sure that when you set up your data lake, the access control and governance is narrowed to that particular scope. Okay, so going back to the console, I'm going to go ahead and open up the AWS Lake Formation console. I'll click on Dashboard, and I can see that there are three stages that I need to complete. The first stage is to register the S3 locations that I created earlier, to make them part of my data lake. I'll go ahead and register my landing bucket.
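The registration step is done through the console in the demo. As a rough sketch of the same idea in code, the boto3 Lake Formation client has a `register_resource` call that takes an S3 ARN, which can include a folder prefix to narrow the scope. The bucket and folder names in the usage note are placeholders, not the names used in the talk, and using the service-linked role is an assumption.

```python
def s3_location_arn(bucket: str, prefix: str = "") -> str:
    """Build the S3 ARN for a bucket, optionally narrowed to a folder prefix."""
    prefix = prefix.strip("/")
    if prefix:
        return f"arn:aws:s3:::{bucket}/{prefix}"
    return f"arn:aws:s3:::{bucket}"


def register_location(bucket: str, prefix: str = "", client=None) -> str:
    """Register an S3 location with Lake Formation and return the ARN used."""
    if client is None:
        # Imported lazily so the ARN helper can be used without AWS access.
        import boto3
        client = boto3.client("lakeformation")
    arn = s3_location_arn(bucket, prefix)
    # Assumption: the Lake Formation service-linked role manages the location.
    client.register_resource(ResourceArn=arn, UseServiceLinkedRole=True)
    return arn
```

Calling `register_location("example-landing-bucket", "retail")` would register only the `retail` folder of that hypothetical bucket, which keeps governance scoped to one business unit's data.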
In particular, I will narrow it down to the retail folder so that the Lake Formation scope is confined to the retail folder. And I'll go ahead and click Register. Okay, now you can see that that particular location is registered with AWS Lake Formation for the purposes of security and governance. I've gone ahead and pre-registered the published and the processed locations as well. Okay. Next, I'm going to create a database on top of my data lake location. A database will allow me to create logical
constructs, like tables, on my data lake, so that my users can easily access and query it using SQL. So I'm going to go ahead and create a landing DB, and it will point to the data lake location for the landing bucket. I'll give it a description, and I'll click Create. Okay, so now you can see that I've created a landing database pointing to the landing S3 location. I've gone ahead and pre-created one for processed and another one for published as well. Okay, finally, I need to grant access on these databases to an IAM role that
will be assumed by the various jobs, the transformation jobs and ingest jobs, that will run to bring data into the data lake and transform it. In this particular case, I have an IAM role for the workflows, and I'm going to give it access on the landing DB. I want the access to include create table, alter, and drop, so I'm going to give it Super. Super is a combination, or union, of all of these access levels. And as you can see, that role has now got access to the landing DB,
and I've gone ahead and pre-granted access on the other two databases to this particular IAM role as well. Okay. So now I'm set up with my data lake. The next thing I want to do is bring data into it, or ingest data. So I'm going to use a built-in capability in AWS Lake Formation for ingest, called a blueprint. Blueprints give me the option of connecting to a relational database such as MySQL, Postgres, Oracle, or Microsoft SQL Server. I have two options here: do a snapshot or an incremental ingest. I'm going to choose an incremental
ingest, which means that it will bring in the first snapshot and then subsequently only bring in the delta changes. I'm going to select a connection that I created earlier in AWS Glue, which allows me to encapsulate the credentials beforehand so I don't have to type them in every time. I'll now provide the name of my database, which is the sales DB, my source database with the sales history, and the name of the table, which is called sales. Next, I'll provide the name of the column that this blueprint will use as a bookmark to identify what it has
already ingested and what the new delta changes are. Then I'll select the target location for where the data is going to go, and that'll be my data lake database called landing DB, with the format being Parquet. I have the option of running it on a custom schedule with a cron expression, but in this case I'm going to run it daily. Then I'll give it a workflow name, select the IAM role that I previously granted permissions to, and provide a name prefix for the tables that this blueprint will create.
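To make the bookmark idea concrete, here is a minimal sketch of an incremental ingest against an in-memory SQLite table. It is not the blueprint's actual implementation: the `fetch_delta` helper, the `sales` schema, and the `invoice_date` bookmark column are all assumptions made for illustration.

```python
import sqlite3


def fetch_delta(conn, table, bookmark_column, last_bookmark):
    """Return rows added since last_bookmark, plus the updated bookmark."""
    # Identifiers cannot be bound as parameters; they are assumed trusted here.
    query = f"SELECT * FROM {table}"
    params = ()
    if last_bookmark is not None:
        query += f" WHERE {bookmark_column} > ?"
        params = (last_bookmark,)
    query += f" ORDER BY {bookmark_column}"
    cursor = conn.execute(query, params)
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    if rows:
        # The largest bookmark value seen becomes the starting point next time.
        last_bookmark = rows[-1][columns.index(bookmark_column)]
    return rows, last_bookmark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (invoice_date TEXT, item_id TEXT, demand INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("2020-01-01", "A", 5), ("2020-01-02", "B", 3)],
)

# First run: no bookmark yet, so this is the full snapshot.
snapshot, bookmark = fetch_delta(conn, "sales", "invoice_date", None)

# Later run: only rows past the bookmark come back.
conn.execute("INSERT INTO sales VALUES ('2020-01-03', 'A', 7)")
delta, bookmark = fetch_delta(conn, "sales", "invoice_date", bookmark)
```

One design caveat of this sketch: the strict `>` comparison would skip a late-arriving row that shares the last bookmark value, so a column that only ever increases is the safer choice of bookmark.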
Okay, I'll hit Create. So what is this blueprint doing under the covers? It's actually creating an orchestrated workflow of individual steps that will be run to reliably, robustly, and repeatedly ingest data from your source database into your data lake. It's going to go ahead and create that workflow in the background. I'll just click Start. Okay, so it's now started. Let's actually have a look inside at what this workflow looks like. So let's click on its definition and click on the run
view. That takes me to the AWS Glue workflows console, and if I now open it up, you can see a runtime view of it executing. It started with the first step here, which is a pre-crawl. It'll do a few other crawls, then it'll do a concurrent ingest from your source database, and then it'll finish off with some post-crawl activities. All of that has been built without you having to write any code. Okay, the other thing I want to show you next is, under the covers, how AWS Lake
Formation actually ingests the data. I'm logged in to my SQL query tool here, and on my source database I've enabled database auditing, which will then monitor the SQL statements and queries, so that I can see the SQL statements that are being executed by the AWS Lake Formation blueprint. You can see that it's doing a SELECT * from the sales table, and it's using the invoice date column to determine what it has already ingested and what the new changes are. There's an important point that I want to talk about here, which is the method
through which it's ingesting data. Typically speaking, there are two ways that data can be ingested from your source relational database. One is to basically run a SELECT statement on your source database and bring the data out, which is what the AWS blueprint is doing here. The other option is to stream the changes in from your source relational database's redo log. There are pros and cons to both approaches. But the thing to be mindful of with this particular approach of doing a SELECT * on the table is that if you have a really busy database, and changes are coming in very
frequently, then there is a possibility that the SELECT statement may run in between a live transaction, and if that happens, you may end up missing the data for that particular transaction. It could also be that these SELECT statements come in and impact the performance of a really busy database. So if you're operating in those conditions, where the source database is really busy, then maybe think about using the redo log, asynchronous ingest approach, and in that case you can use DMS for that particular purpose. If you have access to read replicas,
or your database is not so busy, then an approach like this, which the AWS Lake Formation blueprint uses, can be quite a simple and robust way of bringing the data into your data lake. Okay, so now that we have that set up, it will run for a while. Let's look in a bit more detail at what AWS Glue workflows are and how you can use them for building your use case. AWS Glue workflows allow you to stitch together individual Glue jobs, crawlers, and conditional logic to be able to deliver a complex outcome.
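As a hedged illustration of stitching a job and a crawler together with conditional logic, the sketch below builds the request parameters for the boto3 Glue `create_workflow` and `create_trigger` calls. The helper names and the trigger naming scheme are invented for this example, and the job and crawler are assumed to exist already.

```python
def start_trigger(workflow: str, job_name: str) -> dict:
    """On-demand trigger that starts the workflow with a Glue job."""
    return {
        "Name": f"{workflow}-start",
        "WorkflowName": workflow,
        "Type": "ON_DEMAND",
        "Actions": [{"JobName": job_name}],
    }


def on_success_trigger(workflow: str, after_job: str, then_crawler: str) -> dict:
    """Conditional trigger: run a crawler only after the job has succeeded."""
    return {
        "Name": f"{workflow}-after-{after_job}",
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": after_job, "State": "SUCCEEDED"}
            ]
        },
        "Actions": [{"CrawlerName": then_crawler}],
    }


def create_pipeline(workflow: str, job_name: str, crawler_name: str, client=None) -> list:
    """Create the workflow and its two triggers; returns the trigger definitions."""
    if client is None:
        # Imported lazily so the builders above work without AWS access.
        import boto3
        client = boto3.client("glue")
    client.create_workflow(Name=workflow)
    triggers = [
        start_trigger(workflow, job_name),
        on_success_trigger(workflow, job_name, crawler_name),
    ]
    for trigger in triggers:
        client.create_trigger(**trigger)
    return triggers
```

Keeping the trigger definitions as plain dictionaries means the chain can be inspected or unit tested before anything is created in an account; further steps would be added as more conditional triggers.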
So, in your case, because you're building an end-to-end forecasting flow, you'll be using AWS Glue workflows to build your execution logic. This is what it would look like. Data will arrive in your S3 landing bucket, and that will trigger a Lambda to kick-start the AWS Glue workflow. The workflow will kick off the transform Glue job, so that it takes the raw data and transforms it into a format that's usable by the Amazon Forecast service. After that, the Glue workflow will kick off the three individual steps of Amazon Forecast, namely importing the data into the service, training
the predictor on the data, and then generating a forecast and exporting the results out to the published bucket. And finally, the workflow will kick off a crawler to run on the exported forecast data set so that it's queryable with SQL in Amazon Athena and can be visualized in Amazon QuickSight. So this is what your solution looks like. You have the three buckets, landing, processed, and published, and their security and governance is managed through AWS Lake Formation. You have the AWS Lake Formation blueprint that's ingesting data from the relational data source and
into the landing bucket on a regular basis. You have the AWS Glue workflow that is orchestrating the rest of the automation for your Amazon Forecast forecasting process, starting from transforming the data, then kicking off the Amazon Forecast steps, then kicking off the crawler to run on the exported data set, and finally sending a notification to business stakeholders via Amazon SNS. Let's look at a demo to see how that actually works in the console. I'm going into the AWS Glue console again, clicking on Workflows,
and now I'll click on a custom workflow that I've created for this forecast demo. If I expand this workflow, you can see that I've created each individual step, starting with the transform step, then importing the data into Amazon Forecast, training the predictor, generating the forecast, and so on, until the workflow finishes, having run the end-to-end forecasting pipeline. So let's go ahead and kick off this particular workflow. And let's open up
its runtime view. As you can see, it's now kicked off the first step, which is the transformation of the data from landing to processed. It'll take approximately an hour or so to complete from an end-to-end perspective, so let's move on to the next capability, which is security and access control. AWS Lake Formation groups users into two categories. There are data lake admins, who run and operate the data lake, define and secure the storage, and generally optimize the data lake. And then there are data consumers
that basically generate insights, consume the data, and sometimes are the custodians of the data sets. So in your case, there are three personas that you need to consider. There's the data lake admin, which is you. Then there's the retail manager, who's the custodian of the data set. And then there are the analysts, who report to the retail manager and will be running reports and SQL queries on the data as well as building visualizations. Let's have a look at a demo of how you would apply security and access control for these personas
in the Lake Formation console. I'll click on Tables, and as you can see, there are two tables, orders and products, that have been created by the transformation job in the processed DB. I want to grant access on both of these tables to the retail manager. So I'm going to start by giving access on the orders table to the retail manager. I'll click on Grant, select the retail manager, and give them Select access on that table. I also want to give them a grantable permission, which basically means that they can then self-service in giving
anyone else access to this particular table. That way I'm getting out of the way, democratizing access, and enabling self-service for those who are the custodians of this data set, in this particular case the retail manager. Okay, now that I've granted that access, I'm going to log in as the retail manager using a Firefox container tab, which allows me to isolate user logins, and I'll open up the AWS Lake Formation console as the retail manager. When I click on Tables, I can see that I have access to these two
tables, orders and products, as well as their schema. You can see it has timestamp, item ID, demand, location, and customer name. Okay, so next I want to show you that I'm also logged in as the analyst user, again using another container tab. I'm going to open up AWS Lake Formation and Tables. Here I see that I don't have access to any of those tables just yet. So the retail manager is now going to self-service and grant access on that table to someone else, in particular the analyst user. They will go on and click Grant, select the analyst user, and give them
Select access. In addition to that, they want to restrict one particular column from the view of the analyst, in particular the customer name column, because it happens to be sensitive information. They don't want that information to be visible to the analyst, and it's not necessary for the analyst's work, so they exclude that column and click Grant. Okay, so now that's happened. If I now return to the analyst user's view and click refresh, I can see that there is an orders table available. And if I click on that orders table as the analyst user,
I can see that the schema only contains four columns, and in particular it's missing the customer name column. Let's go ahead and query the data of that particular table. If I run a query on the table using Athena, Athena will show only the four columns that this analyst has access to, even though they actually ran a SELECT * statement on that table. They remain unaware of the fact that there is a column named customer name, because they just don't have access
to it. I can also log in to a third-party SQL tool using the analyst user's credentials and try to query the same orders table that I have access to. If I query that table from this third-party tool, using the Athena driver in the background, I can again only see the four columns that I have access to as an analyst user. That's because Athena is the driver that this tool is using, and Athena natively integrates with Lake Formation to filter those columns out. Let's dive deep into how security
and access control works within AWS Lake Formation. When a user issues a query to Amazon Athena, it actually queries AWS Lake Formation for the credentials that the user has. Lake Formation looks up the access defined for that user and returns temporary credentials as well as a scoped-down IAM policy that represents the level of access the user actually has on the underlying data. Amazon Athena uses these credentials and the IAM policy to then access the underlying storage to retrieve the result sets that satisfy
the query that the user has issued. Then, before returning the results to the user, it identifies any columns that the user does not have access to, filters those columns out, and returns the results to the user. You can see that in this case you did not have to grant any direct S3 permissions to any of your users. All of that was managed dynamically by AWS Lake Formation and the use of scoped IAM policies. Okay, now let's look at data discovery and collaboration within AWS Lake Formation. So let's go back to the demo and see how we
implement it. I'm once again in the Lake Formation console, and I have these two tables. I want to enable easy discoverability of these tables by others. So I'm going to go ahead and edit the table, and I have the option of adding custom tags that attach a business meaning to this particular table. Because this table belongs to the retail domain, I'm going to add a tag called domain and set it to retail. And I'm also going to add another tag
called sensitivity, because it has some sensitive data, and set that tag to PII. And I'll hit Save. Okay. Now I'm going to open up the products table and edit that table as well, and I'm going to add a similar tag, domain equals retail. Okay. Now that I've added these tags, what that allows any user of the data lake to do is search the catalog for tables in a way that has meaning in the language of the business. Imagine when your data lake grows, and you might have
tens, hundreds, maybe even thousands of tables. How do users discover the data, and how do they collaborate? Say the marketing team wants to be able to use the retail domain data for cross-collaboration. How do they discover it? They would go into the catalog and search for all the tables that belong to the retail domain. So they type in domain equals retail, and when they hit Enter, they get visibility of the two tables that are available as part of the retail domain that they can use. Similarly, if somebody was
looking for any data that had PII information in it, they could search for the tag sensitivity equals PII and see only the table that has that particular tag. This really enables people and users to collaborate, find, and discover data, and make use of it in the way that best serves their purpose. Okay, so that's one aspect of collaboration, which is a rich data catalog. Another important aspect of data collaboration is being able to visualize the data. In your case, you'll be querying your data, and the exported forecast,
using Amazon Athena and SQL. Amazon Athena is an interactive serverless query engine that runs Presto, which is an open source distributed query engine, under the covers, and lets you query your data lake using standard SQL. Then, using Amazon QuickSight, you'll be able to connect Amazon Athena to a dashboard to visualize the data set, including the exported forecast, for your business stakeholders. If you have a pre-existing preference for a visualization tool of your choice, as long as that tool enables connectivity via JDBC or ODBC, it can also be used with
Amazon Athena. Okay, let's look at the monitoring and auditing capability within AWS Lake Formation. So let's go into the demo again. I'm logged in as a data lake admin in the Lake Formation console, and if I scroll down, I can see a list of all the actions that have been happening on my data lake, the different types of events, as well as the principals or users that have been requesting those events, and also the timestamp at which each of those access requests
was executed. If I open up a particular event, like the data access that was executed by the analyst user, and click View event, you can see that it provides a very comprehensive set of information about that particular event: things like who the user was that requested that particular event, what access key ID they used, the timestamp of the event itself, the type of event that it was, whether it was a data access, the tool that they used to execute that event, the type
of permissions that they had at the time of executing that particular request, as well as the table that they actually accessed. You can see that it's a rich set of information that you can integrate into any of your tooling for security information and event management, such as a SIEM, or any other automation that you build to monitor your access. This information is also integrated with AWS CloudTrail. So if I go into CloudTrail and look at the dashboard there, I can see that these events are
also being published in CloudTrail. If I go into all events and search for a particular event name, in this case GetDataAccess, I can see the same information in CloudTrail that I saw in AWS Lake Formation. You can build automation on the back of this so that you can create alarms, alerts, or notifications based on the type of event that occurred, the type of user that requested that event, or any other condition that you choose. That rounds off monitoring and auditing using a single pane of glass within AWS Lake Formation.
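As one possible starting point for that kind of automation, the sketch below looks up events by name with the boto3 CloudTrail `lookup_events` call and flags those made by users on a watch list. The `GetDataAccess` default, the helper names, and the `watched_users` idea are assumptions for illustration rather than anything shown in the demo.

```python
import json


def summarize_events(events, watched_users=()):
    """Reduce CloudTrail lookup results to the fields worth alerting on."""
    summaries = []
    for event in events:
        # The full record arrives as a JSON string under "CloudTrailEvent".
        detail = json.loads(event.get("CloudTrailEvent", "{}"))
        user = event.get("Username", "unknown")
        summaries.append(
            {
                "event": event.get("EventName"),
                "user": user,
                "time": event.get("EventTime"),
                "source_ip": detail.get("sourceIPAddress"),
                "alert": user in watched_users,
            }
        )
    return summaries


def recent_events(event_name="GetDataAccess", client=None, max_results=50):
    """Fetch recent CloudTrail events with the given event name."""
    if client is None:
        # Imported lazily so summarize_events can run without AWS access.
        import boto3
        client = boto3.client("cloudtrail")
    response = client.lookup_events(
        LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": event_name}],
        MaxResults=max_results,
    )
    return response.get("Events", [])
```

A scheduled job could call `summarize_events(recent_events(), watched_users={"analyst"})` and forward any entry with `alert` set to a notification channel of your choice.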
Thanks for joining the session, folks. I really hope that you enjoyed it, and I look forward to you using AWS Lake Formation to build your use case. Please also remember to fill out the survey for the session. Thank you very much.