Secure Large-Scale Data Analysis and AI with Polyglot Data Sources By Tushar Katarki & Buck Woody


About speakers

Tushar Katarki
Product Management at Red Hat
Buck Woody
Data Scientist at Microsoft

Tushar Katarki is a senior technology professional with experience in cloud architecture, product management, and engineering. He is currently at Red Hat as a product manager for OpenShift, with a focus on AI/ML, and advocates for the use of containers and Kubernetes for AI/ML workflows. Prior to his current role, Tushar was a product manager and developer at Red Hat, Oracle (Sun Microsystems), Polycom, Sycamore Networks, and Percona. Tushar has an MBA from Babson College and an MS in Computer Science from the University at Buffalo.


Buck Woody works on the Azure Data Services team at Microsoft, using data and technology to solve business and science problems. With over 35 years of professional and practical experience in computer technology, he is a popular speaker at conferences around the world. The author of over 700 articles and eight books on databases, machine learning, and R, he sits on data science boards at two US universities and specializes in advanced data analysis techniques. He is passionate about mentoring and growing the next generation of data professionals.


About the talk



We'll be monitoring the Q&A, so if you have any questions, you can just post those right in there. We're going to stay right on time today; we've got a lot to get through. I want to talk a little bit about secure, large-scale data analysis and AI with polyglot data sources. What does that even mean? We'll talk about that in a moment. The technologies we're going to focus on today are Big Data Clusters from Microsoft running on Red Hat OpenShift, and that's why we have both of us here. So let's just dive right in.

Whenever we think about processing workloads for machine learning or artificial intelligence, we think about things at scale, and whenever we talk about data at scale we can no longer just scale up. You can only make a system so large before you run into speed-of-light problems, so you have to break apart compute and data; those things have to scale out. That brings on a new paradigm, and we get into issues with the CAP theorem, ACID databases, and so on. And really, the reason we have this problem, which isn't really a problem, it's more of an opportunity, is that data volumes have gotten larger and data sources have grown exponentially.

What's allowed us to get all this data is that it has more places to come from: smart watches, phones, and so on; a lot of data that we're creating all the time. And not only that, we've got a cheaper, easier way to store it. Many years ago hard drives were very expensive, but now storage is very inexpensive and it's very easy to collect this data. We've got all kinds of new technologies that allow us to do that. It's a little like that drawer in your kitchen where you put all this stuff when you first move into your house, and now you can't even get the drawer open; you don't even know what's in that drawer anymore.

It's the same kind of thing with data for companies: they're often not using it, but they should be, and there are reasons they should begin to use it all over the place. There are many applications for this; we'll zero in on one here as a representation. It isn't the only place, but take financial technologies, for instance: banks, loan operations, credit card companies, and so on. They need large amounts of data, they need high-speed access to that data, and they need to do deep analytical applications across it, but they have a very high security footprint, and many times government regulations and so on.

So let's pick a bank, for instance. We have the IT department, which needs a scalable architecture that they can even check into source control. We also have the data engineering team, and they need to bring in data from multiple sources; this becomes that polyglot persistence pattern we talked about, which is being able to store and access data in multiple systems, not just a relational database or just a non-relational database. We also think about the line of business.

They need to do their standard OLTP, online transaction processing, and they also need to be able to do online analytical processing, OLAP. Together, by the way, the new term for these things is HTAP, for hybrid transaction/analytical processing; lots of big new words for these sorts of things. All right, so the line of business normally wants to do the day-to-day operations and work that they do, but they also want to be able to do reporting and cubes and analysis and so on, so they need semi-structured, structured, and unstructured data.

And of course now we have our data scientist, and she needs to be able to create models. She needs to be able to do her experimentation, check her models into source control and so on, and then she needs to deploy those models out for what we call scoring, to use them to do the predictions. Well, how do we do this? And those aren't the only people we need to worry about; we also need to worry about the customer who's going to be the recipient of this in a bank, for instance.

Not only do we have our data engineers bringing in transactions from ATMs everywhere, a line of business doing the daily transactions that you do and reporting on that, and the data scientist wondering which sources of data to use to build a model, but we also want to notify the cash register or the ATM that a fraudulent transaction may be going on right now. We want to be able to do that kind of predictive analytics. How do we do that? We know we've got large sets of data and we know we've got things we can do with it; this is just an example. In any case, we have two things for scale that we need to think about: we need to scale our computing platform and we need to scale the data platform.

Lots of technology has come around to do this, but there are some specific needs we want to keep in mind. On the platform side, we want to make sure that we have a declarative infrastructure. What that means is I'd actually like a series of text files, if I could get them, in a single place, where I could simply declare what I want and the system just does that. Before, we would build computers and screw them into racks and so on; I don't want to do that now. I want those computers to be lying there, and I want to tell them "work this way" and have them do that.

We need to be able to take it to the cloud or on premises or both; I don't want to have to change anything to run on Microsoft's cloud, Red Hat's cloud, Google, Amazon, wherever I want to be able to run. I also want to be able to run that same infrastructure locally. I need small units of processing so I can stand them up and tear them down, and there are other considerations such as DevOps and so on; quite a few requirements on our platform here.
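As a rough illustration of that "declare what I want and the platform does it" idea, here is a hedged sketch using the Kubernetes Python client (OpenShift builds on Kubernetes); the deployment name, namespace, image, and replica count are placeholders for illustration, not anything shown in the talk.

```python
# Hypothetical sketch of declarative infrastructure: describe the desired
# state (three replicas of a service) and let the cluster reconcile toward it.
# Names, namespace, and image are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="scoring-service"),
    spec=client.V1DeploymentSpec(
        replicas=3,  # declared desired state; the platform keeps it true
        selector=client.V1LabelSelector(match_labels={"app": "scoring-service"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "scoring-service"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="scoring",
                        image="registry.example.com/bank/scoring:1.0",
                        ports=[client.V1ContainerPort(container_port=8080)],
                    )
                ]
            ),
        ),
    ),
)

# The same declaration works against a cloud cluster or an on-premises one.
client.AppsV1Api().create_namespaced_deployment(namespace="bank", body=deployment)
```

The same text-based declaration can be checked into source control, which is the point the talk makes about the IT department's requirements.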

And on the database side, I want HTAP; I don't want just OLTP and I don't want just OLAP, I want both at the same time. I also want to be able to do scale-out processing; I want to put compute over storage if I can, things like Spark and HDFS and so on, and I want to be able to talk to lots of kinds of data. I don't want to have to import data all the time just to be able to use it. I also need full security, and I want the same security across everything. I also want DevOps; I need to check in my database schemas and my models and so on, and I want relational data to work with non-relational data. So this is a very tall order, Tushar.

Why don't you come off mute, and let's tackle the first part here, the platform. Tell me more about what Red Hat OpenShift provides. Thank you, Buck. OpenShift offers access to data and integration with a DevOps platform, as you already know. If you start with data engineering, for example, you want high-throughput, low-latency, cheap, and secure access to different kinds of storage technologies, and a medium to access that data.

You might want GPUs at scale and access to high-speed networking. You want frameworks like TensorFlow, and an IDE on demand. Then there are the questions of how do I build my artifacts, how do I deploy them in a convenient way, how do I automate that and connect it to CI/CD, and all of this. Rather than put that into a video, I'll just point out that you can try it yourself: you can go to learn.openshift.com.

There's a tutorial there specifically on machine learning. You can go into the tutorial itself and follow along with what we'd be doing, and you'll see how you would launch a Jupyter notebook from a template.

Then you can start working in the notebook, importing data, and you can see how you can use OpenShift at openshift.com. You can do it today, if you choose, to get an idea of what we're talking about. Well, now we see we have a platform we can scale on, and it can do lots of things, and one of the things we've done at Microsoft is we've used that same platform. So we stand up Red Hat OpenShift, and we have this, as I mentioned, both in the cloud and on-premises, and both work exactly the same way. I've abstracted a great deal here; there's more information than I'll obviously show on the screen, but I will give you a full resource for those sorts of things.

Basically, we provide for you a full SQL Server system, and it looks just like SQL Server always has, but it's running on Linux. There are no changes we made to the engine or anything else; it's just regular, everyday SQL Server, so all those applications will keep working. But we also provide a full series of SQL Servers that you don't have to talk to; the primary one will do that for you, for things like OLAP, if you want to scale into terabytes or a petabyte of data and do those sorts of things.

You have that relational storage, but you also want non-relational storage; you'd like to work with Spark and HDFS. So in the same deployment, we automatically deploy Spark, HDFS, and another SQL Server that can talk directly to HDFS as if it were a database, which is kind of interesting. Then you can deploy applications all inside the same security boundary, and you can see that there's AI scoring for this area as well. Now, this is the infrastructure itself, but let's talk a little bit more about how you would actually deploy a real example of this.

Good to have you here. We also have Red Hat OpenShift showing up here, which we've just stood up; this is our bank example, and we're just going to go ahead and let everybody do their specific work. The data engineer can now bring data directly into the data pool or the storage pool and work with that. The OLTP person doing their work can do their reporting and their transaction processing, and they can access that data that's in HDFS. And then the data scientist can create her model, use polyglot sources to do that, and then deploy that out to an application pool.

Now I'm going to pop up the last screen; we've really only got a couple more minutes here, but I'm going to let you see this link. If you want, you can take a screenshot of that link right there: very small, very simple, and we've got several workshops there on SQL Server Big Data Clusters. In the moments we have left, let me pop over and show you. I'm using Azure Data Studio, which is a multi-platform tool that works on Linux, Windows, and Mac; it looks exactly the same on each, and it's free. I've connected to my big data cluster, which as you can see has various databases; it's just SQL Server, but it also has access to HDFS.

So I'm going to hide that to get a little screen real estate here, and let's take a look at each of our personas. Here's our data engineer, and they want to be able to read credit card data from a web store. So I'm just going to use PySpark, that's Python on Spark, inside the storage pool I mentioned a moment ago, read that data in, take a little look at it and see what it looks like, maybe grab it as a DataFrame and print the schema.

Then all I'm going to do is write it out; I'll write it right out to the transaction credit card location. So now I have lots and lots of files in transaction credit card here that I've written back out just using standard Spark calls. That's my data engineer; they live in this world of Spark and HDFS and so on, just like they always have.
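Here is a minimal sketch of what that data engineer step might look like in PySpark; the paths, file format, and options are assumptions for illustration, not the exact demo code.

```python
# Illustrative PySpark sketch (assumed paths and CSV format) of the data
# engineer step: read raw credit card data from HDFS, inspect it, and write
# it back out for the other personas to use.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-credit-card-data").getOrCreate()

# Read the raw files from the storage pool (HDFS); the path is a placeholder.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/raw/web-store/credit-card/"))

# Take a quick look at the data and its inferred schema.
df.show(10)
df.printSchema()

# Write it back out; the talk just says "standard Spark calls", Parquet is assumed here.
df.write.mode("overwrite").parquet("/data/transaction_creditcard/")
```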

Now what we'll do is have the DBA, the database administrator, begin to make some applications, and here's what's kind of interesting: using only a couple of connection strings, they can instantly hit that HDFS and create an external table. What this means is the data won't be brought in; it'll merely be referred to, and the query will actually run over in HDFS, not in SQL Server. However, I can treat it just like it's a regular SQL Server table, so now I can do anything I need to do in SQL against that table just as if it were local. I could join that to Oracle, to Teradata, to DB2, to AWS S3 storage, and so on, and even join them using standard constructs.
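As a rough illustration of that external table idea, here is a hedged sketch that sends PolyBase-style T-SQL to the SQL Server master instance from Python with pyodbc; the connection details, data source and file format objects, table name, columns, and paths are all placeholders, not the exact objects from the demo.

```python
# Hypothetical sketch: define an external table over the files the data
# engineer wrote to HDFS, then query it like a local table. Object names,
# columns, and locations are illustrative assumptions.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<bdc-sql-master-endpoint>;DATABASE=sales;UID=<user>;PWD=<password>"
)
cur = conn.cursor()

# Assumes an external data source and file format have already been created
# by the DBA; the query over this table is pushed down to the storage pool
# rather than importing the data into SQL Server.
cur.execute("""
CREATE EXTERNAL TABLE dbo.transaction_creditcard_hdfs (
    transaction_id BIGINT,
    card_number    VARCHAR(32),
    amount         DECIMAL(18, 2)
)
WITH (
    DATA_SOURCE = SqlStoragePool,            -- assumed external data source
    LOCATION    = '/data/transaction_creditcard',
    FILE_FORMAT = parquet_file               -- assumed external file format
);
""")

# Join the HDFS-backed table to an ordinary relational table with plain T-SQL.
cur.execute("""
SELECT TOP 10 c.customer_name, t.amount
FROM dbo.customers AS c
JOIN dbo.transaction_creditcard_hdfs AS t
  ON t.card_number = c.card_number;
""")
print(cur.fetchall())
```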

When I'm all done, maybe what I'd like to do is create that machine learning model. So I'm going to try and see if a neural network will beat a classical model. I bring in some things in Python here; I just use sklearn, pull the data down, and do a very, very simple model. I create that, I get my confusion matrix, and I take a look and see how I did on my accuracy and precision and recall. Then I try a neural network and run it through about 1,200 epochs; this is a very small data set.

And in this data set, with this particular data, my experimentation shows that while the accuracy of the neural network was a bit better, its loss was far higher, so in this case classical machine learning is the right choice. So she does her experiment, she saves her model, and then she persists it out, using pickle or whatever else you'd like.
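For context, here is a small, hedged sketch of that kind of experiment with sklearn; the data source, feature names, model choices, and train/test split are assumptions standing in for whatever the actual demo notebook used.

```python
# Illustrative sklearn experiment: compare a classical model against a small
# neural network on the same data, then persist the chosen model with pickle.
# Data file, feature names, and hyperparameters are assumptions.
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder: load the labeled transactions prepared earlier.
data = pd.read_csv("transactions_labeled.csv")
X = data[["amount", "merchant_risk", "hour_of_day"]]
y = data["is_fraud"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Classical model.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test)))
print(classification_report(y_test, clf.predict(X_test)))  # accuracy, precision, recall

# Small neural network, roughly the "1,200 epochs" mentioned in the talk.
nn = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1200).fit(X_train, y_train)
print(confusion_matrix(y_test, nn.predict(X_test)))
print(classification_report(y_test, nn.predict(X_test)))

# Persist whichever model the experiment favors.
with open("fraud_model.pkl", "wb") as f:
    pickle.dump(clf, f)
```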

To see what this gives us, let me pop back over to our architecture screen. It gives us the ability to have a secure, scalable, defined infrastructure that can run in the cloud or on premises and gets us all types of data in a single data hub. We've got about 10 or 15 seconds; I'm not sure if there are any questions in the Q&A there, Tushar, if you could handle those. We really appreciate your time today, and I appreciate you listening to us, and we can go to questions if we have any. Do we have any questions? I don't see any, so maybe you're free to hit up this site, where there are a couple of full workshops you can take. In fact, perhaps in the moments we have left I can even bring that up for us. You can see my screen there: aka.ms/sqlworkshops is the name of that location.

When you go to that, here's what you'll see; let me give us a little better resolution, I'm a little bit older, so it's hard to see. You'll notice in our SQL data platform we have a few things here: we've got some workshops that you can see, and we've also got some labs and so on. We've got SQL Server 2019 on OpenShift, but we also have SQL Server Big Data Clusters, and when you click on these links there's a full GitHub repository that my team has written for you, with everything you need to know. If you're new to Linux, or to Kubernetes or OpenShift, then just as Tushar showed you, there are lots of different places you can go learn.

We give you a little scenario; this particular one is a point-of-sale system. And if you don't know Linux, I've got lots of resources, free, to go learn that; all of this is free to learn, and it takes about nine hours to work through the entire course. So I think we're pretty much close to time. Lena, I'm going to turn it back over to you, and if there are any other questions or anything else that we need to cover, please let us know; unless, Tushar, you've got some closing comments.

Yeah, right. We have five more minutes, so you can share something if there are no questions. In the meantime, from my point of view, what I've been hearing about continuous integration and continuous delivery for data scientists keeps coming back to two or three major points. One is the topic of: I need access to public and private data sources, how do I make that repeatable?

It shouldn't be that I submit a service desk ticket and it takes six months for me to get access to that data. How do we automate it? And I think you can make that happen if you have this running on top of OpenShift. We have OpenShift that we host, or one that we run as a managed service together with Microsoft, and you can go there, get the right image, and get your code running; the monitoring I mentioned, for example, is there as well.

Basically, everything is in these two boxes. If you get OpenShift, you've got the secure application infrastructure that you need to do not only what we're doing here, but much more, and you've got the consistency to do that whether it's on premises or in the cloud. Then when you install SQL Server Big Data Clusters, as you can see, you've got OLTP, OLAP, non-relational data, and data at scale. You're also able, with the PolyBase feature, to reach out and touch the data directly without importing it.

That really makes it a polyglot system, so that you're able to talk to any of your data, either through Spark or through Transact-SQL, or both. As we showed in the example, you can lay down files from Spark that you can then read inside Transact-SQL, so it's the best of all worlds. You can see the entry points for each one of these particular audiences here. So I think we are at time at this point. Lena, anything we need to do to close out?
