Duration 29:58

Platform Oriented Data Architecture for Acceler... By Vladimir Bacvanski, Principal Architect, PayPal

Vladimir Bacvanski
Principal Architect at PayPal

About speaker

Vladimir Bacvanski
Principal Architect at PayPal

Dr. Vladimir Bacvanski is a Principal Architect with Strategic Architecture at PayPal. His work spans Data Platforms, Privacy, and Developer Experience as well as the introduction of Advanced Technologies. Before joining PayPal, Vladimir was the CTO and a founder of a custom development and consulting firm and has advised and worked with clients ranging from high-tech startups to financial and government organizations. Vladimir is the author of the popular O'Reilly course "Introduction to Big Data" and a coauthor of the O'Reilly course on Kafka. Vladimir received his PhD degree in Computer Science from RWTH Aachen in Germany.


About the talk

To embrace productive data architecture, many organizations introduce a centralized data lake fed by operational data stores and data streams. However, that solution introduces problems of quality, governance, and delayed access. This hampers agility, as consumers such as ML and AI teams must deal with inferior-quality data that is difficult to use and combine due to a lack of standards and governance. Recently, a new form of data architecture has appeared, centered around the intersection of platforms, domains, and data. In this approach, influenced by Domain-Driven Design (DDD), the team that owns the domain owns not only the operational stores but also the domain-specific area of the data lake, and it handles the quality and governance of the data in its area, resulting in a Data Product. Consumers can then rapidly combine high-quality data from various sources, resulting in accelerated insights. We will discuss the architectural, organizational, and technical underpinnings of Platform Oriented Data Architecture, based on a set of common data infrastructure services.


All right, thank you for the introduction. You have my contact details here. One of the things that is always at the top of my mind is how we develop better software, and how we deal with the data produced by all of that software. This naturally touches on the areas of both software architecture and data, and traditionally there is a cultural gap between the two; they need to merge. In this talk we are going to look into some of the opportunities and challenges in making this integration of software and data better, with the end result that we are able to consume the data created by our various platforms more effectively, bringing agility to the organizations that process data.

Even though this is a talk about data, we are actually going to start with software. We will move from services to platforms, explore how data lakes came to be and the friction that exists when processing data in large organizations, and then look at how to introduce a proper relationship between software and data, discussing the approach that is becoming known as the data mesh.

If you have been following software development, you will not have escaped the notion of domain-driven design. Domain-driven design is an approach to development based first on understanding the problem domain, through something known as the ubiquitous language. You build a model focused on the entities that exist in the problem domain and look at their properties and relationships, so you end up with something like a logical model that describes the problem. Then, in development, you look into the problem domain, identify the key responsibilities, and assign them to your software components; that is how you organize your components and services. The question is where to put the boundary around them, and this is what is called a bounded context. A bounded context is a self-enclosed area that represents one important part of the business: dealing with customer management would be one bounded context, shopping another, marketing a third. All of that leads to systems that are built more effectively and, more importantly, you get a great resemblance: your software structure now resembles the structure of your problem domain.
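As a minimal illustrative sketch of the bounded-context idea (all names here are hypothetical, not from the talk): each context owns its own model of the business, with only the attributes its part of the domain cares about, linked by a shared identifier.

```python
# Two bounded contexts, each owning its own view of "the customer".
from dataclasses import dataclass


# --- Bounded context: customer management ---
@dataclass
class Customer:
    customer_id: str
    first_name: str
    last_name: str
    email: str


# --- Bounded context: shopping ---
@dataclass
class Shopper:
    customer_id: str       # shared identifier links the contexts
    cart_items: list       # shopping cares about carts, not emails

    def add_to_cart(self, item: str) -> None:
        self.cart_items.append(item)


shopper = Shopper(customer_id="c-42", cart_items=[])
shopper.add_to_cart("book")
```

Note that neither context depends on the other's internals; the software structure mirrors the structure of the problem domain.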

Add some developments in technology, and such systems come to be expressed as collections of services. Services expose their capabilities through interfaces: in the diagrams, the small circle symbol indicates an interface offering functionality for other services to call, and the half circle is what the service itself calls by reaching out to other services. The dominant approach to building such sets of services is known as the microservice architecture, and in it you have a clear notion of responsibilities.

But when we look at these services and groups of services, the question becomes: what is the mechanism that allows us to combine them in such a way that we provide more value to end users and enable them to combine capabilities more effectively? That is the concept of a platform. A platform is a collection of services that work together and provide something useful for developers. The important direction in thinking beyond plain services is that platforms still provide APIs that are called by clients, but there is also the SPI, the service provider interface, through which a platform can reach out and call something outside while keeping its implementation hidden.

There are three additional aspects of platforms. One is easy connection: we pay attention that capabilities are exposed in a way customers can easily understand, so they can combine the services and build on them. Second, the services that are exposed need to be attractive. We say that certain platforms have gravity, and you see this in commercial ecosystems: software-as-a-service products that provide exceptional value to their users have strong gravity, and users gravitate toward them. Finally, platforms need to provide the notion of flow, the flow of value exchanges, which indicates how easy it is to combine the services of the platform to build a new product or create new experiences, something that is not possible using the services in isolation.

Platforms are the further evolution of services. If you look at what is inside a platform, you will see a set of components implementing its functionality, and those components are hidden from users. This is where the literature on services and platforms normally stops, but there is actually more to it.
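The API/SPI distinction can be sketched in a few lines (a hypothetical example, not the speaker's code): the platform exposes an API surface for clients, and calls outward only through an SPI abstraction, so the provider behind it can be swapped without exposing internals.

```python
# API vs. SPI: clients call the platform's API; the platform calls
# external providers only through the SPI interface it defines.
from abc import ABC, abstractmethod


class FraudCheckSPI(ABC):
    """Service provider interface: the platform calls out through this."""
    @abstractmethod
    def is_suspicious(self, amount: float) -> bool: ...


class PaymentsPlatform:
    """The API surface clients see; internal components stay hidden."""
    def __init__(self, fraud_check: FraudCheckSPI) -> None:
        self._fraud_check = fraud_check   # pluggable provider

    def charge(self, amount: float) -> str:
        if self._fraud_check.is_suspicious(amount):
            return "declined"
        return "charged"


class SimpleFraudCheck(FraudCheckSPI):
    def is_suspicious(self, amount: float) -> bool:
        return amount > 10_000


platform = PaymentsPlatform(SimpleFraudCheck())
print(platform.charge(50.0))       # → charged
print(platform.charge(50_000.0))   # → declined
```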

Another important part of the design is that platforms also include a variety of data stores. Transactional processing, the things happening right now in the business, is reflected in those data stores. And if you go a little deeper, you will find that modern platforms do not use just one database; in fact a whole range of things can serve as their data stores. Operational stores, often relational databases, handle the transactions. When you need insight into things that happened in the past, you want an analytical store, and big-data stores are a very popular choice in this area. Another thing that is becoming very important is processing things as they happen in real time, and for that we need stream processing; Apache Kafka is one of the dominant examples. Looking at this picture, we see that the software side alone is not enough anymore: the data side of platforms can be significant and very complex, addressing operational, analytical, and real-time workloads.

Now, the platforms are using these services, but how is the whole enterprise going to benefit? This is where we come to the idea: let us collect the data from various sources in the enterprise into one place, where we can analyze it and offer it to consumers. That leads to the notion of the data lake. In the data lake approach, the platforms export the data they have created, and all of that data is supposed to go into the lake.

There are some problems with this approach. The idea is that systems just dump their raw data into the data lake, and the consumers apply transformations on read in order to extract value from the data. That seldom succeeds, particularly in large data lakes, and you end up with platforms producing some good data, but also data of poor quality. As an example, imagine a customer platform whose job includes collecting the contact information of users. Not all users are going to give us their proper first and last names; you may find entries where the customer claims to be "Mickey Mouse". Data like this gets collected and ends up in the data lake. So now your data lake has some data that is correct and some data of poor quality. The consumers would like to use this data in the lake, but they find lots of problems, and they are not happy with the solution.

So what happens in a scenario like this? A very interesting situation that really degrades the capabilities of the data lake: we have data friction, friction that flows from happy producers of data to unhappy consumers. The data producer teams are on the platform side; they do their transaction processing and collect information about their users.

"Mickey Mouse" is one of the entries, and it gets dumped into the data lake. The team that produced the data is happy: their transaction processing works, and the data they create is simply sent over to the lake. All the way on the right we have the consumers of the data, say the analytics department. When they want to analyze information about customers, they access the data, find "Mickey Mouse", and they are not happy. So they go and talk to the data lake team. The data lake team, which is typically small, is far removed from the source of this poor data, yet they are the ones asked to fix it. This is why organizations often stand up dedicated data quality efforts; third-party tools are brought in, and the data team spends its days trying to fix the problems.

The big issue is that we now have unhappy consumers of the data, and a data team asked to resolve problems that were not caused by them, a team far removed from the source that actually produced the incorrect data. That leads to great strain. A further problem is that you cannot scale such a data team across very many domains. In a small organization you might have a data team that truly understands the business and can take care of the data, but a very large enterprise, or an enterprise that keeps acquiring companies in new businesses, has no real chance of catching up with the subject-matter knowledge for all of these domains in order to do the data work for the consumers.

So what should we do? The first thing is to understand the split of data that exists in the domain. If you look at a platform, you will see that the operational part is handled and managed in the platform, while the analytical data is dumped into the data lake. The conceptual problem is exactly this split: data that belongs to the domain lives partly in the platform and partly in the data lake, in two very different places.

This is the essence of the proposal: the notion of the software platform should not be limited to software only, but should also include reasoning about the data inside the platform. There will be data stores; there will be pipelines moving data from one place to another and performing various transformations. One of the key points is that besides software interfaces, the platform should offer data interfaces: the means through which the platform consumes data from other sources, for example a Kafka stream, or the ingestion of a file that contains additional information.

The critical thing is that the platform should offer the notion of a data product. Besides their APIs, the team treats the data they produce as a product: it is defined with the same care and rigor as the API, there are quality guarantees and service-level agreements on how the data is maintained, and there is, essentially, pride of ownership of the data the platform produces.

What is fundamentally different in this approach is that the team producing the data, the team responsible for its quality, is the team that owns the domain. They sit right at the place where the data is produced, and if there are issues that need to be fixed, they are the closest to the problem, so they are the best equipped to deal with the data product. In addition, control APIs are used for monitoring, auditing, and other typical common data services.
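One way to picture a data product is as data published together with its ownership and service-level metadata. The descriptor below is a hypothetical sketch (field names, teams, and endpoints are all invented for illustration), not a standard schema.

```python
# A data product = the data plus declared ownership, quality, and SLAs.
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    name: str
    owner_team: str              # the domain team accountable for quality
    schema_version: str
    freshness_sla_minutes: int   # how stale the data is allowed to be
    completeness_pct: float      # declared quality level
    endpoints: list = field(default_factory=list)  # polyglot outputs


customers = DataProduct(
    name="customer-profiles",
    owner_team="customer-platform",
    schema_version="2.1",
    freshness_sla_minutes=60,
    completeness_pct=99.5,
    endpoints=["s3://lake/customers/", "kafka://customer-updates"],
)
```

The point is that consumers can read these declarations up front instead of discovering quality problems after the fact.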

Now, we are saying that this leads to accelerated insight. Why? Imagine a hypothetical application: you have a customer platform that holds data about the customers, and a platform that keeps track of customer interactions, such as which websites they visit and which ads they click. Imagine we would like to improve the richness of the data, so we obtain a third-party data file that gives additional detail about the customers, maybe from some e-commerce stores and so on. Now the marketing platform can use the data produced by the customer platform, the customer interactions, and these third-party files; it uses those data sources to do its various processing and creates its own data product: marketing recommendations for individual users.

In this scenario, suppose there is a problem with the quality of the customer data. The marketing team using the data can recognize it: "Hey, we have Mickey Mouse as a customer. This is not right. Let's go to the customer platform and talk directly with them." And there is a very concrete thing that team can do: at ingestion, when a user is entering the data, perform a few checks. You improve data quality at ingestion, as opposed to trying to fix it all the way downstream when the data is used for analytics. Notice that in this example the data lake team is completely skipped; they are not involved in dealing with the wrong data at all. What they do instead, as we will see on the next slide, is provide the platform infrastructure that enables all of these capabilities.
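A minimal sketch of what "a few checks at ingestion" could look like (the blocklist and rules here are illustrative, not PayPal's actual validation): reject obviously bad contact records at the source platform instead of letting them reach downstream consumers.

```python
# Validate a contact record at ingestion time, at the owning platform.
PLACEHOLDER_NAMES = {"mickey mouse", "donald duck", "test test"}


def validate_contact(first: str, last: str, email: str) -> list:
    """Return a list of problems; an empty list means the record is accepted."""
    problems = []
    full = f"{first} {last}".strip().lower()
    if full in PLACEHOLDER_NAMES:
        problems.append("placeholder name")
    if not first or not last:
        problems.append("missing name field")
    if "@" not in email:
        problems.append("malformed email")
    return problems


assert validate_contact("Mickey", "Mouse", "m@example.com") == ["placeholder name"]
assert validate_contact("Ada", "Lovelace", "ada@example.com") == []
```

Fixing the record here, where the domain knowledge lives, is far cheaper than asking a distant data lake team to clean it later.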

One really important point: each of the platforms will effectively be operating the things we used to see in the data lake. When you look at how this is typically applied, particularly in a cloud environment, all of these data products can actually reside in the same cloud, so physically they are next to each other, but there is a logical separation and logical ownership: the platforms own their data and are responsible for it. You then have direct relationships between the consumers of data and its producers.

There are a couple of important things we need to address to make this work smoothly. Inside the platform you need mechanisms for storing data, for processing data, for data movement, and so on. Is every platform going to invent these mechanisms on its own? The answer is no, because that would not scale and you would quickly end up in chaos. What is essential is an infrastructure platform, in fact a self-service data infrastructure platform. This platform provides a variety of data capabilities as self-service interfaces: there is a way to create a file in storage; there is a way to create, say, a MySQL or Oracle database as a service; you may have a variety of other databases, such as a graph database; and you have Kafka for messaging. The user does not need to set up and maintain a Kafka cluster or install a MySQL database; all of this is provided by the infrastructure platform. The user simply uses an interface that says: I would like to get a Kafka topic, and I am expecting this many messages per second of this size, which helps a little with the selection of the implementation.

All of these capabilities are provided by the data team. The data team now deals with infrastructure, exposing all the data infrastructure capabilities through a set of polished self-services. Polished self-service means that you can go and use these capabilities without being forced to talk to anybody. With that, the owners of a platform can use the services: they can, for example, create an ingestion from a file in object storage, put the data in a SQL store and do some processing there, and then expose the resulting data through another file or possibly even a streaming mechanism, with the infrastructure platform taking care of the underlying technology.
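The "I would like a Kafka topic, expecting this many messages per second" interaction might look like the sketch below. Everything here is hypothetical: the function, the sizing rule, and the response format stand in for whatever self-service API the data team actually exposes.

```python
# A self-service provisioning call: the requester states intent (expected
# load), never implementation details like brokers or partition layout.
def provision_topic(name: str, msgs_per_sec: int, avg_msg_bytes: int) -> dict:
    """Derive a partition count from the declared load; the requester
    never touches the Kafka cluster directly."""
    throughput = msgs_per_sec * avg_msg_bytes          # bytes per second
    partitions = max(1, throughput // 1_000_000 + 1)   # naive sizing rule
    return {"topic": name, "partitions": partitions, "status": "provisioned"}


req = provision_topic("customer-updates", msgs_per_sec=5_000, avg_msg_bytes=400)
print(req)   # e.g. {'topic': 'customer-updates', 'partitions': 3, 'status': 'provisioned'}
```

The design point is the shape of the interface, not the sizing formula: intent in, a ready-to-use resource out, no ticket to a human required.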

You can imagine building this with what you have in the public cloud: the services the public clouds provide give you the ability to use this infrastructure as a service, and if you go to AWS or Google you will find APIs that let you do all of this. The problem is that you may still end up in chaos. What you need is a strongly opinionated set of paths: how do you want things done in your organization? This bottom layer is a set of services and capabilities established by the centralized data organization in the company. They define how you do security; how you deal with private information, where you have, for example, the GDPR regulations, the right to be forgotten, and data subject rights; the governance for privacy; the rules for data at rest. In short, how you deal with data in the organization. All of this needs to be defined by the centralized organization and put in place, so that the platforms use these capabilities and stay in sync with the practices of the enterprise.

One of the critical elements here is the data dictionary and the data catalog. They give you insight into what is where in the enterprise, and they describe all the data stores with their metadata, including what is private and what is sensitive information in a given data set, and so on.

So you have three things: a set of rules for how to do things, set up by the centralized team; a self-service data infrastructure platform that builds on what is typically offered by the public cloud but hides it behind much nicer interfaces; and finally the platform owners, who use these capabilities to create data products that can be directly consumed by other platforms. Note that a data product can be exposed in a polyglot fashion: for some clients it is better to expose the data as a file, for others you may want to put it in a relational database, and for yet others you may want to stream it as a Kafka stream. So you have a polyglot approach.

The ideas we have here are not necessarily new, so I will point you to some further reading to go deeper into this topic.

The reading I particularly recommend is the article that started the idea and introduced the term data mesh: "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh". In it, Zhamak Dehghani starts from the deficiencies of the conventional data lake and explains the path that leads to the data product and to domain ownership of data that can be rapidly combined, generating insight faster. It is an excellent article, highly recommended.

But there are other things you can look into beyond the idea of the data lake. When you approach it from the platform side, it is worthwhile to understand what you need to introduce to go beyond services in order to create ecosystems of software components that give value to the whole business. There is a great article from Harvard Business Review, not focused on software, that will give you great ideas on how to move past purely tactical thinking toward providing larger ecosystems of value that can support changing business needs. Another one I thought you would enjoy is the article "The Future of Data Engineering". One of the things really relevant to our thinking here is its steps toward the decentralized ownership of data: moving away from the centralized data lake principle, you have a number of stores that belong to different domains.

One thing Chris Riccomini addresses very well there is the importance of automation in order to get agility. You want services that provide the infrastructure, and you want automation that makes manual human activities less important; eventually you would like to eliminate them. Let me give you an example of what that means. If I go back one slide: imagine that over here you have the ingestion of a file, some transformations of the data happen, and eventually you produce your data onward. Imagine the way you deal with it is this: you ingest a file, then you do the processing, you read through the records, you do some filtering and transformations; you put the result in a relational store, do some more transformations, and then put it into another file. When I said automation, I meant that you can fully automate these things. For example, with a tool called Terraform you have infrastructure as code. That means that in a public cloud you can define with Terraform the space that holds a file in object storage; Terraform will have instructions to have a database created, say a MySQL database, for you; and then you have another file as the output. That is one part. Then you have the additional steps that should be defined in your data pipeline, which may involve, for example, Airflow, an orchestration framework for the transformation of data; with Airflow you define what the steps in the processing are. With all of that you can define all of the data infrastructure that is in the platform, and you can define all the processing that is happening, including the containerized deployment of the processing into the environment. All of that is fully scriptable and automated. If you want to be able to instantiate the whole platform, the data part with Terraform and the software part with Kubernetes would be the ideal choice.
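The ingest → transform → load shape described above can be sketched as plain Python functions; in practice each step would be a task in an orchestrator such as Airflow, and the stores would be provisioned with infrastructure-as-code such as Terraform. This is an illustrative stand-in, not an actual Airflow DAG.

```python
# The pipeline shape: ingest a file, filter/transform the records,
# load them onward to the next store.
def ingest(raw_lines: list) -> list:
    """Parse raw lines into records, skipping empty ones."""
    return [{"name": ln.strip()} for ln in raw_lines if ln.strip()]


def transform(records: list) -> list:
    """Drop placeholder records and normalise casing."""
    return [
        {"name": r["name"].title()}
        for r in records
        if r["name"].lower() != "mickey mouse"
    ]


def load(records: list) -> int:
    """Stand-in for writing to a relational store or another file."""
    return len(records)


# Wire the steps together exactly as an orchestrator would.
loaded = load(transform(ingest(["ada lovelace", "mickey mouse", ""])))
print(loaded)   # → 1
```

Because every step is code, the whole pipeline, including the infrastructure beneath it, can be version-controlled, reviewed, and re-created on demand.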

That enables you to fully automate things and instantiate the whole platform, for example for testing: you can instantiate the whole thing from beginning to end, test it, and then throw it away, so you have repeatable processes.

To summarize: what we have is a model in which the platforms logically own both the operational and the analytical data. It is not the data lake that defines the data product; the platform provides a high-quality definition of the data, with service-level agreements. The platforms use the same infrastructure as the data lake, so there is no waste: what you already have as a data lake can be reused for this. And the platform teams use self-service data infrastructure. Now, the question is: is this for everybody? I would say it is not. For small teams and businesses that are not changing much, having a data team that fully understands the business, the quality, and the other issues might be the sweet spot. But in a very large enterprise that is growing, where the platforms are sizable and have rich software and data needs, and where new business capabilities and new business offerings keep being introduced, this platform-oriented data approach looks very attractive, with the ability to speed things up in the enterprise. And with that I will finish my talk.


Conference: Global Artificial Intelligence Virtual Conference