Parikshit Savjani leads Program Management for the Azure Database for MySQL and MariaDB services. He has over a decade of experience with relational databases. His past roles include PM for the SQL Server Storage Engine and the Azure Database for PostgreSQL service, and Solution Architect for the Microsoft Data Platform. His current focus is building a resilient, reliable, highly available, and secure open source database service in the cloud.
About the talk
Postgres is one of the fastest-growing DBMSs in the industry in terms of popularity. Its extensible architecture, combined with truly open source community development, makes it a very feature-rich database engine with an unprecedented speed of innovation. But as a developer or DBA, scaling your Postgres workload can be a complex, daunting task. Microsoft loves Postgres, and with its Azure Database for PostgreSQL Hyperscale (Citus) offering it has significantly simplified the scaling and manageability of your PostgreSQL workloads.
In this session, we will discuss the distributed PostgreSQL architecture of Hyperscale (Citus) and some of the common use cases and patterns where it shines. You will learn the concept of distributed tables and how you can apply it to achieve massively parallel processing with Hyperscale (Citus). The session will also give you a glimpse of the new Azure Arc data services platform, which allows you to deploy Hyperscale (Citus) anywhere, from multi-cloud and on-premises to edge environments, using Kubernetes. Come and attend this session to learn how you can leverage Hyperscale (Citus) to run your Postgres workloads at any scale, anywhere.
Welcome, all. Nikki Leonetti introduced me, but here is a quick introduction about myself. I'm Parikshit, and in my key focus area at Microsoft I'm the program manager for the Azure Database for PostgreSQL service on our cloud, Azure. I'm passionate about relational databases and have a lot of experience with them. Before this, I was a program manager for the SQL Server storage engine, so that is where I spent most of my time with relational databases and learned about them.
But this talk is not about Microsoft SQL Server or its products. It's about Postgres and this beautiful space, seen through the lens of what we learned as we invested in and built a world-class, fully managed Postgres service on our cloud platform. I want you to keep that in mind as I go through this session. I will talk about some of our learnings as Microsoft was investing in Postgres: what we learned through our market research, as we were talking to customers who wanted to build solutions on the cloud, and as we were collaborating with the Postgres community, and then how we are now contributing back to the Postgres community as well.
Without any further ado, let's get started. In India, we have a long tradition of worshipping the elephant god, Ganesha. It turns out that if you start worshipping the Postgres elephant as well, called Slonik, you might see a lot of benefits too, as Postgres is one of the most popular databases in the industry right now, and it's growing in popularity.
Let me give you some data points to back this claim. In the 2019 Stack Overflow survey, Postgres was one of the most loved and most wanted databases in the industry. Further, if you look at the DB-Engines ranking, Postgres was named DBMS of the Year for two consecutive years, 2017 and 2018. And if you break down the popularity trend of Postgres in DB-Engines, you will see that over the last five to six years its popularity has grown steeply, more steeply than that of any other relational database in the market.
What this also implies is that if you are a newbie in the Postgres area and you're interested in building your skill set, now might be a good time. And if you're a business decision maker, if you are in a startup, if you're trying to invest and put your business bets on Postgres, again, now might be a good time. This can be a good vote of confidence for a customer: okay, other customers are also doing it, so I think now is a good time for me to consider it seriously as well.
Why Postgres, and why now? As we were building this service at Microsoft, we were also curious about this trend and wanted to understand what exactly is going on behind the scenes. Postgres, by the way, is a two-decade-old product; it had already been in the market for some time. So why, in the last five or six years, did developers and customers get attracted to Postgres? Let me share with you some of our learnings, based on the market research. What are the top reasons why customers and developers love Postgres?
The first and most important reason is that it's truly open source. What do I mean by that? It is not owned by a single owner, company, or entity. It's truly community-driven: a group of core committers across the globe maintains the code base and contributes back to the code. This is one of the primary reasons. It gives you peace of mind. If you are a startup, it's peace of mind to put your bets on something like this, because then you don't have to worry about vendor lock-in, which means: will my code keep working? Will there be any audit? Will there be any licensing policies? Will I have to go with only one cloud? These are all practical challenges which businesses face today. But with Postgres being owned by a group of committers, you can rest assured that you don't have to worry about all of this and can simply focus on your own business.
Now, if you look at the core design principles which were put together by Michael Stonebraker and Lawrence Rowe decades back, when they were writing about Postgres, one of the core design principles is user extensibility. Extensibility is very powerful, and it is what makes Postgres so feature-rich. You have a variety of data types, ranging from arrays to geometry and geography; you can store IP addresses, and you have JSONB capabilities as well. Postgres lets these be built as low-level extensions on top of the database engine which integrate seamlessly with the product. As a result, you can use it for NoSQL scenarios, for IoT scenarios, and for geospatial scenarios. If you're working on location tracking and have to use location data, Postgres would be a good option for you. So this extensibility is what makes Postgres a feature-rich database which can be used across many scenarios.
It is also robust, reliable, and consistent. The fact that it has been in the market for 20 years tells you that it is reliable and that you can put your bets on it. Further, all the major development frameworks you can think of, Python, Django, Ruby on Rails, support Postgres. This makes it very developer-friendly. You are not betting on some project which is two years old. It's 20 years old, so the answers you're looking for have probably already been discovered and reported in some of the forums out there.
Next is the speed of innovation. Because it's community-contributed, the turnaround for bug fixes and innovations is much faster. Postgres releases a new version with new capabilities every year. This is very valuable for developers and customers, as they get the latest and greatest features within a year.
Finally, and this can be arguable, I think one of the core reasons you see enterprises betting on open source databases is that Postgres is available as a managed service from most cloud providers. That gives enterprises peace of mind: they have a throat to choke. Who will provide the support? Who will provide the management? Should I be focusing on my business, or on managing my database? The fact that it is available on all the major cloud providers also makes it very portable. It is a common denominator, so you can't go wrong when you're betting on Postgres.
So, in summary, I think the sentiment which is resonating across the community is that Postgres is clearly the new Linux of databases,
because it has very similar characteristics: it is highly extensible, it is developer-friendly, and enterprises bet on it. I think that makes it a great product in which to invest either your skills or your business bets.
Now, let's switch gears a little and look at what applications customers are building on the cloud, or what the common patterns of so-called cloud-native applications are. There are three major categories of applications customers are building in the cloud today.
The first is multi-tenant solutions. Think about startups or disruptors in the B2B or B2C space building software-as-a-service applications. Think about Salesforce, which is a disruptor in its market. Think about CRM applications. If you are in the U.S., there is TurboTax, which is tax-filing software as a service offered by Intuit. This tells you that demand is high for these multi-tenant SaaS applications, and there are more and more requirements for this kind of application. Now, what are the common database characteristics these applications look for? The database must be able to store multi-tenant data and provide security and isolation. That is a P0, a top priority, for them. Next, the database should be able to scale as user adoption grows: as they get more traffic and more customers, the database should scale transparently and linearly.
The second category we have seen is real-time operational analytics. With the growth of IoT technology and the maturity of data streaming products like Apache Kafka and Apache Spark, businesses can take real-time decisions and apply ML algorithms to build, say, a self-healing system or a predictive maintenance system. From the database context, the major requirement for this kind of application is a fast, high-speed ingestion rate, because the telemetry data is coming in not even every second but every few milliseconds, and your database should be able to keep up with that ingestion rate. You are also running queries across that data as it streams in. Those are the database characteristics needed for this kind of application.
The third category: with the growth of e-commerce and digital payment platforms, another requirement is for applications to have a hybrid transactional and analytical system. What do I mean by that? In your typical traditional banking systems, which were OLTP systems, you would build a separate data warehouse, extract the data with ETL pipelines, and load it. But with the new requirements, which are dictated both by regulation and by the businesses themselves, you need to run analytical queries on the OLTP data itself. You cannot afford to wait for the data to land in your data warehouse. Think about a credit card application where you want to do fraud analysis: as soon as a transaction happens, something has to detect it to prevent the fraud. That is the third kind of application we are seeing in the industry.
To quickly summarize, the common theme of the database requirements across all three types of application is that the database should be able to support massive data volumes and be able to process them.
So now, let's switch back to Postgres and see how its parallel processing capabilities are evolving. The pivotal release was Postgres 9.6: this was the first time the optimizer code was changed to support parallel query plans. Since then, new parallel plan operators have kept being added so that more and more queries can be parallelized. What that means for you is that as you upgrade to the latest and greatest release of Postgres, and as you scale up and add more CPUs to your server, you will start seeing faster query response times and your workload will be able to scale. But you can only go so far. You will practically hit a ceiling somewhere when you are scaling up. You can go to 64 vCPUs, then past a hundred vCPUs, but at some point you will max out, and natively the community version of Postgres has no distributed processing capabilities. This is one of the challenges Postgres has today.
How do you solve that? With the data explosion you're seeing, the demand from businesses and applications is to scale and process the data. You started small, but now you want to scale linearly, and you want to scale on demand, and that problem is still not solved by Postgres. So one of the key things to keep in mind is that distributed Postgres is the need of the hour, because of the data explosion you're seeing and the kind of distributed computing power which is now available to applications. With Kubernetes and the cloud, think about the scale at which applications can grow; the databases have to scale as well.
So, how are we thinking about this, and what was Microsoft's thought process in approaching this problem? Last year, we acquired a company called Citus. This is the mascot of Citus; it's called the Elicorn, a combination of a unicorn and an elephant. Citus had built an extension on top of Postgres which converts a single-server Postgres into a distributed Postgres with a completely distributed architecture.
It allows you to leverage the massively parallel processing capabilities available in the cloud, and to scale linearly as your workload demand and your business requirements grow. One important point: Citus is open source. The code is available on GitHub. You can go to the Citus repository (citusdata/citus) and browse the code. You can send pull requests, and you can also raise issues if you encounter them. Sorry, I don't know why the slides keep flipping, and I promise I'm not doing this.
Getting back: the Microsoft acquisition has not changed anything here, and we don't plan to change anything either. In fact, we are continuing to innovate on the extension, and there is a monthly release cycle which we follow. You will see new minor releases coming out pretty much every month, and you will also see constant activity, because as we release our own patches for our cloud version, we are in parallel releasing and updating the community release as well. So there is no change there.
One important point to keep in mind is that Citus is not a fork. It still runs on the community version of Postgres. This differentiates Citus from some of the commercial engines out there, which fork Postgres but maintain Postgres compatibility. With Citus you are still running on the community version of Postgres, and you don't have to be on Microsoft Azure to run Citus. That's one of the key things to keep in mind.
So how does Citus actually shard a database? What is the secret sauce behind it? It is the principle of partitioning, or sharding. When you install Citus, you install a group of Postgres servers with the Citus extension on top to form a cluster, or something called a server group. In a server group, one server is given the coordinator role, and all the remaining servers play the role of worker nodes. The distributed data resides on the worker nodes, and all the compute processing happens on the workers. The coordinator node is again a Postgres instance, but it only stores the metadata about all the worker nodes, and it is the one which intercepts your queries.
With this architecture, from an application perspective, your application doesn't have to change. It still writes the same set of queries, and it doesn't have to be aware of the Postgres worker nodes. You fire the same queries, they hit the metadata layer, and the coordinator fires a distributed query across all the worker nodes. The query is executed on each of the workers, the results are collated, and you get the results back. The only difference is at schema-definition time: when you're creating the tables, that is when you have to create what are called distributed tables and reference tables. This is where you define the key on which you are partitioning the tables.
Sorry, I know the flipping is annoying; I'll try to keep it under control as much as possible. Let me show you a quick demo of how Citus can give you powerful scale compared to a single node.
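The coordinator and worker roles described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Citus is implemented: the `Coordinator` class, the worker names, the round-robin placement, and the use of `zlib.crc32` as the hash are all made up for the example. In Citus itself, the schema-time declaration is made with the `create_distributed_table` and `create_reference_table` functions.

```python
import zlib
from collections import Counter, defaultdict

SHARD_COUNT = 32  # default shard count mentioned in the talk
WORKERS = ["worker-1", "worker-2", "worker-3", "worker-4"]  # hypothetical names


def shard_for(user_id: int) -> int:
    """Map a distribution-key value to a shard (crc32 is a stand-in hash)."""
    return zlib.crc32(str(user_id).encode()) % SHARD_COUNT


def worker_for(shard_id: int) -> str:
    """Round-robin shard placement; an assumption made for this sketch."""
    return WORKERS[shard_id % len(WORKERS)]


class Coordinator:
    """Routes rows and queries; the rows themselves live on the workers."""

    def __init__(self):
        # distributed table: worker name -> shard id -> list of rows
        self.storage = {w: defaultdict(list) for w in WORKERS}
        # reference table: a full copy on every worker
        self.reference = {w: {} for w in WORKERS}

    def insert(self, row: dict) -> None:
        """Send a row to the single shard that owns its user_id."""
        shard = shard_for(row["user_id"])
        self.storage[worker_for(shard)][shard].append(row)

    def add_reference_row(self, key: str, value: str) -> None:
        """Replicate a small lookup row to all workers."""
        for w in WORKERS:
            self.reference[w][key] = value

    def count_by_event_type(self) -> Counter:
        """Scatter: each worker aggregates its own shards. Gather: merge."""
        total = Counter()
        for shards in self.storage.values():
            partial = Counter()
            for rows in shards.values():
                partial.update(r["event_type"] for r in rows)
            total.update(partial)  # the coordinator only merges partial counts
        return total


coordinator = Coordinator()
coordinator.add_reference_row("PushEvent", "code pushed")
events = [
    {"user_id": 1, "event_type": "PushEvent"},
    {"user_id": 1, "event_type": "IssueCommentEvent"},
    {"user_id": 2, "event_type": "PushEvent"},
    {"user_id": 3, "event_type": "PushEvent"},
]
for event in events:
    coordinator.insert(event)

result = coordinator.count_by_event_type()
print(dict(result))
```

Because the shard is chosen from `user_id` alone, every row for a given user lands in the same shard, which is why a per-user query only needs to touch one worker while a global aggregate fans out to all of them.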
By the way, if you're not familiar with it, this is Azure Data Studio, again fully open source. We initially started it for SQL Server and then added support for Postgres, and it's highly extensible, so anyone can add MySQL or other extensions as well. If you are a UI-friendly person and you want to connect to Postgres, you might want to check it out. Azure Data Studio is a cross-platform tool, which means you don't have to be on Windows; you can be on a Mac as well.
So this is Azure Data Studio. On the left side, I have a single-node Postgres with 1 million records; on the right side, a distributed Citus architecture running on four nodes. Let me show you the data. For the schema, we have taken the GitHub events data and imported it into Postgres. Some of the GitHub events data is in JSON format. The way we have partitioned the data in the Citus database is by user ID: if you are a GitHub user making any check-ins or writing any comments, all of that is tracked in your own specific shard.
Let's fire this query. We have 1 million records in this database, and in parallel let me also fire it against the single node. This query executed in around 1 second on Citus. Let's see the count(*) query on the single-node Postgres: around 12 seconds. That is with four nodes. You might think it's a no-brainer: you're just using distributed computing, throwing more hardware at it and distributing the query, and that is how you get the results on a 1-million-record table.
Let's fire a slightly more complex query. This is an aggregation query. Say you are asked to run an aggregate to identify the number of comments per hour on GitHub, to see when users were doing more. Let's see the execution plan as well. On Citus the result is already here, again in about one second, and the single-node query took around 13 seconds.
Now, one of the beauties of Azure Data Studio, which we added in the latest release, is that you can see the explain plan for Postgres. If I look at the explain plan for the single-node Postgres, it is just doing a scan and then a group. Now look at the distributed query plan with Citus. See the difference? The plan changes. This is four nodes, but you are seeing 32, because by default when we create a distributed table, we create 32 shards. That way, if you decide to add new worker nodes later, the shards just have to be copied over; you don't have to re-shard. The data is partitioned into shards, and 32 is the default number, but you can go up to 128 shards as well, so you can scale out further as needed.
So this is the power of Citus. I showed you a very small table with 1 million records. Imagine you want to work on petabytes of data: you can just scale out horizontally, and Citus will do the trick for you.
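The reason for creating more shards than workers can be shown with a small sketch. This is a simplified illustration and not the Citus shard rebalancer: the `place` and `add_worker` functions, the worker names, and the round-robin starting layout are assumptions made for the example. The point it demonstrates is that adding a worker only changes which worker holds a shard; no row is ever re-hashed into a different shard.

```python
SHARD_COUNT = 32  # default shard count mentioned in the talk


def place(shard_count, workers):
    """Initial round-robin placement: shard id -> worker name."""
    return {s: workers[s % len(workers)] for s in range(shard_count)}


def add_worker(placement, new_worker):
    """Move whole shards to the new worker until the load is roughly even.

    Rows never change shard; only the shard -> worker mapping is updated.
    Returns the list of shard ids that were moved.
    """
    workers = sorted(set(placement.values())) + [new_worker]
    target = len(placement) // len(workers)
    moved = []
    while len(moved) < target:
        # Count how many shards each worker currently holds.
        counts = {w: 0 for w in workers}
        for w in placement.values():
            counts[w] += 1
        # Take one shard from the existing worker that holds the most.
        donor = max(
            (w for w in workers if w != new_worker),
            key=lambda w: counts[w],
        )
        shard = next(s for s in sorted(placement) if placement[s] == donor)
        placement[shard] = new_worker
        moved.append(shard)
    return moved


placement = place(SHARD_COUNT, ["worker-1", "worker-2", "worker-3", "worker-4"])
moved = add_worker(placement, "worker-5")
print(f"moved {len(moved)} of {SHARD_COUNT} shards to worker-5")
```

With 32 shards on four workers, each worker starts with eight shards, so a fifth worker can take over a handful of them as whole units. Had the table been created with only four shards, there would be nothing smaller than a full worker's data to move.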
Now, the next obvious question you might have is: who is using Citus? Is it production-ready? Who are the top users? We had the same question before we acquired Citus. Obviously, you want to evaluate the product, and you want to evaluate the customers they have. And we found that we ourselves were among the first customers of Citus. Our Windows telemetry system uses a Citus database to power the ship-room dashboard. Before Windows releases a patch, the team has to see in the dashboard whether it meets all the bars: whether all the test criteria are successful and everything is green, basically, to ship a release.
This data is 1.5 petabytes. It comes from 800 million Windows devices; the data is collated, processed, and stored in a 1.5-petabyte scale-out Citus database on 44 nodes, which means 22 plus 22 servers: there is HA redundancy, but it is a 22-node cluster. The total aggregated compute capacity is 3,000-plus cores and 18 terabytes of memory. Again, this is the power of a distributed architecture. It is the only reason the system can sustain petabyte-scale growth, and we can even add more nodes, because there is practically no limit to adding nodes: each node runs on its own set of data and does its own processing. Distributing the data is the important part here, but the application doesn't have to worry about sharding and other needs.
We have a detailed case study out there; if you want, you can take a deep dive in our blog. There is also a YouTube video from the principal engineer who worked on it; his name is Min Wei, and internally the project was called VeniceDB. If you search for "VeniceDB Microsoft" on YouTube, you should come across his talk on how we architected the solution.
So with Citus you solve the scaling problem, and you solve the performance problem of your application. Now that it scales, you can harvest more data, and the business is happy. But the problem now shifts into
managing 44 nodes. How do you manage this large cluster of servers? That's your next problem, an operations problem; if you're a DBA, it becomes your problem. There are two options. One option, obviously, is to run on a managed service in the cloud, on Azure. This is where we make money, and this is my small commercial plug. But let's say you don't want to go with Azure, or with Microsoft, for whatever reasons you have. You want to continue to run on-premises, or you might want to run on a different cloud. In that case, how do you manage the cluster?
Your new best friend in the open source world for managing a distributed system is Kubernetes, as you will all have heard. It is the go-to for managing a large distributed system like Citus as well, and that is how you can manage a large cluster. Let me show you how you can do this and how we have done it.
I have a Kubernetes installation on an Ubuntu VM. This is not Minikube; it is a full Kubernetes installation. Of course, it's a single-node server, not running a massive scale-out architecture, but it gives you a sense of how you can manage Citus using Kubernetes. If you see here, it's a single-node Kubernetes running version 1.16.3. What we have done on top of that is build a command-line tool which allows you to manage a Citus cluster.
You might have heard about Azure Arc. I don't know if you've heard about it, but Azure Arc is our vision for customers to run our products on any platform, whether on Microsoft Azure or on-premises. The way this is possible is by using Kubernetes as the fabric layer. You might want to run it on a managed Kubernetes service, or you might want to run it on VMs, whatever works best for you; the same applies to GCP or other cloud providers.
What you would do is create a Postgres instance. I provide the name of the Postgres instance I want to create, the namespace in Kubernetes where I want it created, and the number of worker nodes I want. Let me start with two. It is asking me what data volume size I want to provision. Again, there is no requirement for a specific kind of storage; you just need a persistent store which gives you high availability. You can choose to run it on S3, you can choose to run it on Azure Blob Storage, whatever works best for you. It asks whether I want a load balancer IP; I will go with NodePort. While this runs, let me show you what the existing ...