About the talk
In this session, we will discuss the differences between Cassandra and Cloud Bigtable, why and how Spotify migrated some of their workloads, and how they built an auto-scaler for Cloud Bigtable.
I'm going to go through some history of what storage at Spotify looked like in past years, then Sandy is going to continue with some details about Cloud Bigtable. Then I'm going to go through the migration process that we are doing at Spotify, and then I'm going to add some details about the autoscaler.

So, some history. When I joined Spotify, which was three or four years ago, or rather two and a half years ago, we had just started moving to GCP, if I remember, and I've been exposed only marginally to what was there before. What was there before was physical, on-premises data centers. We had four of them, across the US and Europe, and so we used to manage our own machines. Being in the storage infrastructure team at the time, and I still am, the database landscape basically looked like this: we had Cassandra and Postgres. So that means we also managed databases, not just the machines.

Cassandra is extremely popular at Spotify, and I'm going to start by telling you why it became so popular. In one word, it is versatile, so it can do anything quite well. Performance-wise, it has a lot of knobs that you can tune to optimize almost at the use-case level. I would say it also scales well, from very small, on the order of megabytes, to terabytes in terms of size. And it comes with a handy query language similar to SQL, and lightweight transactions, so you can express things like INSERT IF NOT EXISTS, which is a pretty powerful feature. That is how it became the standard database at Spotify. Here are some numbers; it took me quite a while to find them. They are from about one and a half years ago. So that's a lot of machines and clusters.

But Cassandra, even if it was widely used, is not perfect. It comes with a set of problems. I would say the biggest is the fact that storage is coupled to compute, and this makes topology changes very expensive. For example, if you are running out of disk, you are forced to scale out your cluster even if you don't need more compute, and with topology changes the data needs to be shuffled around.
So that takes a lot of time, because you physically shuffle all the data. As I said, Cassandra is very tunable, which is good, but it's also bad sometimes because it's too much: it has a very long learning curve. It means teams need to be experts at operating Cassandra. We help with clusters, the most critical ones, but in general we believe in the ops-in-squads model, so the teams manage their own clusters. So yeah, a pretty high operational burden in that sense. And it scales until it doesn't, pretty much. We had some cases where we really felt like we couldn't add any more nodes to a specific cluster, or everything would blow up. And the Cassandra version we used, which is not the latest, doesn't come with backups out of the box, so that was also tricky.

In my team we used to build tools, basically, to alleviate these pains. So we provided automation to other teams in the company, and all sorts of things. For example, we have tools for configuration and topology changes. One of them is cstar, which we open sourced. It is basically a parallel SSH that is topology-aware, so it knows, for example, if you want to restart Cassandra nodes, in which order you should do it to avoid having unavailable data ranges. We also have Cassandra Reaper, which is maybe the most famous one; it's about anti-entropy repairs. We built tools for all sorts of changes, and we eventually got to a pretty good spot. You could pretty much say, "Give me a cluster with this number of nodes, with this machine type, with backups and exports on a schedule," and so it was quite easy. But there was a lot of work behind it, and some operations were still extremely expensive: imagine you want to update the operating system on thousands of hosts; it takes time.

At some point we started looking at cloud providers and eventually chose Google. I mean, the motivation was framed at a much bigger scale, of course,
but the idea was basically to move up the stack, in a sense: to free up as much of our time as possible and do things more closely related to our business logic instead of running infrastructure and managing machines. It should also be cheaper. And yeah, that was basically the idea. It should also be possible, having more time and working at a higher abstraction level, to add value on top of the cloud services, and we have many success stories of this, like tooling for scheduling jobs and for data jobs, and the autoscaler, which I'm going to talk about later. I would say we mostly succeeded in achieving these things, and now we're fully on GCP. But even so, on the storage side the situation looked like this for a while: basically just a switch from physical machines to VMs. It worked fairly well, but at some point we asked ourselves, can we use something that Google is running, like managed storage, right, like Bigtable, to ease all these pains?

I just want to provide a brief overview of Cloud Bigtable and where we think it delivers value to customers like Spotify. Really quickly, what is Cloud Bigtable? Cloud Bigtable is a petabyte-scale, fully managed NoSQL database. The original Bigtable technology was built to index the web, and in its history it has served billion-user applications within Google, like Gmail and Search. The cloud product is meant to serve use cases where low-latency data access, scalability, and reliability are critical. As a fully managed service, it scales seamlessly, and we're also well integrated with the Apache ecosystem, offering an HBase API. Bigtable sits within a broader portfolio of database and storage products, all of which remove the burden of building and managing storage infrastructure. Specifically, it sits in the non-relational space as the wide-column store, again offering super high performance in terms of latency and throughput.
So what can Cloud Bigtable offer you? Where does the value come from? I tend to focus on four areas: speed, scalability, full management, and integrations. I'll talk about speed and scalability more in a second, because I think that's what's differentiating about Bigtable. And integration with a broader ecosystem is what allows it to plug easily into your existing architecture. But I want to pause for a second on full management, because it's something we're thinking about very holistically: what can we do to lessen the operational cost of managing the database? Some of that, yes, is managing the infrastructure and database; we're having our SREs paged so that you're not paged. It also means delivering the availability, data durability, and security that you've come to expect from Google. But for me, it also means providing value-add tools to make the database easier to work with in every way. One quick example of that is Key Visualizer, which we launched last year at Next, and which is basically a heat map of your key space that shows data access patterns. So not only are we managing that infrastructure and keeping you from getting paged, we're also giving you visibility into your data and making the parts of the process that you own, like schema design, a lot easier. So as a managed service, we're thinking a lot about how to reduce that operational burden.

Jumping into speed: again, this is where I think Bigtable is unique. On the throughput side, we're offering 10 megabytes per second in write throughput and 220 megabytes per second in read scan throughput. I want to emphasize here that generally in Bigtable, reads and writes are the same price, so customers with write-heavy workloads in particular will see excellent price-performance because of that high write throughput. We also offer low-latency random data access, in the single-digit milliseconds.

On scalability, there are two things to focus on. First, again, Bigtable was built for scale and has the ability to handle internet-size applications. That's because performance, which we measure in QPS, or queries per second, scales linearly as you increase the number of nodes. I think our published guidelines say that you should expect 10,000 QPS per node.
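
A back-of-the-envelope sketch of what linear scaling implies for sizing, in Python. It assumes throughput is the only constraint and uses the 10,000 QPS per node guideline as a constant; the function names are illustrative and not part of any Bigtable API, and real capacity also depends on storage, row size, and the workload mix.

```python
import math

# Guideline figure quoted in the talk; actual throughput varies by workload.
QPS_PER_NODE = 10_000


def nodes_for_qps(target_qps: float,
                  qps_per_node: float = QPS_PER_NODE,
                  min_nodes: int = 1) -> int:
    """Estimate the node count for a target QPS, assuming linear scaling."""
    if target_qps < 0:
        raise ValueError("target_qps must be non-negative")
    return max(min_nodes, math.ceil(target_qps / qps_per_node))


def cluster_qps(nodes: int, qps_per_node: float = QPS_PER_NODE) -> float:
    """Estimate the aggregate QPS of a cluster under the same assumption."""
    return nodes * qps_per_node
```

Under these assumptions a 45,000 QPS peak needs `nodes_for_qps(45_000)`, which is 5 nodes, and the estimate grows in direct proportion to the target: the per-node figure stays at roughly 10,000 QPS.
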
And that's true whether you have three nodes or 300. And second, you can scale up and scale down with Bigtable super easily. Again, this is what makes Bigtable price-effective for spiky workloads, because you can scale up to handle spikes and back down when things cool down. Emilio is going to speak a little bit about how they leveraged this characteristic in building their autoscaler to get cost savings.

To dive a little bit deeper into how Bigtable is able to deliver that performance and scale, I want to touch a little bit on how Bigtable works. Bigtable separates the processing layer from the storage layer, which allows us to optimize throughput by adjusting the association of storage and nodes. In this example we start with three nodes and splits of the data that are stored separately in our storage layer. In the rebalancing example, we might be seeing heavy traffic to data group A, so node one is experiencing a large load. In this case Bigtable is going to automatically rebalance and move some of that traffic to another node, ensuring better overall performance. And again, resizing comes into play when another node is added to the mix. Again we're able to rebalance without any downtime, because instead of moving data around we're just reassigning the node to a different data group.

I also wanted to touch on one feature that's been recently made available and that Spotify has leveraged pretty intensely, which is Cloud Bigtable replication. Replication allows you to increase your availability, with a 99.99% SLA, by allowing you to fail over in the case of a geo disaster. It also allows you to isolate batch and serving workloads. And finally, if you're replicating globally, like Spotify does, you can serve a global user base by offering customers low-latency data access to local replicas. There are a few things that I think are special here. You write data once, you can write data to any one of those replicas, and it's automatically replicated everywhere with eventual consistency. And finally, there are no manual steps.
So you're not writing tools to repair data or synchronize writes and deletes. Replication just works and will eventually be consistent. If you're interested in learning more about replication, I hope you'll join Carter Page, our engineering manager, who is talking about it tomorrow. That is DBS307 at 2:10 p.m., Building a Global Data Presence. That's my quick plug. I'm handing it back over to Emilio to talk about how they approached the migration.

So yeah, the migration at Spotify. First, just a short recap of what Sandy said: Cloud Bigtable is managed, it is eventually consistent but has no lightweight transactions, storage is decoupled from compute, and it is designed for large datasets. So the first thing you might ask yourself is: is it the same thing? Most of the time, can you map the use cases that you had on Cassandra to Bigtable? We found at Spotify that the most common scenarios map very well. One is eventual consistency, so low consistency requirements, which is fine when we have ephemeral data.
Maybe you have regionally partitioned data, and by this I mean that certain data is accessed only from certain regions; for example, European data is only accessed in Europe, and so on. There are also some cool features available, like single-row transactions, that come into play. And of course large datasets, which scale very well.

Other use cases just don't map as well. We found out that we had, as you can imagine, some cases where we used INSERT IF NOT EXISTS, and it's not a big deal: you simply have to find something else. We also have quite a lot, and it was kind of unexpected at the beginning, of these so-called small key-value stores, as we call them. We don't have much data, maybe on the order of megabytes to a few gigabytes, but you still need global replication in order to have low latency, and they have very low or sometimes very high QPS. These also don't map well.

So, summarizing, this is the picture that we found during our migration. There is a pretty big overlap between Cassandra and Bigtable for these low-consistency and high-throughput scenarios, and then we have these two use cases, the small key-value stores and the more consistent ones. I am fairly sure that there is nothing in the blue part of this graph, so it's one of these three cases in our experience. It took a while to find this out. I suggest that if you're doing a migration like this, you try to figure out these things beforehand. Don't just say, "Oh yeah, go and use Bigtable," because people are going to come back to you and say it doesn't work.

For us it was also kind of funny, because at the beginning two features that were very important for us were missing: backups, for which we built an internal stopgap solution, and global replication, which is now available. At the time we built a custom replicator on top of Pub/Sub, and it had potentially inconsistent behavior, because we could lose some data, especially, if I remember correctly, with something about the ordering of inserts
and deletes. So it was pretty funny that after two weeks... Anyway, these two features are now available and we use them a lot. And so now our migration is progressing slowly. We are, I would say, halfway. But yeah, we see Bigtable usage increasing and the Cassandra footprint decreasing.

As a success story of migration, I would like to talk about the play state cluster. This cluster is very crucial for implementing the Connect feature, if you're familiar with that. It stores basically the device state: we have multiple devices, like TVs and speakers, and it stores the states of these devices. Now, the state can be pretty big. It's low consistency: I mean, if you lose a state it's not a super big deal. And you need very, very high performance, because imagine you skip a song: that's a change of the state, and with so many users and so many devices it's like thousands of requests per second. With Cassandra, we used to have
150,000 QPS at peak in three data centers. We used to have 450 machines with 7,000 cores, more or less, and 2.5 petabytes of SSD just for this cluster. And it was one of the examples I was mentioning where Cassandra was really pushed to the limit: we couldn't add another node to this cluster. The team managing this cluster used to have, I believe, one incident a day during its last weeks of life. And as a fun fact, or I would say what was tragically bad about this cluster, we were using just very little of the disk for the actual unique data, because we had to over-provision a lot in terms of disk space to handle Cassandra compactions. You don't want to run out of disk space in a Cassandra cluster, so we had to over-provision a lot to prevent that.

On Bigtable it worked out very well. We migrated recently, like three or four months ago, and the team is very happy. They haven't had any operational headaches so far. We autoscale this cluster, we also have fewer nodes, and we pay for the data we use. It was also nice because we worked pretty closely with Google engineers, who were developing a new Java client.

And this is a chart of the cost drop the day we shut down the Cassandra machines. You can see that the green, the Compute Engine cost, doesn't go down to zero, because we also have the service nodes in the project, but nevertheless it's approximately a 75% drop in cost, which is pretty big.

Just a few words on how we
did this migration. It was an online migration, so we didn't say, "OK, Spotify is now disabled for two hours while we swap." It was always online. The way we did it: we implemented a storage interface based on Bigtable and ran it alongside the Cassandra code. Cassandra was the primary database for almost all of this process, and we slowly sent traffic to Bigtable in order to detect bugs, detect problems, understand performance, and in general gain confidence that everything works. By doing that, you can also compare the data that you read from Cassandra with the data that you read from Bigtable and see that it is the same. The more confidence you gain, the more traffic you can send to Bigtable, and eventually you have 100% of reads and writes, and when you feel good, you do the swap. It's very simple in this case because the data is ephemeral, so eventually your two databases converge to the same data; a few hours is fine.

That was the easy case. In the case where you have persistent data, so you cannot tolerate errors, like, let's say, our playlist cluster, it's much more difficult. You have to keep the two databases in sync, which is not the best, and there are issues all around deletes: even if you sync the data between the two databases, deletes don't behave as expected unless you do logical deletes. If you don't do logical deletes, you have to change the schema, which is also very painful, and we are working around this.

Another thing that we're working on is trying to abstract the database layer, so having an interface that abstracts away which database we are using: you may interact with Cloud Spanner, Bigtable, or Cassandra more or less in the same way. This is something more for the future, so if we have to do this again it's going to be much less painful. We could also implement pretty cool things
like doing all of this storage migration transparently for the user. It's a pretty cool project, and we'll open source it if it works.

That said, during this process we also developed an autoscaler, which wouldn't be possible with Cassandra at all, because, as I said, changing the number of nodes in a Cassandra cluster takes time; for big clusters especially it can take days. Bigtable, instead, is very nice to scale up and down within minutes. It started as a pet project, and it was taken from bits and pieces of other autoscalers; I think we had five autoscalers at some point, so we wanted one to rule them all. It's a very simple setup: a scheduler and a service with some state. Basically, it checks the cluster state every 20 seconds, I believe, and takes an action or not.

Before going more into the details of the logic, which is very simple, I just want to show the effect of using the autoscaler. Normally, when you have a constant amount of resources, in this case a constant number of nodes, you would see that you have a day-and-night cycle in utilization, and you can understand why that's the case, because you have more users during the day than at night. With the autoscaler you see the opposite: we can keep the utilization pretty much constant by changing the resources over time. This graph shows that it works pretty well. It also works for replicated clusters, because we do the scaling at the cluster level, so we can scale different regions independently.
And the logic, as I was saying, is very simple. Basically, the user defines a target load, and the autoscaler, given the current load and the current number of nodes, just tries to calculate the number of nodes for which the load is what the user defined. It's just a proportion, basically. Then we have some checks, because we want to respect some properties; it's a very simple state machine. For example, we want to respect the storage-per-node limit, otherwise the cluster will stop accepting writes. And the new number of nodes should be within the boundaries that the user sets.
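
The proportion and the checks described above can be sketched as follows, in Python. This is a minimal illustration, not Spotify's actual autoscaler: `ScalingPolicy`, `desired_nodes`, and the 2.5 TB per-node storage figure used in the example are assumptions made here, and the sketch treats total work (nodes times load) as constant across a resize.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """User-defined settings (all names here are hypothetical)."""
    target_load: float              # e.g. 0.5 for 50% utilization
    min_nodes: int                  # lower boundary set by the user
    max_nodes: int                  # upper boundary set by the user
    max_storage_per_node_tb: float  # assumed per-node storage limit


def desired_nodes(current_nodes: int, current_load: float,
                  stored_tb: float, policy: ScalingPolicy) -> int:
    """Return the node count that should bring load back to the target."""
    # Proportion: current_nodes * current_load == new_nodes * target_load.
    by_load = math.ceil(current_nodes * current_load / policy.target_load)
    # Never go below what the stored data requires.
    by_storage = math.ceil(stored_tb / policy.max_storage_per_node_tb)
    wanted = max(by_load, by_storage)
    # Stay within the user's boundaries.
    return min(policy.max_nodes, max(policy.min_nodes, wanted))


# Example values are made up for illustration.
policy = ScalingPolicy(target_load=0.5, min_nodes=3, max_nodes=30,
                       max_storage_per_node_tb=2.5)
print(desired_nodes(10, 0.8, 5.0, policy))   # load-driven scale-up
print(desired_nodes(10, 0.1, 50.0, policy))  # storage check wins
```

With a target load of 0.5, a cluster of 10 nodes running at 0.8 would be resized to 16 nodes, while a nearly idle cluster holding 50 TB would still keep 20 nodes because of the storage check.
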
Also, we don't want to scale too frequently. One reason is that we have a limited amount of admin operations, so there is a quota on administration, but we also have some latency effects associated with scaling up and down, which is something we are working with Google to solve. As you see, it's very simple logic, and for us it worked really well.

The only real problem we found is related to data jobs, Dataflow jobs in our case, because they cause sudden spikes of load on the cluster, and the autoscaler cannot scale fast enough. A recommendation from Google is that if you're running a very heavy Dataflow job, you should scale up about 20 minutes before. That's one solution. Another solution would be throttling the Dataflow jobs. It sounds pretty crazy, but it's fairly doable. In our case we use a framework, Scio, so we could implement in Scio something like "start slow and slowly increase the rate," in a way that the autoscaler can cope with. Another solution could be to use a predictive approach, or integrate some predictions, like Facebook Prophet. This would work if you have Dataflow jobs on a certain schedule, not at random times during the day. Scaling up beforehand was the recommendation, and it's more or less what we have been doing so far.

So yeah, to wrap up: overall, our teams are very happy when moving away from Cassandra. We think for us it's the way forward, because we want to be free from operations as much as possible. The migration is not an easy process. You still need to understand how things work; you can't completely forget that, of course. And I would say if there is one key takeaway for us in the storage infrastructure team, it is that you need to make the process as easy as possible for the user. So documentation, evangelizing migrations, tooling: do the most you can to lower the barrier. The autoscaler is a good example of value that you can add on top of this. Most of the time now, I'm not even sure how much money we're saving, but the time savings are a big win.