Oh, you think technology is your ally. But you merely adopted the tech. I was born in it, molded by it. I didn't see a typewriter until I was already a man; by then it was nothing to me but obsolete!The computers betray you, because they belong to me.View the profile
About the talk
RailsConf 2019 - rails db:migrate:safely by Matt Duszynski
When you're dealing with fifty million records, simple migrations can become very dangerous. Come and learn from my mistakes instead of making your own. We'll talk about what's going on behind the scenes in your database, and how to safely write and run some common migrations. Remember, uptime begins at $HOME.
Thank you all for coming to my TED talk. Thank you all for coming a real comp and thank you Megan for the wonderful introduction. I'm sure a lot of y'all have done this. In fact, all of y'all might have done this, but how many know how to do this? By the end of this talk, hopefully all of you. Before we get too far and I want to explain who I am. I'm at that doesn't steer said how you can find me on Twitter. You can find me on GitHub. Please do a fun fact. I can plug a USB cord in
the right way round first time everytime. That's not true. Actually I had you going as Megan said I work for a high-tech company in Southern California. We're probably the highest tech company. And we are building a platform for the Cannabis industry and the millions of cannabis users around the world. Sounds lofty. Dhh might take me to task for that. But if that sounds interesting to you definitely stop by the booth over in the exhibit hall and talk you can't miss it. We are always hiring and it's a
damn fine place to work. If you just like having a good time and you're happy with your current job stop by the booth anyway and RSVP for a party tonight, which is going to be awesome. I hope to see y'all there. So in my time at Weedmaps, I've been there for a little over a year and a half and unfortunately, I have taken down production. Before I go into that though, I want to do a quick audience participation if I can see can can you raise your hands if you use a relational database
like postgres a bicycle as your primary data store? Well, you're ready to talk about database migration. So I figured that was about about what you'd expect second question. How many of y'all have tables with 1 million rows are more in that database. Okay, good good in this talk is very applicable to y'all because the sort of migrations that work pretty well 450 records tends to not work so well for 50 million. X so for our first example, let's talk about something that I think you would consider pretty simple adding a column to the database.
What's a we have our users table? Whenever application we've got millions and millions of people using it because we're a really successful startup. And we want to be able to mark a user account as active or inactive. I don't know how you got this many users without needing to mark them active. But bear with me for the sake of this example. Pretty simple database structure. We had a column active. It's a Boolean true-false. And we probably want to Mark our existing users active. So we don't lock everybody out of the
system. So in order to do that, you might write a migration that looks something like this add column users active booing default to true. It'll get the job done, right? wrong No, one can login. No one can sign up. Everyone is furious. Your app gets raped on Yahoo News. Your Reese's get angry. It begins the downward spiral somehow a plane crashes into a mountain. We're not sure how that happened. But in order to understand what's going on here we have to
talk about what's going on under the hood with locks. Lox locks are the mechanism that your relational database uses to ensure that multiple operations. Don't update the same row at the same time. There are a lot of different lock modes. Unfortunately too many to go into detail here. But to give you an idea they range from an access share which basically is the lowest level. I'm accessing data, but anyone else can access it to I'm not doing anything dangerous all the way up to access exclusive.
Which is no one else can do anything with this piece of data that I'm trying to access because I'm doing something that is potentially dangerous like updating or deleting. It might change. When you're doing a database migration even one as simple as the one we saw earlier. It will obtain an access exclusive lock on the table in question and prevent any other connections and the entire rest of your application from reading or writing data to that table until the migration is complete.
So let's watch that in real time. real ish time with our example users table You'll have to imagine the remaining 9.999 million users. I could only fit so many on the slide. But our migration begins running in the first thing is going to do is obtain that access exclusive lock. Can I add the column so far so good? And then we're going to begin filling that column with active true. So people can login. So our little elephant here is postgres doing its thing. We marked the first user Alice is active.
Keep it going so far. So good. We Mark Bob active. first problem Alice is trying to login. But you can't because the entire table is locked your application can't read data from it. And she stuck there at a hug and screen or 500 error or she goes and uses your competitors application. Overall not good your migration still running though. Mark to the next column through Charlize active now now Bob's trying to login same problem can't do it can't read the user data. Can't log in.
The migration is continuing to run and users are continuing to try to login and you wind up with this pile up of queries that can't be responded to and this pile up of users that are very upset and migration finally finishes running. Releases that locked now that it's done filling on the datum. And what you're left with is a Quadra of dissatisfied customers. Probably not what you were planning on when you decided to Mark users as active or inactive. So, how do we avoid this? Oh, don't do it. First of all.
Don't add columns with a default value. That's a dangerous operation because of the Locking mode. It uses and can and will cause downtime if you have enough reason your database and enough traffic on your system. It is worth mentioning at this point that postgres 11 actually addresses this particular problem in certain circumstances adding a static default value. No longer requires obtaining a table level access exclusive lock. But note the caveat under certain circumstances there other circumstances were adding a default value for example backfilling a new uuid column.
Will obtain that lock and it will be a dangerous operation also. It's hard to stay up-to-date on that latest version. For example, we use Aurora at Weedmaps. And Aurora does not support postgres 11. The highest version. They have is ten 6. So you're at the best of your database your sass provider your DBA your Ops Team to be on the latest version to be able to take advantage of this new features, which you might not be. So General could practice don't add columns
with the default value Suite. We know what to do now. We're not going to add that calling with the default value get rid of that little bit right there. We're at it without a default Value First we're going to change the default once the column has been added and then we're going to go back an update all of our users to mark them as active and of course we split it into up and down because we're good rails developers and our migration should be reversible and we tested this in development and everything is hunky-dory nailed it.
not so what's happening this time? Sorry, just wanted to see that Loop one more time. What's happening this time? Now we got to talk about transactions. Transactions physically combine multiple database operations into a single All or Nothing operation either all of your statements succeed within the transaction and it is committed or something fails violates an Integrity constraint. Whatever goes wrong and the entire operation is rolled back. So transactions provide for guarantees atomicity consistency isolation and durability, which you might hear referred to by the acronym acid.
Check out our sister site LSD maps to learn more. Not a real product. So this is a topic for a talkative itself. Probably many stalks and we don't necessarily have to go into it and great dep here. So if I said to say that the consistency in isolation that a transaction provides are guaranteed by locks specifically consistency means that any operation will take your database from a valid state to another valid state if you write a row. Row has to be valid in order to be saved and it went over to be read by subsequent queries.
Isolation guarantees that uncommitted transactions are uncommitted operations from one connection can't interfere with the operation of another connection. If the transaction has been committed the database hasn't made that consistent transition to a new state and so other connections can't read the data and there's some interesting stuff going on under the hood with topples and indexing and multi-layer indexing and again not going to get into that right now. Come find me at the booth because I would love to know that I'm database stuff with you now.
When are roads being updated and exclusive lock is issued to prevent other people from updating that same row make sense? Right if I've got some data to write and you get some dated a ride and we try to do the same time. What's going to come out? Well, we don't know that violates the consistency principle of relational databases. And so you cannot do that updates to Rose are always made in a Serial man, and that cereal order is guaranteed by locking within a transaction Luxor issued on a first-come first-serve basis.
They live for the duration of a transaction, even if the statement that requested the lock has already executed super most of your database operations. This might not be a very big problem because they usually happen on the order of milliseconds where it does become a problem is as you saw earlier when you have to perform millions of database operations on a very large data set it tends to take a long time. So, how does this tie into migrations? Well through a little bit of that rails magic that we all know and love everything
you do in a migration is automatically wrapped inside a transaction. So but you can imagine what this looks like, but fortunately you don't have to because I have more mock tables. We're going to start after our columns been added and it's time to Mark our users active again. Starting with ro one we do not lock the entire table this time because we're not adding a default value for the column. We're just individually updating all of our Rose. However, in order to
update a row you have to obtain a lock on it. Do we Mark our first rose active Lockett Market active true move on and even though that was successful as I mentioned that lock doesn't get released until your transaction can commit and that can't happen until you update all of the Rose just like you wrote in the migration. So now we're on road to We obtain another lock for that row we Market is true. We keep going in this manner. And now Alice is trying to login. And unfortunately not
real is locked so we can't write data to it. How Charlie's row is locked Mark him active one thing to note about this is that it is actually better than the first failure case that we described because some of your users those that haven't been updated to be marked as active are not locked yet during this migration and can login successfully. So Doug is a happy user. Then he goes off on his merry way to use your application. None the wiser that there are any users
experiencing issues. And now we move on to his row and now more users are trying to login. Unlock that we Market is true. And Doug is now active fortunately for him. He already logged in. And now we're migration is done. All of the locks are released as the transaction is committed and you're left with slightly fewer but probably know less angry user again. Not what you were intending to do. Now if you were listening closely, you might have noticed I said camper
right data because update locks don't prevent you from reading this previous value of the row until that transaction is committed in the value is updated. So When are users login? We're just reading the password we can read data. Why is this a problem? Are you sure you're not making updates when your user logs in? You might want to consider it. And even if this particularly contrived example doesn't necessarily apply to you. I'm sure there's a table in your application that you can't afford to block updates and right
traffic to bring e-commerce application that might be your orders table. If your ass ass blocking service that might be your posts table. There's always some reason for your application to be writing data and while you can chance it and maybe too unhappy users. Is it an acceptable margin of error? It's better to just not back though data inside a transaction. Lot of advisers don't do it. Right if only everything was so easy. But hold on a second. We like transactions. They have all those nice acid properties that we want when we're manipulating a database. We don't
want to get into an inconsistent State. We don't want connections to start conflicting with each other that makes her an unstable system and that's just not going to work. So don't notice says don't back down data inside a transaction. What we're going to do? Is first oldest line out of the migration because we're inside a transaction. We like that we want our column changes to happen in a safe manner. And we're going to write a completely separate migration to mark all of our users is active
that runs after. The first line here that you see disabled ddl transaction at detailed a description later disables that Global transaction that I mentioned earlier. So it's implicitly enabled but you do have the option to explicitly opt out of that behavior while you're running a particular migration. then within our up method Rather than marking, every single user is active inside a transaction, which is dangerous and rather than working. Every single user is active outside
of the transaction, which is dangerous for different reasons. What we're doing here is integrating through our users in batches and wrapping each individual batch inside of a transaction. Baptized defaults. Mm. Of course. It's configurable based on your individual needs but the key here is that this transaction that's updating a thousand Rose is going to complete and commit much faster than a transaction updating 10 million Rose orders of magnitude faster believe it or not. So that changes your lag time from minutes where your entire user base can't
login or use your software to the order of seconds or even less while an individual subset of users might receive a slightly delayed response which in almost all circumstances is an extremely extremely good trade-off and your users soothing your users. Most likely won't even notice that anything happened. So our rule of thumb here is don't mix schema and data changes because you're going to wind up getting lumped in this transaction and that's going to delay your updates cause extensive locks to be held
for long periods of time and generally a bottleneck your system at the database. Don't forget to put those updates back inside a transaction just at a scale that works for you and works for your system and isn't going to block for extended periods of time. Now, we finally got our column added. We finally got all the values put into it. We only cost to production outages pretty good week, right? And now we probably want to be able to look up which of our users are
active or not. So we're going to add an index. You'll note that this says for postgres only and that's because the internal implementation of indexing in my SQL doesn't have this particular failure failure mode that we're about to discuss again not going to talk about how different meta indexes point two different couples with pointed actual dislocations suffice it to say that postgres has a slightly more complex indexing strategy that has a particular Edge case associated with it.
So we have are very simple migration here or adding an index to the active column on users. Totally straight forward nothing could go wrong. Course you've been in stock long enough to know that's not why I put that up on the screen. In addition to running that inside a transaction adding an index in postgres will by default obtain one of those nasty table level locks that we described earlier. Wow, it read all of the data in the table and write it all to the index in one Fell Swoop
and just like you are locked from updating an individual row while postgres is trying to write an index. You can't make changes that would affect the way that index is written such as updating a different row and so basically Your database state is Frozen by postgres until the index is calculated in written at which point you can then modify rose again. fortunately This is generally realize there's a bad thing even by the folks to maintain push press and they added a very simple algorithm
that can create a new index and write it to disk without blocking reason rights. So in addition to disabling that transaction at the top just like we talked about earlier we're going to apply the algorithm concurrently to our index which tells postgres to add it in the non-blocking manner. simplest thing in the world Nope, as I said prescription only if you put algorithm can currently on your migration and you're using my SQL. This will fail with an error you won't break anything but
It won't work either so. Don't try it or do you try it? your database but whatever you do. Do I add postgres indexes concurrently, it has a little bit of overhead, but it avoids massive table level locking that will generally cause problems with your application. All right. So now that we talked about several potential failure modes and things that you shouldn't do. Now it's time to talk about a time when I did those things that you shouldn't do. So at Weedmaps, we have a pretty significant audit trail of operations that people are doing in the system makes sense. Right? We want to
know when people are logging in who's making updates to White. Compliance security General goodness, and what that looks like at a very high level is we write actual changes to the model to our main postgres database and then we write an audit record for whatever changed what operation happened to create update or delete and the properties of the model that were changed and we write that separate database. You can easily write it to a separate table in the same database. That's an architecture decision based on size of your data set and scaling
needs, but I will say I'm very excited for real sex in the movie DB support cuz there's a lot of hacker that we had to do to make this work. But that's neither here nor there. The specific case that we were trying to address is that we needed to update a compound index on the audit table. You haven't seen a compound indexed its same as a simple index is just on two columns. In this case. It was a polymorphic Association, and we wanted to index on both the type and the ID Colin because that's how we're going to be clearing. So what do we do next
weekend the new index concurrently, of course. No locks, no transactions didn't make any obvious mistakes with the migration? What could possibly go wrong? a lot Ultimately, we want to taking down our entire production system for about 45 minutes, which is about 45 minutes in the middle of the afternoon. As people are getting off work if people are using our service to try to find canvas so bad time to have a massive production outage. But what happened? If we look at the migration, it looks pretty sane. We're not in a transaction.
We're adding that index concurrently. What could possibly have gone wrong? Why do we take down production with this indexing change? I don't know that we have to talk about how concurrency works now. There's going to be some math on this next slide. That's the math. Don't worry. We'll walkthrough. Little law is very simple equation that governs the relationship between concurrency throughput and response time which ultimately controls how many requests your system and your application is able to handle within a given time frame.
For example, if you have a hundred requests per second, and you're averaging of 40 millisecond response for that request for those requests, I should say that means that at any given point in time you have for request being served by your application. is also works in Reverse where if you have 4 threads in Puma and your average response time is 40 milliseconds. You can calculate and know that you can serve up to a theoretical maximum of 100 requests per second. Note the relationship between response
time and concurrency here. If I add a zero to your response time. You also end up adding a zero to your concurrent requests. and I don't know how many of y'all run 40 or more servers. But what I do know is that scaling up from 4 to 40 is not something that you can do very quickly and possibly won't expect to do it all. So little fly helps us determine the theoretical maximum level of concurrency that your application can support at any given time if you use a p.m.
If you've heard of little while before if you've applied it to your own application, you might be thinking in terms of rails requests and responses, but your database obeys the same principles. It's a service. It has a connection pool. It takes requests from your rails app, and it generates responses in response. Where you can run into problems here is when request start queuing because they arrive faster than your application or its database can response to
them. If a database operation blocks many requests for a long time, you'll wind up saturating your connection pool. You won't be able to serve you will be able to access your database. First of all usually. Causes problems for your application. So you probably won't be able to serve any requests and the entire system will grind to a halt probably at the worst possible time when you're at a conference and no one's on call. So Why was this a problem for audits? Well, that's because we didn't understand what was happening when those audit
records were being written. So now an auditing overview but accurate. First we are right the actual changes to the database, but II. We have a little little tiny read we're here just to see what version that record is at. So we're able to subsequently write the next version to the audit database. So rather than being a right only append log that we thought we had it turns out that every Every audited database operation was actually generating a reed in addition to the right on that that audit database. Rather than being
99% rights unless user happens to want to look at the audit Trail for their listing. We were actually about half and half reads rights. and that causes problems because why do you index things fast reads? What did we do? We dropped an index and that changed our query performance on the audit database from Several milliseconds generally acceptable to several seconds generally not acceptable. And if you remember the littles law multiplier earlier that changed our
level of concurrency / the number of requests that we can handle by several orders of magnitude. So rather than being able to handle our 10,000 RPM production traffic at that particular time, we wound up with enough bandwidth to handle 10 or p.m. So what that looks like with some more pretty animations is when we have our Auto scaling group. We have our multi-region AZ cool stuff. Is it true for a single deployment as well? But I figured I might as well give
you an accurate representation of Weedmaps. We have these different rails applications talking to the database and request start coming in there trying to generate that audit Trail as they've been programmed to do but queries that were previously quite fast are now quite slow and so once all of these requests come in there are all trying to access that audit database. You wind up with your connection. Serrated all waiting for a response from postgres. And none of your applications can do anything else because
all these requests are hung open waiting for a response from the database server. And then all the other requests come in and completely take everything down. Note that are particular example relates to our audit database in our audit Trail, but you don't have to be writing an audit to be affected by this any long-running query. Can mentally saturate your database connection pool on any table at any time? And if that happened all of your queers are going to be stuck waiting for those
long-running queries to complete. so what are you do you do test your database performance? You need to understand the performance characteristics of your application and understand how they change during and after a migration. And you need to do this on a regular basis. So you know what your application is doing if we had identified the audit reads as a significant portion of our database traffic before that migration ran. We likely would not have written it and run it in the
manner that we did. And even though queer performance was fine before the migration and it would have been fine after the migration. Once that concurrent index was re-added. So slow queries during the migration caused the cascading failure that took down our entire application. so what's up, bro? How you can do that? Will tools and resources to help keep your database healthy while still allowing you to make schema changes. first couple gems thirst and static analysis tools that will warn in advance about
certain unsafe migrations the things that I've covered here and the things that haven't been covered here changing column type dropping a column adding a Json column to a postgres database. There's a lot of things that are unsafe that I just don't have time to cover. But in general you want to catch problems at Dev time not deployed time. If you write safe migrations, you'll run safe migrations. So I advise checking out one of these to Jem strong migration zero downtime migrations, possibly check out both of them. They hook into the migrate DB
migrate Rick task and will throwing error if your migration is written in an unsafe manner. Which one works better for you will likely depend on your specific database engine version schema what you're doing there don't exactly overlap. They both cover some slightly different things and it's not technically a dumb but I did want to throw it in there that get lab actually has a really really cool powerful migration helper system that they've built in that handles a lot of these things and more which unfortunately is not abstracted out into a
gem. So it's difficult to pull in and using your own project, but it's still worth checking out because they give a very good detailed explanation of these things that I've talked about here as well. second APM I think you could probably throw the slide in any talk about any application in any language but understanding your performance is absolutely critical to understanding how migrations are going to change it there all sorts of options for this New Relic Skylight scout app signal many others. I hope all
of you are using one already. but if not Maybe start by checking out the ones that sponsor this conference. but the most important tool in your toolbox is you you as a developer are going to be your own best resource and not doing dangerous things static analysis can only go so far and as we saw in the audit case study, you can take down your database in a manner that the software itself would be completely unable to detect because it doesn't know how your queries work and it doesn't know what the relative
performance of them will be. Need to keep up with changes in your database engine both knowing that there are changes knowing what in those changes and keeping your database up to date. I wish we were on postgres 11, but we're not. Practice these things in a low-stakes environment. You don't want to have the first time you write a complicated safe multi-stage migration to be when you're running that migration on a multi-million Road table in production. There's nothing
specific or special about these techniques that don't apply to smaller tables and I advise implementing this even in situations where you don't think you need because when you write safe migrations by default, you never have that cognitive overhead of man. How big is my table? Does this one need to disable the transaction or can I run in the transaction? You don't want to spend time thinking about it. Just do it safely all the time and you won't have to worry. It's always better to listen to that nagging little voice in your head. Did I
do this? Right? Is this safe? Am I going to take down fraud? I would always rather have someone prove the two migration is safe or challenge me to prove that immigration is safe than assume. It's safe by default. Right up until that little adrenaline hit or large adrenaline hit when all of your metrics go into the toilet on a deploy because you got a bad migration running. quick little recap Commit This to your memory donut Collins with a default value don't
backfill data inside a transaction a transaction use many smaller ones. Don't make scheme and data changes in the same migration because you wind up with that Global transaction that you need to avoid. Do you add your postgres index is concurrently? And do Monitor and tester database performance before during and after migrations test all the freaking time. and remember uptime begins at home Thank you.
Buy this talk
Access to all the recordings of the event
Buy this video
With ConferenceCast.tv, you get access to our library of the world's best conference talks.