Duration 36:42
16+
Play
Video

RailsConf 2019 - When it all goes Wrong (with Postgres) by Will Leinweber

Will Leinweber
Principal Member of the Technical Staff at Heroku/Salesforce
  • Video
  • Table of contents
  • Video
RailsConf 2019
May 1, 2019, Minneapolis, USA
RailsConf 2019
Request Q&A
Video
RailsConf 2019 - When it all goes Wrong (with Postgres) by Will Leinweber
Available
In cart
Free
Free
Free
Free
Free
Free
Add to favorites
1.18 K
I like 0
I dislike 0
Available
In cart
Free
Free
Free
Free
Free
Free
  • Description
  • Transcript
  • Discussion

About speaker

Will Leinweber
Principal Member of the Technical Staff at Heroku/Salesforce

About the talk

RailsConf 2019 - When it all goes Wrong (with Postgres) by Will Leinweber


Your phone wakes you up in the middle of the night. Your app is down and you're on call to fix it. Eventually you track it down to "something with the db," but what's wrong exactly? And of course, you're sure that nothing changed recently…

Knowing what to fix, and even where to start looking, is a skill that takes a while to develop. Especially since Postgres normally works very well most of the time, not giving you get practice!

In this talk, you'll learn not only the common failure cases and how to fix them, but also how to quickly figuring out what's wrong in the first place.

Share

Okay. Hello everyone. Welcome to my talk about when it all goes wrong with postgres just really briefly a little bit about me. My name is William Wilbur on my part of sight of Stater which I recently became a part of Microsoft. My website is bigfish.com. If you do happen to go there right now put your computer on mute. If it does autoplay middies and part of the The Fighter the middle part is going to be a little bit dense. So I already put the slides up and so don't worry about like taking pictures of some of the part that gets a little bit complicated

besides are already met up there and I'll tweet a link to it afterwards to And so first though just a little bit of my back on where I'm coming from is currently at work on outside of cloud, which is a managed version of Silas and open source extension to postgres that turns it into a multi no distributed database and then before that I was at prokupecz crust and so you're right both of these companies. I've been you're helping people with their applications that run on postgres providing postgres running postgres for people and I've

been doing this for a long time. I still like to think of myself as an application developer first rather than like a DBA And I also as part of this talk a lot of the help and knowledge and insight came from my current colleague at situs Dan Farina. And one of my previous colleagues at least with Roku secreta that happens over is over and over again when you're running post if the people you start seeing the same sorts of problems some of it coming from pages and alerts that come to us first and a lot of it also from support tickets were

the database is working fine, but the customer still seeing some sort of problem like the problem didn't trigger an alert for us, but still something is wrong and that also just talk I is the audience is intended for what I imagine many people like you hear who happened, you know, what happened to use postgres but aren't necessarily postgres experts are rather than like if everyone here was BBA full times. This is the problem with postgres though is that it's pretty good. Then that's what I mean by that. I mean that that's the

sort of like, what's your weakest thing with your weakest attributes for a job exhibition? Oh, I just try too hard. No, but the thing is because it is pretty good. If you just are running like one post-credits Oregon tattoo posters for your applications. You're not going to get experience with all the different ways to can brakes. And so every time that if you when you do run into it, it'll be the first time for you. And so hopefully with this talk I can take all the different ways. I've seen things go wrong and give you a starting point. So maybe you won't memorize exactly what to do, but

you'll you'll be able to know where to go to look for four things. So what do you do? If you have a problem with postgres? Well, you type it in and you go to whatever the first stack Overflow thing is in blindly follow this advice no matter what Now suck it up to actually listen to advice here is that's actually bad don't don't kill the other postmaster No, really. What happens is you put it in and you see some mailing this post from Tom Lane and that you have to read a bunch of things. But that's also not not the

case. The problem with pus Chris's it's a complicated system cuz it's not just postgres itself. You have the operating system. You have the hardware that it's running on your the network may or may not be reliable and then you know, you finally get postgres itself and the problem could be in any one of these spots but there is one bigger problem that is really going to take the most common problem and that is using the database too much especially when it's an all of a sudden problem. Like everything's been fine and

all of a sudden you're the problem 99% of the time I may not need it. But sit some large percent of the time it is just the database is being overworked. Like 95% a time I think is the application is causing the problem with not necessarily postgres itself about 4% of the time is a future in post called Auto vacuum, which I can go into details. If you're interested after the talk, it's a little bit too nuanced to talk about right here. And then 1% of time is like everything else and the problem is with the auto vacuum is

interesting and though but the 1% cases those are like the really interesting things. That's what everyone writes blog post about or at post press conference is give the entire talks about business 1% of problems. But most of the time that I've seen when something goes wrong in postgres, it's actually result of the application and these percentages. I don't have a card numbers on these are just you know, approximate guess numbers, but it is hard to convince people especially customers who are writing a support ticket to me so I can okay. Everything's wrong. All my grass is saying that the

database problem and of course, I didn't change anything in my application. It must be the database who here just by seeing the picture knows this story. Okay, that's good that it's on but not everyone so you don't have to do some people you'll be bored. I'm sorry, but everyone else at this is an interesting story that I like to share when it's when it's appropriate and so during during World War II planes would come out and they would you know, some of them would get shot down and some of them would come back and when they were looking for places

like where can we put extra armor on the plane? They were looking at where it got shot. Now, if you you just look at that you be like obviously the places where the bullet holes are that's where we should put armor. They actually at the exact wrong thing. It's cuz which the planes that came back with the ones that survived and so when you get when you get shot like in the middle part of the wings or in the tail there, that's what you need to armor cuz those are the ones that aren't coming back. And so this, you know, I didn't change anything that you're sort of it's it's sort of being

misled a little bit and I didn't do any deploys that you know, I didn't do a migration, Mexico. Nike Nike still have the same number web service haven't scaled-up. I know this is often a very big red herring because what if you know, you got more traffic than your application normally gets or maybe the people who are logged into your side or doing something. They normally don't do or often the case. You have one big user that just logged in and they have you do 10 times as much data as everyone else in your system. And they're there is no causing

all these changes and access patterns in your out that even though you didn't change anything something has changed. And so what happens often, did you start running out of a particular resource on your database going to be a memory or CPU or what have you and if you were already close to the edge of the performance envelope. That's when you know, it's going to snowball and then start causing real problems. And especially because if your queries that

used to take, you know, very small amount of time that you have a manageable user query that you know takes 1 seconds. Which which is for Davis Square One Second is pretty long, but if it only happens every once in a while, maybe that's fine. But if it's suddenly because of these other resources that it got exhausted starts taking twice as long that can cause a big problem on the flip side. If you have a very frequent query that is running very fast through middle seconds, and it suddenly goes to 12 those small clear can be a downstream of that big crate that just

got twice as expensive. And even though most of your crews are the small ones you be like a man that got four times as expensive. It's actually not that the cause and so what I'm going to try to do this talk is how you can figure out what the actual causes he can get in to address it. Abba before we get into the details, there's some assumptions that I want to make that if these aren't true you have to you kind of have to address them first. And that is that the hardware that your database is running on is good. You look like you're high quality components.

If if that's not the case like that's going to cause problems that are going to be out of the scope for this talk on top of that. This is sort of like kind of like a Maslow's hierarchy of needs but for database servers so hard where you want that at the bottom maintenance, you know, you won't have a good backups. You want to have you know, everything to the packages up to date so on and so forth, and then finally that your app is in a good State and that's where we're going to going to look at. The other assumption is that a person should not be crashing. So

if one of the problems are seeing is actually like hard crashes with postgres and their meeting Cordon dad at that is a separate issue. Do you know Siri something else is wrong in that again is going to be out of the scope for this talk, but that's when you might want to go get help from like an expert in that is posted shouldn't cash as long as you have overcome it off. Now this is sort of a nuanced point that I think it gets missed a lot. So that's why I want to share it with you over-commit is on by

default in Linux and most operating systems and what that is is when an application asks for memory, you can ask for more memory than you're going to use which is often the case and the operating system will give it to you and when you actually go to use it if there's not enough memory then the the OMG out of memory killer will kill your process most of the time for most application. This is fine. They're they're going to ask for more memory than they were going to use everyone kind of does this and it's fine but postgres is built in such a way that it actually checks the return status

when it tries to get memory and if it can't remember it's not you know, it's a problem but it gets when he can manage it can take care and kill queries that are using too much memory while keeping the whole system online. But that is only if you have overcome it turned off if it's on the other memory killer will kill postgres. You'll have to go in after crash and restart up and then that that's a problem and the other thing though is, you know containers everyone's excited about these days that they have a separate problem. We're even if you have overcome it off, but you're running

postgres inside a container be a doctor or just regular lxt. The sea the sea memory container in a galaxy doesn't do that bookkeeping and so it's not quite the same as overcommit on but it it causes the same sorts of problems. And so if you are running postgres inside a container, you're going to get these weird crashes and its don't and then the other the other thing is if you do have some of the larger extensions in postgres, those are also known to cause stomach rashes. So

that's a place to look also. Okay. So I've spent a lot of time telling you post. It's probably not postgres. That's the problem. So then what so we can reframe the question a little bit as how can Post-Crescent Linux help you diagnosed with the problem is And so if you think about your server as a number of resources, you have the CPU like how much work the processor is doing how much memory is being used the disk IO how much is being written in red from the discs parallelism? Like how many things are happening at the

same time with postgres and becomes post R Us. Every time you make a connection to it it Forks off a background process it. A process to handle that connection. You can approximate the parallelism that plays Chris is doing with the amount about guns that are being used. And then also what blocks that postgres is taken. 4 74 lakh we're only going to talk about that when it can be a problem. But if we're going to take a deep look at these for parts and when they are being highly used what

what's may be causing that and so we're going to put them across the top and they're going to turn on and off and we can talk about the different combinations of this and this is where it's going to get like a lot of information all at once and I'm sorry, but that's why I posted the slides ahead of time and so don't worry, if you know the individual details of this part just more thing about the patterns and what and then the answer to this the shape of things. And so if you're having a problem with postgres and you you look at at the

at the what's going on, and the CPU isn't I the memory isn't High the disc isn't being used and there's not a lot of parallels in system but there's still a problem. And so this could be something easy. Like the credentials are wrong that maybe you recently did a credential rotation the network a man could be interrupted or if there are a lot of locks one lakh could be blocking everything not allowing a progress. This could be like I make migration. It's hasn't been committed yet. And if something like that also take a look if you look at what's going on in postgres, and there's these

idle in transactions what that means if someone opened a transaction but didn't commit it. And then that that can also cause problems that those are the things to look out for that. Now if we turn on parallelism, so there's a lot of backens. What's Happening Here is probably that the application is submitting a lot of work that's not getting processed. And so the application keeps opening up more and more connections offense could be a connection leak. If your web server or your background you like using one of the four king threaded web servers and you didn't and it's not

closing it after the connect other connection goes away or if you have a background worker that is spawning a lot of back on jobs. Like these connections can be leaking if you are using a pool or maybe it's the pool sizes too large and one that brought that fly bring that down and if if there are locks that it could be like someone could have been taking a lock. Similar in the last case, but the application is written in such a way that it's dumping more and more connections on. Murphy switch a little bit and you see that there's a lot of IO coming in out of the system, but there

aren't a lot of back ends. This could be the work load that your application is causing like someone logged in that is reading a lot of old data from the desk. And so things that used to be in cash now or no longer in cash. And so it's thrashing stuff out of disk into the couch and then have to evacuate that and get more things out more common for this case. I see is if your database recently did a failover, you know, if you're using some sort of automated system, I'll be at in a hook bus routes or anyting and their recently something happened and they failed you over the caches

are now dead. And so it has to reload all of them. And that's so that things will be slow until that happens more, though is the disk is being used a lot and there's a lot of Atkins and so this is the same as the previous one, but Cuz everything is slow because it's not in cash the applications of piling on more and more work and until things can get fast in cash again, and this is one of the problems I would like Auto scalars if your web application has an autoscaler on it and everything slow, but it's because of this problem you're going to add more

and more web servers are making more more connections and it's not going to actually make anything faster. And so y'all just those are cool sometimes. Now if we switch to memory being high, but the disc isn't being used a lot and there aren't a lot of back and there's not a lot of connections. This can be if your queries are doing lots of large group. And so it's a do the group I have to pull all of that they tend to memory and then and then group it on what its will you pull the trigger by and that that's usually what it is. It can also be

if the the where the data is written into the pages if there if it's not compact enough in and how to access all of that that can that can cause some problems. Normally though. It's it's got you you're going to look want to look for your cruise or doing group buys. again, this is are one of the similar patterns that you see when you're looking at these different attributes on when the parallelism kicks on this is again, like, you know, the application usually against submitting more work because things are slow now if So

if that large memory put them cuz of the group eyes are doing that in the applications adding backlog that can be the problem and also in this is one of the cases if there are locks on the locks could be holding onto datasets longer than the normal and then so it does the system can't release that memory to then use it for something else. So this is one of the cases where you want to if you're seeing this sort of memory and parallelism. This is what you might want to check for locks if it's one of those things. I got no more than these patterns. This is this one. I don't see this one that

often and so this is not as interesting but it is part of the the patterns of the things turning on and off so I had to include it now this one this is more interesting. So if if the CPU if everything else is being used a lot so a lot of memory a lot of disk a lot of back ends, but for some reason the CPU isn't being high, but what happened is happening is something is causing a lot of disk access and all of his other problems assertive Downstream from that so that the backend Spike and the the memories Spike don't these are more of a symptom of something causing a lot of

disk IO and so you want to look at queries that are maybe doing a lot of sequential scams or if it's one of those things where it's someone logged in. It's causing old data to be red. And now we finally gets you in the CPUs. Hi, this is typically if you have a very small workload that small dataset but it's being accessed a lot. So everything can be fit into memory. And so you don't get any of these other problems, but it could be like a lot of sequential scans on a very small table and that way it's in memory. So you don't see even though the

sequential scan. You're not seeing the disk access. Again, our friend parallelism. This means the application is sending a lot of of backlog because of it. I'll try to leave this could be just simply a lot more work like your application just sending a lot more data that you're a lot more than you're expecting and this is just causing, you know, a lot of work to be done where it normally wouldn't have been. If CPU is high and discus hi, this is going to be large sequential scans run amok so the discs could beat maybe of a system where the

discs are very fast and it's just sending the date of the post goes faster than the back ends are reading it and so the TPS to do a lot of work just serializing the date of sending it over the network and those sorts of things again parallelism. This is a symptom of the other things in this case. CPU and memory being high but the discs are being used and there's not the back ends with the memory. Hi this again. I want to look at look for those large group has and this can often be if the results that can be fully buffered before it sent that's causing the

CPU to do a lot of work. This is a case where maybe you want to look at your network connection and see if if CPU and memory is high and what your network connection is sending a lot of data then, you know, your application just may be requesting much larger iresults. That's the normal. I'm so this one. Everything is consumed except for disc. This is maybe one of the most important things this what I mean here, this is entity workload versus entity X the workload. Let's say you have a user like user 7

and the the that's the entity and their workload is lifting their friends. Like if you have some sort of application that user can have friends and if you simply look at your your application Level metrics you can say you don't have a normal amount of users logged in my API calls to the list friends framework extends controller is fine. But what's happening is maybe this is the case where users 7 has in a millions and millions of friends. Maybe they did something where they just clicked you do a different for everyone in your whole system. And

so when they login it's causing a weird problem that if you just look at the parts by themselves, you're not going to So went when this is the case, it's usually the CPU is high. There's a lot of memory being used and then the application is piling on that guns. The other important thing is if your application a lot of people will add in instead of actually deleting data, they'll add look at deleted at timestamp to the database and then just not and you know, just to do with a soft deletes this can cause a lot of problems because the windows other user with

a lot of soft deletes if the indexes that you have on that user one of the things postgres can do is you can have an index on a column but you can have an index on a column that excludes, you know where the deleted at column isn't know and so if you this is one of the things that you active records not going to do for you by default you can have to go in and write these indexes manually so you can actually say in a Creed and Exxon user ID. Where are deleted at is Is no and that way it'll keep the index of small and you'll avoid this problem.

On the flip side if the Peril if you have any other problems, but this is being high in the parallelism is off. This is most likely where someone is running some sort of reporting query on your database and that's blocking the short transaction things that your application wants to do. If you start seeing this pretty pretty frequently what you're going to want to do is set up a follower database and put the recording freezes over on that. So you can keep your main transactional database nice and fast and finally if everything is hi,

this is probably the application filing on work that cuz we have the parallelism. Hi, I just have CPU problems and then touch So I know that was a lot. I'm sorry, but I had the idea for those those things going on. And then I was committed to having to do all of them. So decide to be posted that you can go and use it as a reference. But now that we know the thing the type of things that can go wrong. What are the tools of the trade of hot? How do you figure out that those are the things and one of the things that I shied away from there? I'm primarily Ruby developer,

but I shied away from it for a long time is tools that report out see symbols. I am being you're not trained in CS I never really worked with Steve for a long time and you know someone I would use it to look perfect over the vlbc sounds like a lot of kind of scary but but I'm telling you actually it's it's not that bad. It's we're going to go through some of these and this can give you a very big insight into what post office is doing and if you look at it now while your databases healthy you can start getting

intuitions of when things look a little bit wrong and so the tool perf if you prefer record and then pass it through the process identifier of postgres and then look at the report. This is like an interactive thing and you can see here. I'm looking at the postmaster and this one is not really doing much. We are you just waiting for connections? It's not really doing a whole lot and you can see in a most of the time here that the 6.9% is just just waiting in the colonel for something to happen. You can also do perf top and this is like

a top for CPU stuff but for proof and this will give us a bounce around in real time and show you what is actually using a lot of CPU at this at the current time. And where in postgres it's using it again here. It's not this one isn't really doing anything interesting at this moment. And this site right here is a great reference of all the little things you can do with Perth and I really recommend checking out this I got to decide to press that you can the building could be there. But similarly GDB

debugger 204c programs. This is a story about a One-Shot it if you type g2e by itself, it's interactive thing. This will do a One-Shot thing of tracing the back trace of what's going on at the BTS back-trace. What post office is doing at the current time and you can see here in this case. It's doing weight events. Wait event start Weight block and that's so just waiting for something to happen. This is a back-end doing nothing. In this case though. I was craving a big index while I took it and you can see

here. There's underscore BBT, so it's for Beatrice and you can see that it's writing some pages and then down there a little bit lower. It's it's doing an index build and so if you like just poke around in like wired like running a query or while you're building in Text yourself on your local machine. You can look at this and you can start getting a feel for what's going on. And so like if you have a query that stalker blocked, you know, don't be afraid to do this and you can kind of get a piece together what's going on in like with the symbols like I you can see her next. Like it is

named, you know, something, you know, approaching human-readable stuff. So it's not as scary. I put off doing this myself for so long. So I encourage all of you you tried. It's not it's not so bad. Another great tool is iostat. And so this one if you do exes for extended in Poughkeepsie more things and gives it in megabytes a second instead of some other unit. And then the 10 what this will do is record for 10 seconds and then show you what all your discs are doing. And so this case you can see am I writing to the Discovery am I reading from it

a lot and you know, those can be different problems depending on what you know, this is useful. Also if you have your write ahead log drive on postgres set two different discs, you can see maybe I'm generating a lot of wall turn and like that can beat that can help guide you to where the problem is. The other thing I owe top again, like a perv top of regular top. This will show you in real-time. It'll Bounce Around what processes are actually using a lot of disk and you can see here. I caught my database here doing an insert. And so the insert one is actually, you know, doing a lot

of disk drives and then the one that's Idol is doing much less. And then of course I H top if this is like regular topping machine, but it's a shop is a lot nicer. It'll show you more color is it will show you what which CPUs are being pegged and stuff again. This is your your run it when everything is healthy so you can get a feel for that and then when you're when you start looking at it when it's bad that gives you some insight to what's going on. And then b w n and G. I don't know what that stands for but it is the one that will show you what your

network cards are doing and so you can see here if you have maybe one device is your your network interface for connections. And other one is for if you have like a network attached storage. I you can see that sort of thing and then looking at the back ends of that type. He grew up is a sort of a combination of PS and grew up and just you're doing that and just doing your word count instead of seeing how many things are going on. This gives you a like a rough ballpark of how many Batman's are connected. And then if you actually connect to postgres TV static tivity will show you

what each back and is currently what the query currently is. And then. Stopped statements. This is a great to us an extension. This one will show you a query that you've ran in the past. It'll summarize them take out the constants put him a question marks and then give you a lot of insight there until summer of these tools. This is more for if you download the slides later as a reference of if you're looking for a pair Elizabeth looking for Network, which tool can help you there. And I'm so what would he notice? That's all great. We we know what can go wrong me know how to find it. What

what do you do? So hopefully the tools that will help you track on the cause maybe you need to do a configuration change your maybe one of your customers one of your users is getting ripped guy too much are connect to IP ban them for a little bit and you know some sort of rate limiting hopefully already have rate limiting in place cuz adding that when everything's on fire is challenging so think about putting everything in place, even if you don't have problems yet. A lot of times you need to ask you if your database to move it to bigger Hardware

get more RAM gets, you know faster discs. It's easy to say it's easy to move up one of the problems. I was if you are over provisioned, it's really hard to say if it's safe to move down. There's not really It is an adjuster on unfortunate truth and often. It's like I can tell you hate you need to get bigger than just want to charge me more than I like you're running out of resources. There's no nothing to do about that and finally code changes, you know, maybe one of these cases with the application backlog is

10 + 1 queries in all the sorts of things that you do rails and some of the tools can help you with your nephew read receipt you go to a page and see that it's generating a bunch of queries, you know, maybe going to do some code changes to theater at some indexes do last request less data, or maybe do some joins to get it to be, you know, last total queries. And what are the things that also when I was reminded of when I was putting the stock together that I want to end with? Is a sock that I saw it was a

short like 10 minutes Octonauts on YouTube from velocity 2013. And this person based it on this other paper by John Rasmussen and really answered a lot of questions for me of why why are things always breaking all the time as I wanted to end with that it sort of related. But the most think it's interesting and like to share it when I can and so if you think about Your system of your company as having a couple different boundaries and one of them is the economic boundary where if you go over that your company

runs out of out of money and goes out of business. And so there's a strong gradient push away from that. And so if you start going towards that just the the the business will push push that away if you're not going to go through that you're not going to go across that boundary. Another boundary is the workload. You can only have people work, you know, there's only so many hours in a day. You can only hire so many people before it gets unmanageable. And so that's another boundary that it gets pushed away from And so it was the last boundary is the performance boundary. And so this is

when you go over that that's when everything falls apart give a big outage. The problem is there's nothing naturally pushing away from this and so do you have to get your applications will always drift towards this problem and answer what happens is the or the other problem is you don't know where it is until you go over it. And so we do is we put this error margin up above where like, you know, no matter what, you know, we're not going to let our our percent 95 of our requests go over one second or whatever, you know, this word severe margins,

but what happens is is Overtime you get closer and closer to it. And you know, you eventually go over it and all the alarms go off the pages go off and in a push it back but what happens if you keep doing that and don't you think everything was fine. Let's just relax. That's just push it over here. And you know overtime the the are Marching gets closer and closer to the actual part where everything is going to fall over and you go over the line and then you have a big outage. I like that like that fire animation has to write in an

apology to everyone you put your post a public post more than everyone everyone's upset. You know, we're never going to do this again. I don't care how much it cost. I don't care what she was going to go all the way across the boundaries and never happen again, but what happens at courses in your later, you're all sitting around like a rubber that that out advice. It's pretty bad. Hopefully, it'll never happen again and if there is more details I got was a I thought that was a really interesting way to think about why this or why this keeps happening forever.

So I just wanted to and end with that and thank you very much.

Cackle comments for the website

Buy this talk

Access to the talk “RailsConf 2019 - When it all goes Wrong (with Postgres) by Will Leinweber”
Available
In cart
Free
Free
Free
Free
Free
Free

Access to all the recordings of the event

Get access to all videos “RailsConf 2019”
Available
In cart
Free
Free
Free
Free
Free
Free
Ticket

Interested in topic “IT & Technology”?

You might be interested in videos from this event

September 28, 2018
Moscow
16
166
app store, apps, development, google play, mobile, soft

Similar talks

Braulio Carreno
Principal Software Engineer at ActBlue Technical Services
Available
In cart
Free
Free
Free
Free
Free
Free

Buy this video

Video

Access to the talk “RailsConf 2019 - When it all goes Wrong (with Postgres) by Will Leinweber”
Available
In cart
Free
Free
Free
Free
Free
Free

Conference Cast

With ConferenceCast.tv, you get access to our library of the world's best conference talks.

Conference Cast
577 conferences
23231 speakers
8691 hours of content