PGConf India 2020 - Avoiding, Detecting, and Recovering From Data Corruption - Robert Haas - EDB

Robert Haas
VP, Chief Database Scientist at EnterpriseDB
PGConf India, 2020
February 27, 2020, Bengaluru, India

About speaker

Robert Haas
VP, Chief Database Scientist at EnterpriseDB

I am a PostgreSQL major contributor and committer. PostgreSQL features which I have authored or coauthored include unlogged tables, fast-path locking, index-only scans, and parallel query. PostgreSQL features written by others which I have reviewed and committed include partitioning, logical decoding, event triggers, and foreign tables. I work at EnterpriseDB as VP, Chief Database Architect, managing a team which is responsible for both Advanced Server, EnterpriseDB's flagship database product, and for our company's ongoing contributions to PostgreSQL. My team is great, and PostgreSQL is awesome!


About the talk

Hardware or software failure or user error can result in your PostgreSQL database becoming corrupted. In this talk, I'll discuss some things that you can do to minimize the likelihood that your database will become corrupted. In case that doesn't work, or you already have a corrupted database, I'll also discuss signs you may be able to spot that will tell you that your database has become corrupted, and some techniques you can use to try to recover as much of your data as possible.

Transcript

Thank you. I realized when I started to put together my slides for this talk that I had a lot more material than I could actually cover in the 40 minutes that we have today, so this is going to be a fairly brief introduction to this whole topic of database corruption. It's a topic I actually run into a lot, because when you work for a PostgreSQL company, as I do, people tend to come to you when they have a serious problem, and when somebody has a really serious problem, somebody at EnterpriseDB asks me to take a look at it. And so when I went back through my email,

I found that I had tons of different cases of corruption that I've worked on with different customers over the years, and there's so much interesting stuff that we could say about this topic, but we're going to have to keep it short today because of the time limit. So I'm going to start by giving a bit of a definition of corruption. It may seem kind of obvious, but I feel like it's easier to have a conversation about any topic if we're all sure that we know exactly what it is that we're actually talking about. I'm also going to give a quick overview of the

three main causes of corruption, just very broadly speaking. Then I'm going to spend some time talking about some best practices that you should deploy in your environment so that you avoid corruption as much as possible, and so that if it does happen, you will detect it as quickly as possible. Then I'm going to tell you a little bit about signs of trouble, and actually show you some examples of messages that you might get if your database is corrupted. And finally I'm going to talk a little bit about some things that you can do in order to try to recover from the

situation, or try to recover at least some of your data, if corruption does happen to you. I'm always a little afraid when I give talks like this that people will get the impression that corruption in PostgreSQL happens all the time, and I think that's basically not true. But as I said, it does happen to me all the time, because the problems filter toward me on those rare occasions when they do happen. So none of this is intended in any way as a knock on the reliability of PostgreSQL, and I think most people do have a good experience with it. But there are definitely things that

you should do to protect yourself. I'm not sure how we're going to do on time; feel free to raise your hand and ask a question if something is not clear or you need further clarification, but I probably won't be able to take too many questions or we will overrun the allocated time. So let's start by trying to define what we mean by corruption. The best definition that I can come up with is that corruption is when database queries behave in a manner that is severely unexpected, given the SQL statements that I previously executed. Our expectation of a database is that

if you put your data into a database, you're going to be able to get the same data back out of the database, right? Unless there's some reason otherwise: obviously, if you put the data in the database and then you modify it, what you're going to get back out of the database is the modified data, not the original data, we hope. If you modified the data and it was actually changed, and then somehow it went back to being the way that it was before, that would also be corruption, right? When we talk about corruption, we're really talking about things that could make the

database perform in a way that is not expected, given our ideas about what the SQL statements involved actually say. And I think there are basically two kinds of corruption, two sorts of things that happen when corruption occurs. One is that we might put some data into the database and then, when we try to get it back out, we get different data than what we originally stored. That would obviously be very bad, and that would be corruption. The other thing that might happen is that we might not be able to get our data out of the database at all; we could try to access our data and get

some kind of error or crash or other misbehavior that tells us our data is no longer accessible. That would also represent database corruption, or perhaps some bug, but for this talk we're going to focus on the corruption case. So how can this happen? Basically, there are three ways that your database can become corrupted. The first is that you might have bad hardware. For example, you might have a bad disk, or, and we have seen actual examples of this, you might have bad memory that corrupts the data before it ever even reaches your disk. You

might also have bad software. You could have bugs in PostgreSQL, or in your operating system kernel, or in your virtualization layer, or in your file system code. So if you have a bug in any piece of software that is interacting with your database in any way, whether it is the database software itself or something else, that could potentially result in corruption. And I think we probably know from experience that all software has bugs. All software has bugs. And the third way your database can be corrupted

is user error. One that I have seen very, very commonly is faulty backup and recovery procedures; I'll talk a little bit more about that further into the presentation. So you might say, well, if you look at these three causes, generally speaking, which one causes the most problems? It's close; they all come up pretty regularly. But I would say that third one is the most common: more people hose their own database than get hosed by their software or by their hardware.

So that's why it's really important to follow best practices for how you manage the data in your database. It is, after all, critical data, and if you are responsible for the data that is in a database, you need to put in the time and energy to make sure that you're taking care of it in an appropriate manner. I divided the best practices that I want to talk about into basically four categories: backup and restore, configuration, storage, and then administration. So let's start with backup and restore. Here I'm going to say something that all of you have probably heard at some point in

some other talk: please take backups, and take them regularly. There are two reasons why this is really important from the point of view of corruption. Maybe the most obvious one is that if you do experience database corruption and you have a backup, then you can try to get back the data that you lost from your backup, and then you wouldn't actually lose it after all. The other reason why this is really important is that it proves that all of the data blocks in your database can still be read off of the disk. And that's not something that you should take for granted. Disks fail; they fail,

again, from my perspective, pretty frequently, even very, very expensive disks. People say, this will not fail because I paid a lot of money for it. Well, yeah, it still does sometimes. So if you can take a backup and it doesn't fail with an I/O error or anything like that, you don't know for sure that you've still got the original data that you intended to store, but you at least know that something is still readable off that disk, and that's a pretty good thing to know. Of course, it's not enough to take the backup. You also

need to make sure that you can restore the backup; otherwise, the backup is of very little use to you. And you should really verify that the restored backup looks okay. I mean, if you just run the backup-restore script and it spits out a bunch of files and you don't do anything with those files, you don't check whether your data is there, you don't check whether you can run queries against the restored database, then you don't know for sure that it's going to work. So take backups, do test restores, and verify that the backup actually worked.
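
To make that advice concrete, here is a minimal sketch using pg_basebackup; the hostname, user, paths, port, and table name are assumptions for illustration, not specifics from the talk.

    # Take a plain-format base backup of the running server.
    pg_basebackup -h prod-db -U backup_user -D /backups/base_2020_02_27 -Xs -P

    # On a test machine, start the restored copy on a spare port and spot-check it.
    pg_ctl -D /backups/base_2020_02_27 -o "-p 5499" -l /tmp/restore_test.log start
    psql -p 5499 -d mydb -c "SELECT count(*) FROM some_important_table;"
    pg_ctl -D /backups/base_2020_02_27 stop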

Just by doing that much, you will be much more likely to notice if some kind of database corruption has occurred, and you will have some options for getting your data back. The other piece of advice I have about this, and this is probably the most important piece of advice on the slide except for the idea that you should take backups at all: use a good backup tool, not some homegrown script that you created yourself. A lot of people don't do this, and it often works out rather poorly in my experience. Sometimes people don't read the directions that are in the documentation.

Sometimes they read them but they miss something. Sometimes they forget about having error checks in their backup script; it is a very good idea to have error checks in your backup script. Using a professionally written tool does not guarantee that there will be no bugs, because I think probably all the professionally written tools have had some, but it means that somebody is putting a lot more time and effort into that tool than you can probably put into yours, and ideally that person is more of an expert in regard to PostgreSQL than you are. So I'm not going to get into

the business of recommending one tool over another; there's the internet for that, and you can form your own conclusions. There are several good tools out there, but really consider using one of them rather than relying on your own scripting ability. Okay, configuration. This is an area where there are not too many settings that you need to configure, but there are some, and they really have an impact on the reliability of your database in the event that you have an operating system crash. There are really three

settings in postgresql.conf that you need to be aware of. The first one is fsync, the second one is full_page_writes, and the third one is wal_sync_method. For fsync and full_page_writes, the rule is very simple: they should be on. They should be on all the time, on every PostgreSQL installation. I mean, obviously people are going to do other things. Sometimes people turn everything off during an initial load and then they say, well, I'm going to turn it back on afterwards, so it'll be fine. And then, right, except they forget, and then two years later they have a database

crash. Whoops, they have a corrupted database. So, was the time that you saved on that initial load of the database by turning fsync off really enough to compensate you for the fact that two years down the road, or six months down the road, or whatever, you have a corrupted database? I would guess that for most people it's not. Similarly, full_page_writes: this is one thing where many people think they're very clever. They read the description of full-page writes in the documentation and they say, oh, I can set up my system in such a way that I will be able to safely turn

off full-page writes without compromising the reliability of the database. All those people are wrong, at least all of the ones that I've met and had a conversation with about this topic. Don't turn full_page_writes off; you will get a corrupted database if you have an operating system crash. You need to use a safe value for wal_sync_method as well. Unfortunately, some of the operating-system-specific defaults seem not actually to be safe settings. On macOS, you can't use the default setting, which is fsync; you have to use fsync_writethrough. I'm not a

Windows expert, but I've been told that on Windows you either need to use fsync or fsync_writethrough, or disable the write caching on your drive. If you don't do this and you have an operating system crash, you may end up with a corrupted database. My general advice here is: check the output of the pg_test_fsync utility, which is included in every PostgreSQL installation, and see whether your wal_sync_method, the one that you are using on your system, is running like ten times faster than some of the others. If it is, it probably doesn't really work right, because the

whole point of the wal_sync_method, which is what pg_test_fsync is measuring, is that it's the method we use to be sure that the data is actually on disk, and of course everybody wants that to be fast; but if it's too fast, if it's unrealistically fast, it probably means that somebody is lying. And unfortunately, my experience has been that although things have gotten better in this area over the years, there are still a lot of operating systems, a lot of file systems, a lot of device drivers for individual types of hard drives, that lie and claim that the data is securely on disk when it actually isn't. That is a critical issue for PostgreSQL, so check your wal_sync_method.
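
As a quick way to act on that advice, here is a small sketch; the database name and WAL path are assumptions. It checks the three settings just mentioned and runs pg_test_fsync, which ships with PostgreSQL, so you can compare how fast each sync method claims to be on your hardware.

    # Confirm the three settings discussed above.
    psql -d mydb -c "SELECT name, setting FROM pg_settings
                     WHERE name IN ('fsync', 'full_page_writes', 'wal_sync_method');"

    # Compare sync methods on this machine; one that is wildly faster than the
    # rest probably is not actually forcing data to durable storage.
    pg_test_fsync -f /var/lib/pgsql/data/pg_wal/pg_test_fsync.tmp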

Another really good idea, if you're concerned about corruption, is to run with checksums. You're only able to turn this on when you initially create your database cluster using initdb; you need to use the -k option. This will not prevent your database from becoming corrupted, and it will not magically recover your database if it does become corrupted, but it does greatly increase the chances that you will know that your database has become corrupted, because when you turn on checksums, every block in the database now has a checksum, and every access to that block verifies that the checksum is good; if it's not, you get an error. So this is particularly helpful for detecting storage corruption.
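
A sketch of turning checksums on, with the data directory path as an assumption; the -k flag is the one mentioned above.

    # Enable data checksums when the cluster is first created.
    initdb -k -D /var/lib/pgsql/data          # -k is short for --data-checksums

    # Confirm on a running server that checksums are in effect.
    psql -c "SHOW data_checksums;"

    # On PostgreSQL 12 and later, pg_checksums --check -D /var/lib/pgsql/data
    # can also verify the checksums of a cleanly shut down cluster.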

That brings us to the topic of storage. I strongly recommend that you try to make your storage system as simple as possible. The more complicated your storage system is, the more parts of it there are that could be broken at some point. And I really recommend local disks. I know that sounds very

1990s, or 1980s even, but there's really no substitute for a local disk. You still have a number of things that can fail: you have a Linux kernel, you have a file system, you may have some kind of RAID controller, you have the hard drive itself, and all of those components can and do sometimes go wrong. But at least you don't have a network interface card on the local machine and a network interface card on the remote machine and maybe a virtualization layer and some kind of complicated SAN infrastructure. The more complicated you make the storage, the

more ways there are for that storage to be unreliable. It's a very common scenario, and I have seen a lot of people discover that their SAN does not have the reliability semantics that they think it does. I don't have a solution to that problem, because I know that for many people using a SAN may be mandatory under their company policy, and I can't solve that. But I can tell you that it is very often the culprit. People have a crash somewhere in the stack and they're like, well, I'm using

NFS, what's wrong with that? Well, did you configure any of your NFS settings for reliability? Do you even know which settings affect reliability? I've also seen some of those SANs deliver extremely strange behavior that is definitely not per specification: just returning completely weird error codes that aren't supposed to happen, and basically not working in the way that they're supposed to work. So I know this is a problem and I don't have a solution to it, but people who use local disks experience much better reliability than people who are going through

something like NFS or iSCSI, particularly if they're not very familiar with how to set up NFS or iSCSI and don't consider what exact settings they should be using there. I recommend you use RAID, but don't rely on it not to fail, because this whole talk in a certain sense is about failure: every component fails at least some of the time, right? And the way that you get reliability is by having as few components as you can, making those components as reliable individually as you can, and hoping that somewhere along the line

you're able to intercept the problem before it causes you a really serious issue. You still need backups even if you're using RAID, and you still need to do these other things, but one of the most common pieces of a server to fail, in my experience, is the disk. If you use RAID 10, you can get some nice performance benefits out of that, and if one of your disks fails, you'll have a second disk with the same contents, which is a really nice situation to be in. If you experience some kind of data corruption issue, you determine that the drive is bad, you pull it

out, everything still works, and you have your data. I recommend a lot of caution when choosing a filesystem. ext4 is the default on most Linux systems these days, it seems, and that's a good default; ext4 has, in my experience, proven to be very reliable. I also have not heard many reports of trouble with xfs; I think that's generally quite reliable as well. The other Linux file systems that are out there, and some of the file systems that people use on other operating systems, have been more

questionable. And I know someone is about to get very, very angry at what I'm about to say, and I'm sorry, but I have not had good experiences with some of those other filesystems. I know a lot of people love them, but I've had multiple people hit very, very serious corruption problems that cannot be explained by anything other than a filesystem bug, or maybe several. The more you choose a file system that lots and lots of people are using, the better your chances are that it has been thoroughly debugged. There are a lot of file systems out there for Linux, and many of them probably have

very cool features that you might not get with ext4 or xfs, but well-debugged is a huge benefit to you if you're trying to avoid problems, and I've yet to see anybody using either of those two file systems have a corruption problem that can be attributed back to the file system. I can't say that about pretty much every other file system I've heard of somebody using, so take that for what it's worth. The other thing you really need to do in regard to your storage is logs: you need to monitor both your PostgreSQL logs and your

operating system logs. If your storage returns an error to PostgreSQL, like an I/O error, PostgreSQL is going to log that, and you want to notice it. If you've got a lot of messages showing up in your PostgreSQL log file that say I/O error, that's not something you want to miss. But even if nothing gets reported to PostgreSQL, there may still be something in the OS logs, because many systems do emit some kind of early warning that says, well, I'm not failing yet, but things are not going well. You want to notice those kinds of messages, or any hardware failures, showing up in your operating system logs.
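
As a rough illustration of the kind of monitoring being described (the log locations and search patterns are assumptions; adjust them to your own setup):

    # Watch the PostgreSQL server log for storage-related complaints.
    grep -iE 'could not read|could not write|i/o error|checksum' /var/lib/pgsql/data/log/*.log

    # Watch the kernel/storage side for early warnings as well.
    dmesg -T | grep -iE 'i/o error|medium error|ata.*error'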

Again, the goal here is to intercept the problem early. So finally, with regard to administration: I almost feel very silly putting this as the first bullet point on the slide, but experience has taught me that it is an absolutely necessary caveat. You cannot just go into the PostgreSQL data directory and randomly modify, let alone remove, files. And people do it all the time: they go into the pg_wal directory and they're like, well, some of these WAL files look like they're pretty old,

or maybe all of them look like they're pretty old, I think I'll just remove them, right? Well, what you have to understand is that every file in the PostgreSQL data directory, except for a couple of text configuration files like postgresql.conf and pg_hba.conf, is intended to be managed by the system. The system wants those files gone just as much as you do, and it has concluded that it would be unsafe to remove them; if it thought they were worthless, they'd already be gone, right? So you need to look at your database and your database logs and figure out why the database thinks it still needs all of those files, not just decide that you know better than the database.
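
For the common case of a pg_wal directory that keeps growing, here is a sketch of what "figure out why the database still needs those files" can look like; nothing here removes anything, and the queries use standard catalog views.

    # An inactive or abandoned replication slot will pin WAL indefinitely.
    psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"

    # A failing archive_command also prevents old WAL from being recycled.
    psql -c "SELECT failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver;"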

If you remove them anyway, you will have a very fast trip to the department of corruption. The problem is even worse if you decide to remove the files from pg_xact or pg_clog. I don't know how many people in the room know what those files do, but it's very bad if you remove them, and it has happened countless times. In fact, one of the reasons why these directories got renamed from the old

names of pg_xlog and pg_clog to the new names of pg_wal and pg_xact is because the PostgreSQL developers were hoping that if we didn't have "log" anywhere in the name, people would be less likely to think that this was something they could just blow away. Regardless, don't blow away any files in the data directory; you need to get the database to blow away the stuff you don't want to be there anymore. You can't do it yourself. And also, don't modify files. Part of this is: don't run antivirus software. If you absolutely have to run antivirus

software on the machine where you're running your PostgreSQL database, at the very least exclude the PostgreSQL data directory. However, we have found that some virus scanners don't really pay attention when you tell them to ignore certain directories. And when you think about it, the whole purpose of a virus scanner is to run around and change your files in ways that you didn't ask for. It removes viruses from files: that means it modifies files, it quarantines files, it deletes files. It's making changes to your data, and you do not want anything except for PostgreSQL

to make changes to your PostgreSQL data files. PostgreSQL will not cope with that; you will end up with a corrupted database. And in fact, if you think about it, how could any piece of software cope with some other piece of software randomly deciding to modify its files according to some algorithm? Don't you think it's crazy to expect that to work at all? Also, don't remove postmaster.pid. If you do, you might end up circumventing the checks that prevent multiple copies of the postmaster from running on the same data directory at the same time. If you manage to get multiple

copies of the postmaster running on the same data directory at the same time: congratulations, you're a very clever person, and also you've just corrupted your database. Test all of this stuff, everything we just talked about on this slide; make sure it works. Rip the plug out, and when the system comes back up, see if the database is okay. If it is, do it a few more times and see if it works reliably. Do it when the database is under load: if the database is idle and you rip the plug out, probably nothing is going to happen, but if you're running your workload at full tilt, that's another story. Obviously, do not do this with your production database; do it on a test system.
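
A sketch of that kind of pull-the-plug test, on a disposable test system only; the scale factor, client counts, and paths are arbitrary assumptions.

    # Load some data and run a write-heavy workload with pgbench.
    pgbench -i -s 50 testdb
    pgbench -c 16 -j 4 -T 600 testdb &

    # While that is running, cut power to the machine (or hard-reset the VM).
    # After it boots, make sure PostgreSQL recovers and the data looks sane.
    pg_ctl -D /var/lib/pgsql/data -l /tmp/pg.log start
    psql -d testdb -c "SELECT count(*) FROM pgbench_accounts;"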

Okay, so if our attempts to avoid corruption fail and corruption happens, what do we see? What is the thing that we experience as a result of that having happened? Typically, what people see, or at least what they complain about that eventually makes its way to me, is strange errors. I'll give some examples in a minute. Not all database corruption causes these errors, but it's pretty common, especially if the

corruption is serious. And unfortunately, there is a hugely wide variety of errors that are possible, depending on exactly what has gone wrong. But generally, be on the lookout for errors which seem to be complaining about things internal to the database system, as opposed to user-facing things. So, for example, if you see an error in your log saying that inserting this row would be a foreign key violation, that is not an indication of corruption; that's an indication that somebody tried to do something the constraints don't allow, and there's no real problem there. But

if you see errors that are complaining about things internal to the system, like being unable to access the status of a transaction, or being unable to open a file that's managed by the database, or some problem inside an index where we can't find the parent key, whatever that means (we don't know what that means, right?), it doesn't seem like it should logically be the result of something that a user did. That's the database complaining about some unexpected problem internal to itself, so that may very well be a symptom of corruption. This one

here, "cache lookup failed for relation" followed by some number, is a super common thing that happens when the system catalogs are corrupted. With all of these errors, you might not know what they mean, or maybe you do know what they mean, but you can see that they're all sort of complaining about stuff that isn't really normal; they're complaining about some internal thing. If you don't know what these errors mean and you're getting them, it's a really good time to get a support contract if you don't have one already, because you probably want to talk to someone who does know

specifically what these errors mean and what might have caused them to happen, because then they can tell you in more detail exactly what has gone wrong. Sometimes you don't get an error; sometimes you get a database crash, which seems like it shouldn't happen, but there are a few situations in which it does. You can get wrong answers to queries, where maybe, depending on whether or not you use an index, the same query returns a different answer, or you can get infinite loops. Those problems can be very tricky to diagnose. The errors are a little easier, because just by looking at the error

message you can usually kind of tell. Some of these are things where you could have confusion as to whether they represent database corruption or just a database bug, and that's not always 100% clear, but something is definitely pretty wrong if you start seeing messages like this showing up, complaining about internal things. I originally had about seven slides with different messages that we've actually seen come up, but there just wasn't time to go through that many. So let's say you realize your database is corrupted:

what do you do? Well, the first thing you do is try everything you possibly can to avoid having to do anything with that corrupted data. For example, if the problem is just with a standby and the master looks okay, don't try to fix the standby; throw it out, right? Maybe not the hardware, but at least the database that's on it, and build a new standby, right after checking your disks, hardware, or whatever. If the problem is with the master, maybe you should consider restoring from backup. You're all taking backups, right? Everybody's taking

backups; backups are really important. Or maybe the corruption is only on the master and it's not on the standby. If that's the case, you could fail over to the standby and rebuild the master at a future time. Or, in some cases, people have data in PostgreSQL that they originally replicated from some other system. If that's the case, you might want to just throw away your database, re-replicate the data, and start over. You want to look for alternatives to having to try to get your data out of a corrupt database, because once your database is corrupted,

you may be able to get most of it back, you may be able to get all of it back, but it's really hard to be a hundred percent certain that all of that data is actually still good and that you don't have any lingering problems which are going to cause you trouble later. A lot of people, when they run into corruption, they're like, how do I fix it? And the first answer to that question is: you don't, you do something else, if there's any alternative. But what if that's not the case? What if the only copy of your critical data is in a database that is corrupted, where

all of your copies are corrupted, all your backups are corrupted, all your standbys, and the only choice is to try to recover data from a corrupted database? Well, first of all, please be really careful about that. I'm about to tell you what I recommend that you do about this, but this recommendation comes with no guarantees. It is at your own risk. You may lose your data; please do not blame me. I will tell you that I told you beforehand: this is very risky, exercise a lot of caution. The general approach that

I recommend, based on my experience, if you have to do this, is: first of all, before you do anything, make a copy of the database. Copy all the files, because you are potentially going to be doing things that make it worse, and if that doesn't pan out, you'd like to at least be able to get back to where you were before. Copy it all. Secondly, don't try to just put the damaged database back into service. Your goal should be to use pg_dump to back up all the contents of the

database, restore them into a database created by a fresh initdb, and go from there. Because if you just try to repair the database you've got, it may look like you succeeded, but there may be hidden problems that are actually going to come back and bite you later, right? And without getting into the exact nature of what some of those things are, some of them can take months or years to crop up, and then all of a sudden, two years after you fixed the corruption, you're getting a strange error. How did that happen?
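
The overall shape of that procedure might look like this sketch, where every path, port, and database name is an assumption.

    # 1. With the server stopped, preserve an untouched copy of the damaged cluster.
    cp -a /var/lib/pgsql/data /salvage/data.broken-copy

    # 2. Dump whatever you can from the (restarted) damaged cluster.
    pg_dumpall -p 5432 -f /salvage/salvage.sql

    # 3. Build a brand-new cluster and restore into it.
    initdb -D /var/lib/pgsql/data_new
    pg_ctl -D /var/lib/pgsql/data_new -o "-p 5433" -l /tmp/new.log start
    psql -p 5433 -d postgres -f /salvage/salvage.sql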

So the basic procedure is: back up, pg_dump, restore into a fresh database. But it doesn't always go smoothly. In order to run pg_dump, you have to be able to start the database. What happens if you can't? There is a tool called pg_resetwal, or pg_resetxlog in older versions, that will often make your corrupted database start. It does not make your corrupted database not corrupted; it might make it more corrupted than it was before. It makes it start, okay? And that's sometimes what you need. This is typically your only option if the necessary WAL files have been lost or corrupted and can't be recovered, or similarly for pg_control.
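
A last-resort sketch (the path is an assumption, and this should only ever be pointed at a copy, since it can make matters worse):

    # See what pg_resetwal would do without changing anything.
    pg_resetwal --dry-run -D /salvage/data.broken-copy

    # Then actually reset the WAL so the server can start; on pre-10 releases
    # the same tool is called pg_resetxlog.
    pg_resetwal -D /salvage/data.broken-copy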

In other cases, you might want to consult with someone who's familiar with these topics. But what the tool does is try very hard to make the database start; not to make it good, just to make it start. If the database is so badly corrupted that there is no hope of starting it, you should have a look at the open-source pg_filedump utility. It's not part of PostgreSQL itself; it's a separate project that was maintained for years at Red Hat, and there are now other people maintaining it. In newer versions it has the ability to recover data from the raw data files without needing to start the server, so if you're in a situation where most of your database has been blown away and you have a few files left that you want to extract whatever you can from, that tool may be of help.
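
A sketch of what pulling rows out of a raw heap file with pg_filedump can look like; the relfilenode path and the column type list are assumptions about one hypothetical table.

    # Format the page and tuple headers in a single relation file.
    pg_filedump -i -f /var/lib/pgsql/data/base/16384/16999 > heap_16999.txt

    # Newer pg_filedump releases can also decode tuple contents if you tell
    # it the column types of the table in question.
    pg_filedump -D int,text,timestamp /var/lib/pgsql/data/base/16384/16999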

But suppose we manage to start the database and we still can't run pg_dump. This is an extremely common scenario: pg_dump is designed to work on uncorrupted databases. It has sanity checks in it, and when you try to use it on a corrupted database, those sanity

checks fail, and you have to work around that problem to be able to run pg_dump so that you can get your data out. If the problem is with a specific object, one approach you can consider is just dropping that object, especially if that object is something like an index or a temporary table that is maybe not so critical. Another thing you can do, in the case of indexes, is use REINDEX. You won't be able to drop system indexes, so if you have a corrupt system index, reindexing it is often a good way to overcome that problem.
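
For the index case, a sketch (the object and database names are assumptions); note that REINDEX SYSTEM has to be run by a superuser connected to the database being repaired.

    psql -d mydb -c "REINDEX INDEX my_suspect_idx;"
    psql -d mydb -c "REINDEX TABLE my_table;"
    # System catalog indexes cannot be dropped, but they can be rebuilt.
    psql -d mydb -c "REINDEX SYSTEM mydb;"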

If you can't dump the whole database, maybe you can dump individual tables or schemas, so that you can at least get something out of it. If you can't dump a particular table, you might be able to get at least some of the data out of that table by using a SELECT statement with some kind of WHERE clause. If pg_dump is failing to dump the whole table, a SELECT without a WHERE clause is probably also going to fail for exactly the same reasons, but you might be able to select a subset of the data in your table. There's a hidden column called ctid, which is the physical position of the tuple within the table, so filtering on ctid can be helpful: you can say, please give me the tuples that are not that one over there, which might be the problematic one.
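
As a hypothetical illustration of that ctid trick (the table name, output file, and the bad tuple's position are all assumptions):

    # Export everything except the tuple at a known-bad physical position.
    psql -d mydb -c "\copy (SELECT * FROM my_table WHERE ctid <> '(1234,5)'::tid) TO 'salvage.csv' CSV HEADER"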

EnterpriseDB also has a utility called pg_catcheck, which I find very useful in situations like this. In fact, we wrote it because of situations like this. When you do have corruption in your system catalogs, pg_catcheck can very quickly give you a quite accurate idea of what the problems are and how widespread they are, which may help you to design an appropriate recovery strategy. pg_catcheck is not

part of PostgreSQL; it's in the EnterpriseDB GitHub repository, and typically, if we have a customer who has this kind of problem and contacts us for support, we'll say: could you download this tool, compile it, and run it on your server, and then we'll be able to give you further instructions. I don't know whether other people will like it. I love it; I designed it, so that might be why I love it, but it tells me what I want to know about how your system catalogs are corrupted, and maybe that's what you want to know too. Another thing that can go wrong is: maybe you're able to start the

database just fine. Maybe you manage to run pg_dump without any problems, maybe it works the first time, or maybe after you beat at it using some of the techniques that I just described, or other techniques with which you may be familiar, you finally manage to get a dump, and then you're like, yeah, all my problems are solved. So you go to try to restore the dump into a new database, and it fails, and you're like, what? I converted my data into a text format; how can I still have corruption? Well, you can still have logical corruption. Maybe your database was

corrupted in such a way that a column that had a unique index on it ended up with duplicates. Shouldn't happen, right? But in a corrupted database, who knows. Or maybe you ended up with some foreign key violations in the database that shouldn't occur. Those things won't prevent the data from being backed up by pg_dump, but they will prevent you from restoring that dump with the proper schema.
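
As a purely hypothetical illustration (table and column names are assumptions), a query along these lines can surface the duplicate values so that a person can look at them and decide:

    psql -d mydb -c "SELECT email, count(*), array_agg(ctid) AS row_locations
                     FROM customers GROUP BY email HAVING count(*) > 1;"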

There's no automatic solution to that problem; a human being has to look at the data and decide what to do. For example, if you have a duplicate row, you could merge the two rows, or you could delete one of them, or you could decide to drop the unique index off of that table and keep them both, or whatever you want, right? But you're going to need to decide that based on your business logic. A human being is going to need to examine it and say: I know something about what this data is actually supposed to mean, so I can make a judgment as to what the best thing to do is in this particular situation. That's it, thanks a lot. I think we're... are we out of time? Five minutes? Okay, so I can take a couple of questions. Yes?

Is there any way to improve the time of a backup, which is right now six hours for one terabyte? Some of the backup tools that are out there for Postgres today have parallel and incremental backup functionality built in, so you might want to check into those. I think pgBackRest has some of that; I'm not sure about some of the others, and BART has that capability built into it as well. We're actually working, EnterpriseDB is working with some other people in the community, to try to get parallel and incremental backup functionality into PostgreSQL core, which I think will provide some relief for people in your situation, but it's probably going to be a little while before we get all that engineering work done.
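
As one hedged example of what those tools offer (the stanza name and process count are assumptions; check the tool's own documentation for your setup), pgBackRest can parallelize backups and take incremental ones:

    # Full backup using several parallel processes.
    pgbackrest --stanza=main --type=full --process-max=4 backup

    # Subsequent incremental backups only copy changed files.
    pgbackrest --stanza=main --type=incr --process-max=4 backup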

I have a database that we copy while it's running; after the copy completes and we start it, it starts, but we have to drop and rebuild the indexes. Is there a solution for that? Don't do that. That's exactly what I mean about user error, right? There's a procedure for backing up a running database, and it's not one that you just invented yourself,

right? If it's not what the documentation tells you to do, it doesn't work. And reindexing is only fixing a small portion of what's wrong with that database; you're very likely to have very serious problems at a later time if you proceed in that way. Oh, okay, if it's stopped, then it should work. Then I don't know why that's happening. One thing that sometimes causes indexes to get corrupted is if you copy the stopped database cluster and move it to a different machine with a different and incompatible version of libc on it, because the collations will be different. And so

the indexes on text columns may end up out of order. If it's not that, I don't know; I don't think we can diagnose your particular database corruption problem on the fly right now, but something's wrong. If you stopped the database first, then I take it all back: that is safe, it should work, and something else is going on. We have about three thousand terabytes of data, and no NFS is involved, because we are on Brocade Gen 6 switches and the back end is flash storage.

We have not faced any corruption yet, but I want to understand whether we are at risk with that kind of setup. I know a lot about Postgres, or at least I try to know a lot about Postgres; I don't know anything about your SAN, or anyone else's SAN, so I can't tell you how reliable your SAN is, right? But I can tell you that if you do testing, right, if you deliberately crash, not just the database but the whole operating

system or the whole SAN, and then you bring it back up and you look at whether you have the kinds of corruption problems that I was talking about, like your indexes being bad, or foreign keys being violated, or duplicate keys being violated: you can run your workload on that database, crash it, and see what happens. If you don't get corruption, that's a pretty good sign. It doesn't prove things are good, but it's definitely an indicator. Yeah, I mean, there are so many variations in the hardware and software that people have; there's no general guidance that I can

give you that makes absolutely sure that you're safe. You have to test your own environment, and I know that's a lot of work, and it's disruptive and might impact uptime, and it's hard. So people take their chances, and that's fine; they totally have the right to do that. But if you really want to make it a priority to make sure you don't get corruption, you have to test the kinds of scenarios that could cause corruption, which more than anything else is an operating system crash or a storage crash. Those are the things where you really find out whether or not

your storage is reliable, right? If you wait for it to happen, then you'll find out at that time whether it's likely to cause any problems. If you do it five times and it works fine all five times and you don't get any corruption, that doesn't prove that you won't get corruption at a later time, but it suggests you're probably okay. And if it gets corrupted the first time that you try the test, that told you something for sure, right? [Audience question about the performance overhead of enabling data checksums.] I think so. If your database fits in shared_buffers, the performance impact is very small; if your database is bigger than that, the

impact is somewhat more; I think it's maybe a few percent. It's not huge, but it's definitely something. I think people have constructed cases with like a 10% regression with exactly the right workload, but it's not typically going to be that high. [Audience question about taking a backup from one server, or from a previous version, and restoring it to an EDB server, and whether checksums carry over.]

That's not really related to the talk; let's talk about that afterward. Yeah, so some time back we faced an issue with corruption where it said some block is affected in, let's say, file 1-2-3. My standby was working fine, so I promoted it and everything went well. But I have one question, and I did not try this myself, I just want to understand: if my standby is working fine and there's no corruption on it, and I know which particular file is affected, will it make

sense, and will it work, if I simply copy that particular file from the standby to the master and restart it? Okay, got it. Thank you, I think we're out of time.
