Racks of servers, a low drone of fan noise. A breeze of cool air on your face as you walk through the cold aisle. Disks softly clicking away like a synchronized ballet, reliably serving data to hungry processors and memory subsystems. Network switches, carrying data out to waiting customers to enrich their lives.
Well, that's the dream at least. But this is your datacenter. Your network has terminal cancer (the DNS kind). Your internet connection is fifty megabits too small, and your groaning old disk subsystems have all the grace and poise of a Call of Duty team deathmatch. Sure, there are problems, but accounting won't budge until they feel some pain, some damage. Problem is, by the time they feel it, it'll be too late. As in, too late to save your company. Because a disk just failed in your SAN, and what you didn't know is that it wasn't the only one not feeling well.
It's what you don't know that can hurt you: namely, how error detection and correction on disks actually works. First off, all modern storage subsystems use error correction codes (ECC). You may recognize that term from your server memory, where the fine print on the label always says "corrects single-bit errors, detects double-bit errors". That should be your first clue as to how ECC works: a code can detect errors roughly twice as large as the ones it can correct (in Hamming-distance terms, a code with minimum distance d can detect up to d-1 bit errors but correct only about half that many). So if it can correct 46 bytes in your 512-byte sector hard drive, it can detect up to about 92 bytes of error. Whatever isn't correctable is reported to the controller as uncorrectable, and the disk controller increments the "uncorrectable error" counter in S.M.A.R.T.
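To make that detect-versus-correct relationship concrete, here's a toy Python sketch. It's purely illustrative (real drives use Reed-Solomon or LDPC codes over symbols, not anything this naive), but the ratio falls out the same way:

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# For any code whose valid codewords are at least d bits apart
# (minimum Hamming distance d):
#   detectable errors:  up to d - 1 flipped bits
#   correctable errors: up to (d - 1) // 2 flipped bits
# i.e. it can detect roughly twice as many errors as it can correct.
d = 5  # hypothetical minimum distance
print("detects up to", d - 1, "bit errors")          # 4
print("corrects up to", (d - 1) // 2, "bit errors")  # 2

# An error within (d-1)//2 bits of a *different* valid codeword gets
# "corrected" to the wrong data; a bigger one can land exactly on
# another valid codeword and sail through undetected.
```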
Guess what happens to any error larger than that? It isn't detected. It is passed straight up the stack to the controller as good data. Yes, you read that right. Go read it again. We'll wait for your jaw to come off the floor.
But surely RAID will catch it, right? Wrong. RAID actually depends on the disk's ECC subsystem to detect errors and report them, so it can pull data from another disk or reconstruct it via parity, depending on your RAID level. Take RAID 5, for instance: does it hit every disk in your array on every single read to recompute parity and make sure nothing is amiss? Negative. RAID does not preserve the integrity of your data. It only addresses the availability of your data, nothing more.
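A minimal Python sketch of why, using a hypothetical three-disk RAID 5 stripe: XOR parity can rebuild a block once a drive admits it has failed, but it has no way to finger a block that lies.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Hypothetical stripe: two data blocks plus their parity block.
d0 = b"\x11" * 4
d1 = b"\x22" * 4
parity = xor_blocks([d0, d1])

# When a disk REPORTS a failure, parity rebuilds the missing block:
assert xor_blocks([d0, parity]) == d1

# But if d1 is silently corrupted, no ordinary read checks the parity.
d1_bad = b"\x23" * 4
assert xor_blocks([d0, d1_bad, parity]) != bytes(4)  # stripe inconsistent...
# ...yet even a full parity check can only say "something here is wrong";
# it cannot tell whether d0, d1, or the parity block is the liar.
```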
So what about the file system? Hate to burst your bubble, but the most ubiquitous file systems in use today (NTFS, HFS+, XFS, UFS, ext) are woefully unprepared for data corruption, with no mechanism to verify that the data they get from the storage subsystem is good. Some can checksum their own metadata, but that's about it.
Ok, well, what about your backups? Hate to break it to you, but what good are backups when you've been feeding them corrupt data? Garbage in, garbage out. So we all just need to buy expensive, higher-quality disks, right? Thank Google for busting that myth wide open: across their huge drive populations, they discovered essentially identical failure rates among drives.
So, to recap. We have corruption on our disks that is invisible to ECC. RAID is useless rubble that just passes it along, because it relies on the disk's ECC to say whether the data is good or bad. The transport delivers the data intact, courtesy of its own ECC, but the data was already corrupt when it left the disk. The file system is blind to the corruption and passes it right up to the application, which freaks out and in all likelihood crashes immediately. And our backups, which everyone knee-jerks to as the gold standard, are useless as well, because they've been fed the same bad data.
This all stems from a huge, industry-wide attitude that data integrity can simply be taken for granted. Nobody really cares whether you get good data or bad data; they just care about moving it as quickly as possible so they look good on performance benchmarks.
First off, the SAN camp has finally gotten its brains pointed in the right direction and implemented a little thing called T10 PI (SCSI DIF), which expands each 512-byte sector to 520 bytes to hold some additional tags and a 16-bit checksum (CRC-16-T10-DIF). Checksums are far more sensitive to disturbances in the force, and can reliably detect single-bit errors; they just can't do anything about them but cry like Chicken Little. The field is rigidly defined, checksum algorithm included, so every device in the path can independently verify it. Nice. But there are some downsides to this implementation.
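The guard-tag checksum itself is small enough to sketch. Here's a straightforward (and deliberately unoptimized) bit-at-a-time CRC-16-T10-DIF in Python, polynomial 0x8BB7, verified against the standard catalogue check value:

```python
def crc16_t10_dif(data: bytes) -> int:
    """Bit-at-a-time CRC-16-T10-DIF (poly 0x8BB7, init 0, no reflection)."""
    crc = 0x0000
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

# Standard check value for the ASCII string "123456789":
assert crc16_t10_dif(b"123456789") == 0xD0DB

sector = bytes(range(256)) * 2            # a hypothetical 512-byte sector
guard = crc16_t10_dif(sector)             # the 16-bit guard tag in the PI field
print(f"guard tag: 0x{guard:04X}")
```

Running that inner loop eight times per byte of every sector is exactly the CPU tax described next, which is why you want the HBA doing it in hardware.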
CRC-16, while reliable, is expensive to calculate in software, and HBAs that don't implement it in hardware pass a significant tax on to the CPU. And to top it off, if you actually run something like Oracle's DIX implementation (so that the OS can verify the checksum as well), you'll be computing it twice: once for the HBA and once for the OS.
Another problem: you need special disks, special firmware, special HBAs, special drivers, and kernel support to carry the protection all the way up the stack. Very few vendors actually support it today, and unless the coverage extends a significant chunk of the way (say, at least up to the HBA), you're still in significant danger.
Oh, and I'm not done throwing dirt on it yet. Do you have any good mechanism to pull all the data on the array at an opportune time and recompute its checksums to verify that it's still good? Because if even one sector has silently gone bad on a disk and you then lose another disk, with only a single disk's margin of safety you're sunk: you now have no way to rebuild that data. The only way to reliably know that a sector made it onto the media successfully is to read it back!
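For illustration, such a verification pass (a "scrub") is conceptually just this hedged Python sketch, where zlib.crc32 stands in for whatever checksum your stack actually stores and the sector list is hypothetical:

```python
import zlib

def scrub(sectors):
    """Read back every sector and verify it against its stored checksum."""
    bad = []
    for lba, (data, stored) in enumerate(sectors):
        if zlib.crc32(data) != stored:
            bad.append(lba)  # found NOW, while redundancy still exists
    return bad

good = b"payload." * 64
sectors = [(good, zlib.crc32(good))] * 3
# Bit rot: the data changed on the platter, the stored checksum didn't.
sectors.append((b"r0tted!!" * 64, zlib.crc32(good)))
print(scrub(sectors))  # [3]
```

The point is the read: until something actually pulls the sector back off the media and checks it, a rotted sector costs you nothing visible, right up until a rebuild needs it.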
On the personal side, what's happening to your vacation pictures, or the pictures of your newborn child? Did I mention that SATA doesn't have SCSI DIF? And by the way, when drives went to 4K sectors with "enhanced ECC", guess what the vendors did? They "doubled" it. As in, they blew the sectors up to eight times the size while only doubling the per-sector ECC. Did you catch that math? If a 512-byte sector carried, say, 50 bytes of ECC, eight of them carried 400 bytes for that 4 KB of data; the one 4 KB sector that replaced them carries only about 100. You now have a quarter of the ECC coverage you previously had. Have fun with that.
If you're still with me, you're one step closer to becoming a paranoid storage zealot. So, what are we to do to combat this growing epidemic? Use a modern file system designed from the ground up for data integrity. ZFS, for instance. It checksums every block for integrity (optionally even using cryptographically strong hashes like SHA-256) and can automatically heal data using known-good replicas, because it verifies checksums on every read and write, it knows exactly which blocks are good and which are bad. It can also run periodic scrubs at opportune times that pull every block of data from every disk in the array and ensure it's all still good. There are loads of other hugely useful ZFS features too, but those are beyond the scope of this blog post.
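Conceptually, that read path looks something like this hand-rolled Python sketch of a two-way mirror. This is my illustration, not ZFS's actual code; ZFS keeps the checksum in the parent block pointer, separate from the data it covers:

```python
import hashlib

good = b"family photos, payroll, the works"
replicas = [bytearray(good), bytearray(good)]   # hypothetical two-way mirror
stored_hash = hashlib.sha256(good).digest()     # recorded at write time

replicas[0][0] ^= 0xFF  # silent corruption strikes replica 0

def read_block() -> bytes:
    """Verify on every read; heal any bad replica from a good one."""
    for i, rep in enumerate(replicas):
        if hashlib.sha256(bytes(rep)).digest() == stored_hash:
            for j, other in enumerate(replicas):
                if j != i and hashlib.sha256(bytes(other)).digest() != stored_hash:
                    replicas[j] = bytearray(rep)   # self-heal the bad copy
            return bytes(rep)
    raise IOError("all replicas failed verification: report, don't guess")

assert read_block() == good          # the application never sees bad data
assert replicas[0] == replicas[1]    # and the corrupt replica got repaired
```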
It's time to get serious about data integrity. Silent corruption isn't a myth; it happens to real people in the real world. I was personally burned by it three times before I started using ZFS. Now you have some options: go use them and save your data, and quite possibly your whole company. Contact NetWork Center, Inc. if you have any questions.