Terry : ZFS Gotchas

ZFS Read Me First

Things Nobody Told You About ZFS

1. Virtual Devices Determine IOPS

IOPS (I/O per second) are mostly a factor of the number of virtual devices (vdevs) in a zpool. They are not a factor of the raw number of disks in the zpool. This is probably the single most important thing to realize and understand, and is commonly not.

ZFS stripes writes across vdevs (not individual disks). A vdev is typically IOPS bound to the speed of the slowest disk within it. So if you have one vdev of 100 disks, your zpool's raw IOPS potential is effectively only a single disk, not 100. There's a couple of caveats on here (such as the difference between write and read IOPS, etc), but if you just put as a rule of thumb in your head that a zpool's raw IOPS potential is equivalent to the single slowest disk in each vdev in the zpool, you won't end up surprised or disappointed.

2. Deduplication Is Not Free

Another common misunderstanding is that ZFS deduplication, since its inclusion, is a nice, free feature you can enable to hopefully gain space savings on your ZFS filesystems/zvols/zpools. Nothing could be farther from the truth. Unlike a number of other deduplication implementations, ZFS deduplication is on-the-fly as data is read and written. This creates a number of architectural challenges that the ZFS team had to conquer, and the methods by which this was achieved lead to a significant and sometimes unexpectedly high RAM requirement.

Every block of data in a dedup'ed filesystem can end up having an entry in a database known as the DDT (DeDupe Table). DDT entries need RAM. It is not uncommon for DDT's to grow to sizes larger than available RAM on zpools that aren't even that large (couple of TB's). If the hits against the DDT aren't being serviced primarily from RAM or fast SSD, performance quickly drops to abysmal levels. Because enabling/disabling deduplication within ZFS doesn't actually do anything to data already on disk, do not enable deduplication without a full understanding of its requirements and architecture first. You will be hard-pressed to get rid of it later.

3. Snapshots Are Not Backups

This is critically important to understand. ZFS has redundancy levels from mirrors and raidz. It has checksums and scrubs to help catch bit rot. It has snapshots to take lightweight point-in-time captures of data to let you roll back or grab older versions of files. It has all of these things to help protect your data. And one 'zfs destroy' by a disgruntled employee, one fire in your datacenter, one random chance of bad luck that causes a whole backplane, JBOD, or a number of disks to die at once, one faulty HBA, etc, etc, etc -- and poof, your pool is gone. I've seen it. Lots of times. MAKE BACKUPS.

4. ZFS Destroy Can Be Painful

Something often waxed over or not discussed about ZFS is how it presently handles destroy tasks. This is specific to the "zfs destroy" command, be it used on a zvol, filesystem, clone or snapshot. This does not apply to deleting files within a ZFS filesystem or on the filesystem formatted onto a zvol, etc. It also does not apply to "zpool destroy". ZFS destroy tasks are potential downtime causers, when not properly understood and treated with the respect they deserve. Many a SAN has suffered impacted performance or full service outages due to a "zfs destroy" in the middle of the day on just a couple of terabytes (no big deal, right?) of data. The truth is a "zfs destroy" is going to go touch many of the metadata blocks related to the object(s) being destroyed. Depending on the block size of the destroy target(s), the number of metadata blocks that have to be touched can quickly reach into the millions, even the hundreds of millions.

If a destroy needs to touch 100 million blocks, and the zpool's IOPS potential is 10,000, how long will that zfs destroy take? Somewhere around 2 1/2 hours! That's a good scenario - ask any long-time ZFS support person or administrator and they'll tell you horror stories about day long, even week long "zfs destroy" commands. There's eventual work that can be done to make this less painful (a major one is in the works right now) and there's a few things that can be done to mitigate it, but at the end of the day, always check the actual used disk size of something you're about to destroy and potentially hold off on that destroy if it's significant. How big is too big? That is a factor of block size, pool IOPS potential, extenuating circumstances (current I/O workload of the pool, deduplication on or off, a few other things).

5. RAID Cards vs HBA's

ZFS provides RAID, and does so with a number of improvements over most traditional hardware RAID card solutions. ZFS uses block-level logic for things like rebuilds, it has far better handling of disk loss & return due to the ability to rebuild only what was missed instead of rebuilding the entire disk, it has access to more powerful processors than the RAID card and far more RAM as well, it does checksumming and auto-correction based on it, etc. Many of these features are gone or useless if the disks provided to ZFS are, in fact, RAID LUN's from a RAID card, or even RAID0 single-disk entities offered up.

If your RAID card doesn't support a true "JBOD" (sometimes referred to as "passthrough") mode, don't use it if you can avoid it. Creating single-disk RAID0's (sometimes called "virtual drives") and then letting ZFS create a pool out of those is better than creating RAID sets on the RAID card itself and offering those to ZFS, but only about 50% better, and still 50% worse than JBOD mode or a real HBA. Use a real HBA - don't use RAID cards.

6. SATA vs SAS

This has been a long-standing argument in the ZFS world. Simple fact is, the majority of ZFS storage appliances, most of the consultants and experts you'll talk to, and the majority of enterprise installations of ZFS are using SAS disks. To be clear, "nearline" SAS (7200 RPM SAS) is fine, but what will often get you in trouble is the use of SATA (including enterprise-grade) disks behind bad interposers (which is most of them) and SAS expanders (which almost every JBOD is going to be utilizing).

Plan to purchase SAS disks if you're deploying a 'production' ZFS box. In any decent-sized deployment, they're not going to have much of a price delta over equivalent SATA disks. The only exception to this rule is home and very small business use-cases -- and for more on that, I'll try to wax on about it in a post later.

7. Compression Is Good (Even When It Isn't)

It is the very rare dataset or use-case that I run into these days where compress=on (lzjb) doesn't make sense. It is on by default on most ZFS appliances, and that is my recommendation. Turn it on, and don't worry about it. Even if you discover that your compression ratio is nearly 0% - it still isn't hurting you enough to turn it off, generally speaking. Other compression algorithms such as gzip are another matter entirely, and in almost all cases should be strongly avoided.

8. RAIDZ - Even/Odd Disk Counts

Try (and not very hard) to keep the number of data disks in a raidz vdev to an even number. This means if its raidz1, the total number of disks in the vdev would be an odd number. If it is raidz2, an even number, and if it is raidz3, an odd number again. Breaking this rule has very little repercussion, however, so you should do so if your pool layout would be nicer by doing so.

9. Pool Design Rules

A variety of simple rules I tell people to follow when building zpools:

  •  Do not use raidz1 for disks 1TB or greater in size.
  • For raidz1, do not use less than 3 disks, nor more than 7 disks in each vdev.
  • For raidz2, do not use less than 5 disks, nor more than 10 disks in each vdev.
  • For raidz3, do not use less than 7 disks, nor more than 15 disks in each vdev.
  • Mirrors trump raidz almost every time. Far higher IOPS potential from a mirror pool than any raidz pool, given equal number of drives.
  • For 3TB+ size disks, 3-way mirrors begin to become more and more compelling.
  • Never mix disk sizes (within a few %, of course) or speeds (RPM) within a single vdev.
  • Never mix disk sizes (within a few %, of course) or speeds (RPM) within a zpool, except for l2arc & zil devices.
  • Never mix redundancy types for data vdevs in a zpool.
  • Never mix disk counts on data vdevs within a zpool (if the first data vdev is 6 disks, all data vdevs should be 6 disks).
  • If you have multiple JBOD's, try to spread each vdev out so that the minimum number of disks are in each JBOD. If you do this with enough JBOD's for your chosen redundancy level, you can even end up with no SPOF (Single Point of Failure) in the form of JBOD, and if the JBOD's themselves are spread out amongst sufficient HBA's, you can even remove HBA's as a SPOF.

If you keep these in mind when building your pool, you shouldn't end up with something tragic.

10. 4KB Sector Disks

There are a number of in-the-wild devices that are 4KB sector size instead of the old 512-byte sector size. ZFS handles this just fine if it knows the disk is 4K sector size. The problem is a number of these devices are lying to the OS about their sector size, claiming it is 512-byte (in order to be compatible with ancient Operating Systems like Windows 95); this will cause significant performance issues if not dealt with at zpool creation time.

11. ZFS Has No "Restripe"

If you're familiar with traditional RAID arrays, then the term "restripe" is probably in your vocabulary. Many people in this boat are surprised to hear that ZFS has no equivalent function at all. The method by which ZFS delivers data to the pool has a long-term equivalent to this functionality, but not an up-front way nor a command that can be run to kick off such a thing.

The most obvious task where this shows up is when you add a vdev to an existing zpool. You could be forgiven to expect that the existing data in the pool would slide over and all your vdevs would end up of roughly equal used size, since that's what a traditional RAID array would do. ZFS? It won't. That data balancing will only come as an indirect result of rewrites. If you only ever read from your pool, it'll never happen.

12. Hot Spares

Don't use them. Pretty much ever. Warm spares make sense in some environments. Hot spares almost never make sense. Very often it makes more sense to include the disks in the pool and increase redundancy level because of it, than it does to leave them out and have a lower redundancy level.

13. ZFS Is Not A Clustered Filesystem

I don't know where this got started, but at some point, something must have been said that has led some people to believe ZFS is or has clustered filesystem features. It does not. ZFS lives on a single set of disks in a single system at a time, period. Various HA technologies have been developed to seamlessly move the pool from one machine to another in case of hardware issues, but they move the pool - they don't offer up the storage from multiple heads at once.

14. To ZIL, Or Not To ZIL

This is a common question - do I need a ZIL (ZFS Intent Log)? So, first of all, this is the wrong question. In almost every storage system you'll ever build utilizing ZFS, you will need and will have a ZIL. The first thing to explain is that there is a difference between the ZIL and a ZIL (referred to as a log or slog) device. It is very common for people to call a log device a "ZIL" device, but this is wrong - there is a reason ZFS' own documentation always refers to the ZIL as the ZIL, and a log device as a log device. Not having a log device does not mean you do not have a ZIL!

So with that explained, the real question is, do you need to direct those writes to a separate device from the pool data disks or not? In general, you do if one or more of the intended use-cases of the storage server are very write latency sensitive, or if the total combined IOPS requirement of the clients is approaching say 30% of the raw pool IOPS potential of the zpool. In such scenarios, the addition of a log vdev can have an immediate and noticeable positive performance impact. If neither of those is true, it is likely you can just skip a log device and be perfectly happy. Most home systems, for example, have no need of a log device.

15. ARC and L2ARC

One of ZFS' strongest performance features is its intelligent caching mechanisms. The primary cache, stored in RAM, is the ARC (Adaptive Replacement Cache). The secondary cache, typically stored on fast media like SSD's, is the L2ARC (second level ARC). Basic rule of thumb in almost all scenarios is don't worry about L2ARC, and instead just put as much RAM into the system as you can, within financial realities. ZFS loves RAM, and it will use it - there is a point of diminishing returns depending on how big the total working set size really is for your dataset(s), but in almost all cases, more RAM is good. If your use-case does lend itself to a situation where RAM will be insufficient and L2ARC is going to end up being necessary, there are rules about how much addressable L2ARC one can have based on how much ARC (RAM) one has.

16. Just Because You Can, Doesn't Mean You Should

ZFS has very few limits - and what limits it has are typically measured in zillions, and are thus unreachable with modern hardware. Does that mean you should create a single pool made up of 5,000 hard disks? In almost every scenario, the answer is no. The fact that ZFS is so flexible and has so few limits means, if anything, that proper design is more important than in legacy storage systems. It is a truism that in most environments that need lots of storage space, it is likely more efficient and architecturally sound to find a smaller-than-total break point and design systems to meet that size, then build more than one of them to meet your total space requirements. There is almost never a time when this is not true.

It is very rare for a company to need 1 PB of space in one filesystem, even if it does need 1 PB in total space. Find a logical separation and build to meet it, not go crazy and try to build a single 1 PB zpool. ZFS may let you, but various hardware constraints will inevitably doom this attempt or create an environment that works, but could have worked far better at the same or even lower cost.

Learn from Google, Facebook, Amazon, Yahoo and every other company with a huge server deployment -- they learned to scale out, with lots of smaller systems, because scaling up with giant systems not only becomes astronomically expensive, it quickly ends up being a negative ROI versus scaling out.

17. Crap In, Crap Out

ZFS is only as good as the hardware it is put on. Even ZFS can corrupt your data or lose it, if placed on inferior components. Examples of things you don't want to do if you want to keep your data intact include using non-ECC RAM, using non-enterprise disks, using SATA disks behind SAS expanders, using non-enterprise class motherboards, using a RAID card (especially one without a battery), putting the server in a poor environment for a server to be in, etc.