Yes, emphatically agreed. I co-founded Fishworks, the group within Sun that shipped a ZFS-based appliance. The product was (and, I add with mixed emotion, still is) very commercially successful, and we shipped hundreds of thousands of spindles running ZFS in production, enterprise environments. And today I'm at Joyent, where we run ZFS in production on tens of thousands of spindles and support software customers running many tens of thousands more. Across all of that experience -- which has (naturally) included plenty of pain -- I have never needed or wanted anything resembling a traditional "fsck" for ZFS. Those who decry ZFS's lack of a fsck simply don't understand (1) the semantics of ZFS and specifically of pool import, (2) the ability to roll back transactions and/or (3) the presence of zdb (which we've taken pains to document in illumos[1], the repository of record for ZFS). So please, take it from someone with a decade of production experience with ZFS: it does not need fsck.
Hi. rsync.net here. I wonder if you concur with us that ZFS does need a defrag utility?
I know you can use the copy-back-and-forth method (and we have) to rebalance how data is organized across multiple vdevs inside one pool, but it would be nice if that were not a manual process, or if it could be an option on a scrub.
Or something.
People can and do run their pools up higher than 80% utilization. It happens. It's happened to you. There should be a non-surgical way to regain balanced vdevs after such a state...
Ah, yes -- now we're talking about a meaningful way in which ZFS can be improved! (But then, you know that.) Metaslab fragmentation is a very real issue -- and (as you point out) when pools are driven up in terms of utilization, that fragmentation can become acute. (Uday Vallamsetty at Delphix has an excellent blog entry that explores this visually and quantitatively.[1]) In terms of fixing it: ZFS co-inventor Matt Ahrens did extensive prototyping work on block pointer rewrite, but the results were mixed[2] -- and it was a casualty of the Oracle acquisition regardless. I don't know if the answer is rewriting blocks or behaving better when the system has become fragmented (or both), but this is surely a domain in which ZFS can be improved. I would encourage anyone interested in taking a swing at this problem to engage with Uday and others in the community -- and to attend the next ZFS Day[3].
We'll see what we can do. We have some ZFS firefighting that dominates our to-do list currently[1][2] but if we can work through that in short order we will dedicate some funds and resources to the FreeBSD side of ZFS and getting "defrag" in place.
I meant to attend ZFS Day and will try to come in 2013 if it is held.
[1] Space accounting. How much uncompressed space, minus ZFS metadata, does that ZFS filesystem actually take up? Nobody knows.
Hey Bryan, I loved your Fork Yeah! talk at USENIX. Everyone who's interested in the history and future of Solaris and ZFS should watch it: https://www.youtube.com/watch?v=-zRN7XLCRhc
I know it's a bit off topic but could you elaborate on the issues you see with BTRFS?
From an operations perspective, ZFS is a godsend. Correct me if I'm wrong, but AFAIK BTRFS lacks any significant quantum of improvement over ZFS; it is merely a "better licensed", not-invented-here, mostly duplicated effort with a couple of refinements, and it isn't nearly as well thought out from the experience of ops folk.
- Online scrubbing rather than fsck, so there is little/no downtime if the filesystem is interrupted, e.g., by a datacenter power failure. fsck/RAID rebuild of large filesystems can mean lengthy outages for users.
- "Always" consistent: Start writing the data of a transaction to unallocated space (or ZIL) and update metadata last.
- Greatly configurable block device layer:
* RAIDZ, RAIDZ2, RAIDZ3, mirror, concat ...
* ZIL (fs journal), L2ARC (cache) can be placed on different media or even combinations of media.
- Send & receive snapshots across the network.
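The "always consistent" point is easier to see with a toy model. What follows is only a sketch of the copy-on-write idea in Python, not how ZFS is actually implemented: the class `ToyCowStore`, its fields, and the `crash_before_commit` flag are all made up for illustration. New data goes to a fresh address first, and the committed metadata is swapped last, so an interruption between the two steps leaves the old state readable.

```python
class ToyCowStore:
    """Hypothetical toy store illustrating copy-on-write; not ZFS code."""

    def __init__(self):
        self.blocks = {}      # address -> bytes; existing entries are never overwritten
        self.root = {}        # committed metadata: name -> address
        self._next_addr = 0   # next unallocated address

    def _allocate(self, data):
        # Place data at an address that nothing committed points to yet.
        addr = self._next_addr
        self._next_addr += 1
        self.blocks[addr] = data
        return addr

    def write(self, name, data, crash_before_commit=False):
        # Step 1: write the new data to unallocated space.
        addr = self._allocate(data)
        if crash_before_commit:
            # Simulated power loss: metadata was never touched,
            # so readers still see the previous version.
            return
        # Step 2: build new metadata and commit it with one reference swap.
        new_root = dict(self.root)
        new_root[name] = addr
        self.root = new_root

    def read(self, name):
        return self.blocks[self.root[name]]


store = ToyCowStore()
store.write("file", b"version 1")
store.write("file", b"version 2", crash_before_commit=True)
print(store.read("file"))  # still the last committed version
```

The interrupted write leaves an orphaned block behind, but nothing committed refers to it, which is why there is no half-updated state to repair afterwards.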
Reading about ZFS on Wikipedia and here [1] I see that there is a procedure called scrubbing. From [1]: "ZFS provides a mechanism to perform routine checking of all inconsistencies. This functionality, known as scrubbing, is commonly used in memory and other systems as a method of detecting and preventing errors before they result in hardware or software failure."
So I guess "alias fsck='zpool scrub tank'" would quell everyone's concerns.
Scrubbing is more like this: read the block and calculate its checksum. Is it the same as the checksum stored for the block (not at the block, an important difference)? Nice. Not the same checksum? Go to the mirror and fetch the correct copy, or use an algorithm to reconstruct the missing block from the raidz1/2/3 set, and write it to the disk with the error.
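Roughly, in code. This is a minimal Python sketch of that loop for a mirror only, under loud assumptions: `scrub`, `checksum`, the `block_pointers` dict and the list of mirror dicts are hypothetical stand-ins, SHA-256 is just a convenient checksum here, and raidz reconstruction is left out entirely. The expected checksum lives in a separate structure (standing in for the parent block pointer), not next to the data it protects.

```python
import hashlib


def checksum(data: bytes) -> str:
    # Stand-in checksum for this sketch.
    return hashlib.sha256(data).hexdigest()


def scrub(block_pointers, mirrors):
    """Verify every block on every mirror side and repair bad copies.

    block_pointers: {block_id: expected_checksum}, stored apart from the data.
    mirrors: list of {block_id: bytes}, one dict per mirror side.
    Returns (repaired, unrecoverable), where repaired is a list of
    (block_id, mirror_index) and unrecoverable is a list of block_ids.
    """
    repaired = []
    unrecoverable = []
    for block_id, expected in block_pointers.items():
        good = None
        bad_sides = []
        for index, side in enumerate(mirrors):
            data = side.get(block_id)
            if data is not None and checksum(data) == expected:
                if good is None:
                    good = data          # first copy that verifies
            else:
                bad_sides.append(index)  # missing or corrupt copy
        if good is None:
            unrecoverable.append(block_id)  # no side holds a valid copy
            continue
        for index in bad_sides:
            mirrors[index][block_id] = good  # rewrite the bad copy
            repaired.append((block_id, index))
    return repaired, unrecoverable


pointers = {"a": checksum(b"hello"), "b": checksum(b"world")}
side0 = {"a": b"hello", "b": b"wXrld"}   # silently corrupted copy of "b"
side1 = {"a": b"hello", "b": b"world"}
print(scrub(pointers, [side0, side1]))
```

Because the expected checksum is kept away from the block itself, a block that is corrupted together with its own metadata cannot vouch for itself; that is the "not at the block" difference.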
I see. From the description it's quite obvious that ZFS handles most of the fault cases automatically. Still, admins might want to force it to recheck all the blocks, and this command achieves exactly that. Frankly, I don't know exactly what fsck does, and I guess it's different for every FS. For ZFS, scrubbing seems to be an adequate alternative.
Hence, if anyone asks you whether ZFS has fsck, it's easier to say yes, it does, and it's called scrubbing.
[1] http://illumos.org/man/1m/zdb