Yes, emphatically agreed. I co-founded Fishworks, the group within Sun that ship...

rsync · on March 29, 2013

Hi. rsync.net here. I wonder if you concur with us that ZFS does need a defrag utility?

I know you can use the copy-back-and-forth method (and we have) to rebalance how things are organized across multiple vdevs inside one pool, but it would be nice if that were not a manual process, or if it could be an option on a scrub.

Or something.

People can and do run their pools up higher than 80% utilization. It happens. It's happened to you. There should be a non-surgical way to regain balanced vdevs after such a state...

bcantrill · on March 29, 2013

Ah, yes -- now we're talking about a meaningful way in which ZFS can be improved! (But then, you know that.) Metaslab fragmentation is a very real issue -- and (as you point out) when pools are driven up in terms of utilization, that fragmentation can become acute. (Uday Vallamsetty at Delphix has an excellent blog entry that explores this visually and quantitatively.[1]) In terms of fixing it: ZFS co-inventor Matt Ahrens did extensive prototyping work on block pointer rewrite, but the results were mixed[2] -- and it was a casualty of the Oracle acquisition regardless. I don't know if the answer is rewriting blocks or behaving better when the system has become fragmented (or both), but this is surely a domain in which ZFS can be improved. I would encourage anyone interested in taking a swing at this problem to engage with Uday and others in the community -- and to attend the next ZFS Day[3].

[1] http://blog.delphix.com/uday/2013/02/19/78/

[2] http://permalink.gmane.org/gmane.os.illumos.devel/5203

[3] http://zfsday.com/zfsday/

rsync · on March 29, 2013

We'll see what we can do. We have some ZFS firefighting that dominates our to-do list currently[1][2] but if we can work through that in short order we will dedicate some funds and resources to the FreeBSD side of ZFS and getting "defrag" in place.

I meant to attend zfs day and will try to come in 2013 if it is held.

[1] Space accounting. How much uncompressed space, minus ZFS metadata, does that ZFS filesystem actually take up? Nobody knows.

[2] extattrs + busy ZFS == crash

anon987 · on March 29, 2013

Hey Bryan, I loved your Fork Yeah! talk at USENIX. Everyone who's interested in the history and future of Solaris and ZFS should watch it: https://www.youtube.com/watch?v=-zRN7XLCRhc

I know it's a bit off topic but could you elaborate on the issues you see with BTRFS?

ballard · on March 30, 2013

From an operations perspective, ZFS is a god-send. Correct me if I'm wrong, but AFAIK BTRFS offers no significant improvement over ZFS; it is merely a "better licensed", not-invented-here, mostly duplicated effort with a couple of refinements, and it isn't nearly as well thought out from the perspective of ops folk.

- Online scrubbing rather than fsck, so there is little/no downtime if the filesystem were interrupted, i.e., datacenter power failure. fsck/raid rebuild of large filesystems can mean lengthy outages for users.
- "Always" consistent: Start writing the data of a transaction to unallocated space (or ZIL) and update metadata last.
- Greatly configurable block device layer:
  * RAIDZ, RAIDZ2, RAIDZ3, mirror, concat ...
  * ZIL (fs journal), L2ARC (cache) can be placed on different media or even combinations of media.
- Send & receive snapshots across the network.
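To make the "always consistent" point concrete, here is a minimal toy sketch in Python of the write-data-first, update-metadata-last idea. It is not ZFS code; the class `ToyCowStore`, its methods, and the `crash_before_commit` flag are all hypothetical names invented for illustration.

```python
class ToyCowStore:
    """Toy copy-on-write store (hypothetical; not how ZFS is implemented)."""

    def __init__(self):
        self.blocks = {}    # address -> bytes; existing blocks are never overwritten
        self.next_addr = 0  # next unallocated address
        self.root = {}      # committed metadata: name -> block address

    def _alloc(self, data):
        # New data always goes to previously unallocated space.
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = data
        return addr

    def write_txn(self, updates, crash_before_commit=False):
        # Step 1: write all new data blocks and build new metadata on the side.
        new_root = dict(self.root)
        for name, data in updates.items():
            new_root[name] = self._alloc(data)
        # Simulated power failure: the old metadata is still intact.
        if crash_before_commit:
            return False
        # Step 2: metadata is switched over last, in a single step.
        self.root = new_root
        return True

    def read(self, name):
        return self.blocks[self.root[name]]
```

In this sketch, a transaction that is interrupted before the final step leaves readers looking at the previous version, which is the property that removes the need for an offline consistency check after a crash.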

anon987 · on March 30, 2013

I totally agree.

From what I understand ZFS will never be supported for RHEL (and that's mostly what I work with) so I'm hoping for the best with BTRFS.

notimetorelax · on March 29, 2013

Reading about ZFS on Wikipedia and here [1] I see that there is a procedure called scrubbing. From [1]: "ZFS provides a mechanism to perform routine checking of all inconsistencies. This functionality, known as scrubbing, is commonly used in memory and other systems as a method of detecting and preventing errors before they result in hardware or software failure."

So I guess "alias fsck='zpool scrub tank'" would quell everyone's concerns.

[1] http://docs.oracle.com/cd/E19082-01/817-2271/gbbwa/

c0t0d0s0 · on March 29, 2013

Scrubbing is more like this: read the block and calculate its checksum. Is it the same as the checksum stored for the block (not at the block, an important difference)? Nice. Not the same checksum? Go to the mirror and fetch the correct copy, or use the parity algorithm to reconstruct the missing block from the raidz1/2/3 set, and write it to the disk with the error.
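The loop described above can be sketched roughly as follows. This is a toy Python model of a mirror only, not ZFS code: `ToyMirror`, `checksum`, and the choice of SHA-256 are assumptions made for illustration, and raidz reconstruction is left out.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is an arbitrary choice for this sketch.
    return hashlib.sha256(data).hexdigest()


class ToyMirror:
    """Hypothetical N-way mirror whose checksums live apart from the data."""

    def __init__(self, blocks, copies=2):
        # Checksums are stored *for* each block, not *at* the block.
        self.expected = [checksum(b) for b in blocks]
        self.disks = [list(blocks) for _ in range(copies)]

    def scrub(self):
        repaired = 0
        unrecoverable = []
        for i, want in enumerate(self.expected):
            # Look for any copy whose checksum matches the stored one.
            good = next((d[i] for d in self.disks if checksum(d[i]) == want), None)
            if good is None:
                unrecoverable.append(i)  # every copy is bad
                continue
            # Rewrite each copy that failed verification.
            for d in self.disks:
                if checksum(d[i]) != want:
                    d[i] = good
                    repaired += 1
        return repaired, unrecoverable
```

Corrupting one side of the mirror and calling `scrub()` repairs that copy from the other side; corrupting both sides of the same block leaves it reported as unrecoverable.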

notimetorelax · on March 29, 2013

I see. From the description it's quite obvious that ZFS handles most of the fault cases automatically. Still, admins might want to force it to recheck all the blocks, and this command achieves exactly that. Frankly, I don't know what exactly fsck does, and I guess it's different for every FS. For ZFS, scrubbing seems to be an adequate alternative.

Hence, if anyone asks you whether ZFS has fsck, it's easier to say yes, it does, and it's called scrubbing.