[Tux3] Comparison to Hammer fs design
Daniel Phillips
phillips at phunq.net
Thu Jul 24 13:26:27 PDT 2008
I read Matt Dillon's Hammer filesystem design with interest:
http://apollo.backplane.com/DFlyMisc/hammer01.pdf
Link kindly provided by pgquiles. The big advantage Hammer has over
Tux3 is that it is up and running and released in the DragonFly distro.
The biggest disadvantage is that it runs on BSD, not Linux, and it so
heavily implements functionality that is provided by the VFS and block
layer in Linux that a port would be far from trivial. It will likely
happen eventually, but probably in about the same timeframe that we can
get Tux3 up and stable.
Tux3 is a simpler design than Hammer as far as I can see, and stays
closer to our traditional ideas of how a filesystem behaves; for
example, there is no requirement for a background process to run
continuously through the filesystem, reblocking it to recover space.
Tux3 does have the notion of followup metadata passes that "promote"
logical forward log changes into physical changes to btree nodes and so
on, but this does not have to be a daemon: it can just be something
that happens every so many write transactions, in the context of the
process that did the write. Avoiding daemons in filesystems is good -
each one needs special attention to avoid deadlock, and they clutter up
the ps list, a minor but esthetic consideration.
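To make that concrete, here is a minimal sketch of the idea in C, with
made-up names and an arbitrary interval; it is not Tux3 code, just an
illustration of doing the rollup in the writing process's own context
instead of a daemon:

#include <stdio.h>

#define ROLLUP_INTERVAL 64           /* illustrative, not a Tux3 constant */

struct delta_log {
    unsigned writes_since_rollup;
    /* ... queued logical (forward log) changes would live here ... */
};

static void promote_log_to_btree(struct delta_log *log)
{
    /* Stand-in for applying the logged changes to btree nodes on disk. */
    printf("rollup after %u writes\n", log->writes_since_rollup);
    log->writes_since_rollup = 0;
}

/* Called at the end of each write, in the writing process's context. */
static void finish_write(struct delta_log *log)
{
    if (++log->writes_since_rollup >= ROLLUP_INTERVAL)
        promote_log_to_btree(log);
}

int main(void)
{
    struct delta_log log = { 0 };
    for (int i = 0; i < 200; i++)
        finish_write(&log);          /* rollup fires every 64th write */
    return 0;
}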
Matt hit on a similar idea to versioned pointers, that is, his birth and
death version numbers for disk records. So we both saw independently
that recursive copy on write as in WAFL, ZFS and Btrfs is suboptimal.
I found that only the birth version is actually required, simply because
file data elements never actually die; they are only ever overwritten or
truncated away. Therefore, a subsequent birth always implies the death
of the previous data element, and only the birth version has to be
stored, which I simply call the "version". Data element death by
truncate is handled by the birth of a new (versioned) size attribute.
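A minimal sketch of the birth-version-only scheme, assuming a simple
linear version chain for the moment (illustrative C, not Tux3 code):
the element visible at a given version is just the one with the
greatest birth version that is not later than it.

#include <stdio.h>

struct extent {
    unsigned version;        /* birth version ("the version") */
    unsigned long block;     /* physical location of the data */
};

/* Return the extent visible at target_version, or NULL if none. */
static struct extent *lookup(struct extent *list, int count,
                             unsigned target_version)
{
    struct extent *best = NULL;
    for (int i = 0; i < count; i++)
        if (list[i].version <= target_version &&
            (!best || list[i].version > best->version))
            best = &list[i];
    return best;
}

int main(void)
{
    /* The same logical block, written in versions 1 and 3. */
    struct extent elems[] = { { 1, 1000 }, { 3, 2000 } };
    struct extent *e = lookup(elems, 2, 2);
    printf("version 2 sees block %lu\n", e ? e->block : 0); /* 1000 */
    return 0;
}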
Eventually Matt should realize that too, and rev Hammer to improve its
storage efficiency. Oddly, Hammer only seems to support a linear chain
of versions, whereas I have shown that with no increase in the size of
metadata (except for the once-per-volume version tree) you can store
writable versions with arbitrary parentage. I think Matt should take
note of that too and incorporate it in Hammer.
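Extending the sketch to arbitrary parentage, assuming the once-per-volume
version tree is available as a simple parent array (again illustrative,
not Tux3 code): the element visible at a target version is the one whose
birth version is the nearest ancestor of the target in the version tree,
found by walking up from the target toward the root.

#include <stdio.h>

#define NVERSIONS 8
/* parent[v] is the parent of version v in the version tree; -1 = root. */
static int parent[NVERSIONS] = { -1, 0, 1, 1, 0, 4, 4, 2 };

struct extent {
    int version;             /* birth version */
    unsigned long block;     /* physical location of the data */
};

static struct extent *lookup(struct extent *list, int count, int target)
{
    /* Walk up the version tree from the target toward the root;
       the first birth version found on the path is the nearest ancestor. */
    for (int v = target; v != -1; v = parent[v])
        for (int i = 0; i < count; i++)
            if (list[i].version == v)
                return &list[i];
    return NULL;
}

int main(void)
{
    /* Blocks written in versions 1 and 4; version 3 descends from 1. */
    struct extent elems[] = { { 1, 1000 }, { 4, 2000 } };
    struct extent *e = lookup(elems, 2, 3);
    printf("version 3 sees block %lu\n", e ? e->block : 0); /* 1000 */
    return 0;
}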
Some aspects of the Hammer design seem quite inefficient, so I wonder
what he means when he says it performs really well. In comparison to
what? Well, I don't have a lot more to say about that until Tux3 gets
to the benchmark stage, and then we will be benchmarking mainly against
Ext3, XFS and Btrfs.
Matt seems somewhat cavalier about running out of space on small
volumes, whereas I think a filesystem should scale all the way from a
handful of megabytes to at least terabytes, and preferably petabytes.
The heavy use of a vacuum-like reblocking process seems undesirable to me.
I like my disk lights to go out as soon as the data is safely on the
platter, not continue flashing for minutes or hours after every period
of activity. Admittedly, I did contemplate something similar for
ddsnap, to improve write efficiency. I now think that fragmentation
can be held down to a dull roar without relying on a defragger, and
that defragging should only be triggered at planned times by an
administrator. We will see what happens in practice.
Tux3 has a much better btree fanout than Hammer: 256 versus Hammer's 64,
using the same size 4K btree index blocks. Fanout is an important
determinant of the K in O(log(N)) btree performance, which turns out to
be very important when comparing different filesystems, all of which are
theoretically O(log(N)), but some of which have an inconveniently
large K (ZFS comes to mind). I always try to make the fanout as high
as possible in my btrees, which for example is a major reason that the
HTree index for Ext3 performs so well.
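As a back-of-envelope illustration of what fanout buys (the entry sizes
implied by the numbers above are roughly 16 bytes versus 64 bytes per
entry in a 4K block; that part is my arithmetic, not taken from either
codebase):

#include <stdio.h>
#include <math.h>

/* Rough btree depth: ceil(log(entries) / log(fanout)). */
static int depth(double entries, double fanout)
{
    return (int)ceil(log(entries) / log(fanout));
}

int main(void)
{
    double entries = 1e9;    /* say, a billion indexed items */
    printf("fanout 256: %d levels\n", depth(entries, 256));  /* 4 */
    printf("fanout  64: %d levels\n", depth(entries, 64));   /* 5 */
    return 0;
}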
Actually, I think I can boost the Tux3 inode table btree fanout up to
512 by using a slightly different format for the next-to-terminal
inode table index blocks, with 16 bits of inum and 48 bits of leaf block
address per entry, because at the near-terminal index nodes the inum
space has already been divided down to a small range.
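A rough sketch of what such a packed entry might look like (names and
packing layout are illustrative, not actual Tux3 structures): 16 + 48
bits fit in 8 bytes, so a 4K index block holds 4096 / 8 = 512 entries.

#include <stdint.h>
#include <stdio.h>

/* One next-to-terminal index entry packed into 64 bits:
   16 bits of inum in the high bits, 48 bits of leaf block below. */
typedef uint64_t compact_index_entry;

static compact_index_entry pack(uint16_t inum, uint64_t block)
{
    return ((uint64_t)inum << 48) | (block & 0xffffffffffffULL);
}

static uint16_t entry_inum(compact_index_entry e)  { return e >> 48; }
static uint64_t entry_block(compact_index_entry e) { return e & 0xffffffffffffULL; }

int main(void)
{
    compact_index_entry e = pack(0x1234, 0x123456789abcULL);
    printf("inum 0x%x, block 0x%llx\n", entry_inum(e),
           (unsigned long long)entry_block(e));
    printf("entries per 4K block: %zu\n", 4096 / sizeof(compact_index_entry));
    return 0;
}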
More comments about Hammer later as I learn more about it.
Regards,
Daniel