[Tux3] Comparison to Hammer fs design
Daniel Phillips
phillips at phunq.net
Thu Jul 24 13:26:27 PDT 2008
I read Matt Dillon's Hammer filesystem design with interest:
http://apollo.backplane.com/DFlyMisc/hammer01.pdf
Link kindly provided by pgquiles. The big advantage Hammer has over
Tux3 is that it is up and running and released in the DragonFly distro.
The biggest disadvantage is that it runs on BSD, not Linux, and it so
heavily implements functionality that is provided by the VFS and block
layer in Linux that a port would be far from trivial. It will likely
happen eventually, but probably in about the same timeframe that we can
get Tux3 up and stable.
Tux3 is a simpler design than Hammer as far as I can see, and stays
closer to our traditional ideas of how a filesystem behaves; for
example, there is no requirement for a background process to run
continuously through the filesystem, reblocking it to recover space.
Tux3 does have the notion of followup metadata passes that "promote"
logical forward log changes into physical changes to btree nodes and so
on, but this does not have to be a daemon: it can just be something
that happens every so many write transactions, in the context of the
process that did the write. Avoiding daemons in filesystems is good -
each one needs special attention to avoid deadlock, and they clutter up
the ps list, a minor but esthetic consideration.
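To make that concrete, here is a minimal sketch of the idea in C, with
made-up names and an arbitrary interval; it is not Tux3 code, just an
illustration of doing the rollup in the writing process's own context
instead of a daemon:

#include <stdio.h>

#define ROLLUP_INTERVAL 64           /* illustrative, not a Tux3 constant */

struct delta_log {
    unsigned writes_since_rollup;
    /* ... queued logical (forward log) changes would live here ... */
};

static void promote_log_to_btree(struct delta_log *log)
{
    /* Stand-in for applying the logged changes to btree nodes on disk. */
    printf("rollup after %u writes\n", log->writes_since_rollup);
    log->writes_since_rollup = 0;
}

/* Called at the end of each write, in the writing process's context. */
static void finish_write(struct delta_log *log)
{
    if (++log->writes_since_rollup >= ROLLUP_INTERVAL)
        promote_log_to_btree(log);
}

int main(void)
{
    struct delta_log log = { 0 };
    for (int i = 0; i < 200; i++)
        finish_write(&log);          /* rollup fires every 64th write */
    return 0;
}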
Matt hit on a similar idea to versioned pointers, that is, his birth and
death version numbers for disk records. So we both saw independently
that recursive copy on write as in WAFL, ZFS and Btrfs is suboptimal.
I found that only the birth version is actually required, simply because
file data elements never actually die; they are only ever overwritten or
truncated away. Therefore, a subsequent birth always implies the death
of the previous data element, and only the birth version has to be
stored, which I simply call the "version". Data element death by
truncate is handled by the birth of a new (versioned) size attribute.
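A minimal sketch of the birth-version-only scheme, assuming a simple
linear version chain for the moment (illustrative C, not Tux3 code):
the element visible at a given version is just the one with the
greatest birth version that is not later than it.

#include <stdio.h>

struct extent {
    unsigned version;        /* birth version ("the version") */
    unsigned long block;     /* physical location of the data */
};

/* Return the extent visible at target_version, or NULL if none. */
static struct extent *lookup(struct extent *list, int count,
                             unsigned target_version)
{
    struct extent *best = NULL;
    for (int i = 0; i < count; i++)
        if (list[i].version <= target_version &&
            (!best || list[i].version > best->version))
            best = &list[i];
    return best;
}

int main(void)
{
    /* The same logical block, written in versions 1 and 3. */
    struct extent elems[] = { { 1, 1000 }, { 3, 2000 } };
    struct extent *e = lookup(elems, 2, 2);
    printf("version 2 sees block %lu\n", e ? e->block : 0); /* 1000 */
    return 0;
}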
Eventually Matt should realize that too, and rev Hammer to improve its
storage efficiency. Oddly, Hammer only seems to support a linear chain
of versions, whereas I have shown that with no increase in the size of
metadata (except for the once-per-volume version tree) you can store
writable versions with arbitrary parentage. I think Matt should take
note of that too and incorporate it in Hammer.
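Extending the sketch to arbitrary parentage, assuming the once-per-volume
version tree is available as a simple parent array (again illustrative,
not Tux3 code): the element visible at a target version is the one whose
birth version is the nearest ancestor of the target in the version tree,
found by walking up from the target toward the root.

#include <stdio.h>

#define NVERSIONS 8
/* parent[v] is the parent of version v in the version tree; -1 = root. */
static int parent[NVERSIONS] = { -1, 0, 1, 1, 0, 4, 4, 2 };

struct extent {
    int version;             /* birth version */
    unsigned long block;     /* physical location of the data */
};

static struct extent *lookup(struct extent *list, int count, int target)
{
    /* Walk up the version tree from the target toward the root;
       the first birth version found on the path is the nearest ancestor. */
    for (int v = target; v != -1; v = parent[v])
        for (int i = 0; i < count; i++)
            if (list[i].version == v)
                return &list[i];
    return NULL;
}

int main(void)
{
    /* Blocks written in versions 1 and 4; version 3 descends from 1. */
    struct extent elems[] = { { 1, 1000 }, { 4, 2000 } };
    struct extent *e = lookup(elems, 2, 3);
    printf("version 3 sees block %lu\n", e ? e->block : 0); /* 1000 */
    return 0;
}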
Some aspects of the Hammer design seem quite inefficient, so I wonder
what he means when he says it performs really well. In comparison to
what? Well, I don't have a lot more to say about that until Tux3 gets
to the benchmark stage, and then we will be benchmarking mainly against
Ext3, XFS and Btrfs.
Matt seems somewhat cavalier about running out of space on small
volumes, whereas I think a filesystem should scale all the way from a
handful of megabytes to at least terabytes, and preferably petabytes.
The heavy use of a vacuum-like reblocking process seems undesirable to me.
I like my disk lights to go out as soon as the data is safely on the
platter, not continue flashing for minutes or hours after every period
of activity. Admittedly, I did contemplate something similar for
ddsnap, to improve write efficiency. I now think that fragmentation
can be held down to a dull roar without relying on a defragger, and
that defragging should only be triggered at planned times by an
administrator. We will see what happens in practice.
Tux3 has a much better btree fanout than Hammer: 256 versus Hammer's 64,
using the same size 4K btree index blocks. Fanout is an important
determinant of the K in O(log(N)) btree performance, which turns out to
be very important when comparing different filesystems, all of which are
theoretically O(log(N)), but some of which have an inconveniently
large K (ZFS comes to mind). I always try to make the fanout as high
as possible in my btrees, which for example is a major reason that the
HTree index for Ext3 performs so well.
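As a back-of-envelope illustration of what fanout buys (the entry sizes
implied by the numbers above are roughly 16 bytes versus 64 bytes per
entry in a 4K block; that part is my arithmetic, not taken from either
codebase):

#include <stdio.h>
#include <math.h>

/* Rough btree depth: ceil(log(entries) / log(fanout)). */
static int depth(double entries, double fanout)
{
    return (int)ceil(log(entries) / log(fanout));
}

int main(void)
{
    double entries = 1e9;    /* say, a billion indexed items */
    printf("fanout 256: %d levels\n", depth(entries, 256));  /* 4 */
    printf("fanout  64: %d levels\n", depth(entries, 64));   /* 5 */
    return 0;
}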
Actually, I think I can boost the Tux3 inode table btree fanout up to
512 by using a slightly different format for the next-to-terminal
inode table index blocks, with 16 bits of inum and 48 bits of leaf block
address per entry, because at the near-terminal index nodes the inum
space has already been divided down to a small range.
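A rough sketch of what such a packed entry might look like (names and
packing layout are illustrative, not actual Tux3 structures): 16 + 48
bits fit in 8 bytes, so a 4K index block holds 4096 / 8 = 512 entries.

#include <stdint.h>
#include <stdio.h>

/* One next-to-terminal index entry packed into 64 bits:
   16 bits of inum in the high bits, 48 bits of leaf block below. */
typedef uint64_t compact_index_entry;

static compact_index_entry pack(uint16_t inum, uint64_t block)
{
    return ((uint64_t)inum << 48) | (block & 0xffffffffffffULL);
}

static uint16_t entry_inum(compact_index_entry e)  { return e >> 48; }
static uint64_t entry_block(compact_index_entry e) { return e & 0xffffffffffffULL; }

int main(void)
{
    compact_index_entry e = pack(0x1234, 0x123456789abcULL);
    printf("inum 0x%x, block 0x%llx\n", entry_inum(e),
           (unsigned long long)entry_block(e));
    printf("entries per 4K block: %zu\n", 4096 / sizeof(compact_index_entry));
    return 0;
}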
More comments about Hammer later as I learn more about it.
Regards,
Daniel