[Tux3] Five kinds of btrees
Daniel Phillips
phillips at phunq.net
Fri Jul 25 02:39:10 PDT 2008
Me thinking out loud, which I tend to do better when I know people will
be reading it...
Tux3 has five different kinds of btrees at the moment, all to be
accessed by the same generic btree code. I thought I would list them
now and consider the differences, as a prelude to writing the various
methods needed to specialize to the different variants.
I tend to be obsessive about packing things into even binary sized
chunks if I can, thinking that processors like to operate on data that
way. It probably does not matter a whole lot, but it satisfies me
esthetically and if it gains a tiny amount of speed, that is nice.
Well, maybe it does actually matter. A whole lot of cpu time is spent
binary searching btree index blocks, and if the index entries line up
on nice boundaries then the L1 cache hit rate goes up.
A more important consideration is for the btree index node entries to be
compact and achieve a high fanout. Except for directory btrees (which
are mapped into the logical blocks of files and can therefore get away
with 24 bit logical block pointers) Tux3 index nodes have 16 bytes per
entry, generally consisting of a 48 bit key, 48 bit address of a lower
level index node or leaf, and free space hints to round things out to
even numbers.
In two cases (file index and atime table) I could not think of anything
useful to round out the twelve bytes of essential payload, so I call
that unused and probably will eventually relax my obsession with binary
sized data and make those entries 12 bytes long to improve the fanout
by 25%.
Kinds of btrees, in pidgin C:
inode table
node: { inum:48 node:48 freehint:32 }[] // 16 bytes/entry
node: { inum:10 node:48 freehint:6 }[] // 8 bytes/entry variant
leaf: attributes
file index
node: { offset:48 node:48 unused:32 }[] // 16 bytes/entry... maybe 12?
leaf: { block:48 count:6 version:10 }[] // 8 bytes/entry
free extents
node: { block:48 node:48 freehint:32 }[] // 16 bytes/entry
leaf: { block:48 count:16 }[] // 8 bytes/entry
atime table
node: { inum:48 node:48 unused:32 }[] // 16 bytes/entry... maybe 12?
leaf: { version:10 ablock:48 unused:6 }[] // 8 bytes/entry
ablock: { date:32 }[] // 4 bytes
directory
node: { hash:31 cont:1 node:32 }[] // 8 bytes
leaf: { hash:31 dblock:32 }[] // 8 bytes
dblock: ext2 dirents
For those who have not lived and breathed btrees (most of us) I should
state the definition of a btree node: N block pointers separated by N-1
keys. There is one less key than block pointers because there is
always a higher level btree block that supplies the key that separates
two btree nodes. Anyway, a vector of [pointer, key] pairs makes a
btree index, with one of the keys being unused. I sometimes fiendishly
coopt that unused field for something else, even though this saves
only .2% of the space in a block. I will try to resist that temptation
this time, and maybe even always fill in that unused field with a copy
of the higher level key and use it for an internal cross check.
Free hints give an idea of the density and maximum size of the free
space on the volume in the case of the free tree, or density in case of
the inode table. This allows me to avoid having a secondary structure
like Ext3's inode bitmap in order to locate free inodes. Instead we
look in a part of the inode table with a suitable density, and search
through the leaf block for the free inode we know is there.
Free space in PHTree directories is handled by an entirely different
method described in the original Tux3 post (which I should eventually
repost here with improvements). The free hint idea cannot be used here
because there are potentially multiple terminal index blocks referring
to any given dirent block, and no way to find the referers other than
the one we used to reach the block. Fortunately, the non-sparse
logical space of the directory file lends itself to mapping by a radix
tree, which as I mentioned earlier can map freespace for about half a
million dirents with a single 4K block, lazily updated.
There are two different kinds of index nodes for the inode btree: the 16
byte kind and the 8 byte kind, the latter being useful deeper in a huge
inode table where the btree indexes have cut the inode space up pretty
small so that the low ten bits or so of the inum is a unique index.
The free hint can also be smaller. The thing is, there should end up
being a lot more of the eight byte kind of inode index node because
they are on the bushy end of the tree, so this should be a significant
time and space saver and well worth the little bit of extra code to
implement it.
The file index btree is almost as important to optimize as the inode
table btree, because while there may not be many huge files, they are
often things like databases and virtual machine root filesystems that
require good performance. So I try to keep the terminal index extents
to 8 bytes including the version, which means that any gaps between
extents for a particular version have to be filled in explicitly with
extents of empty space. This should be the rare case. I don't know
how well this is going to work out. It would obviously be easier to
put a logical offset in each versioned extent. I will probably try to
code it the 8 byte way and see if the problem yields easily.
The atime table is a unwanted stepchild of ancient unix. It is hard to
feel much love for it, but there it is, it has its own unique structure
because the atimes have to be versioned. There is not much point in
packing atimes into the leaf blocks like attributes are packed into the
inode table. The leaves of the atime index tree are actually
nonterminals because they reference blocks of atime dates. Hmm, I
suppose those could be versioned extents of atimes, which would make
the atime index really small but then raises the tough question of how
big to make the extent when somebody accesses an inode. For now, no
extents there.
I might as well specify the structures for the non-extent prototype
while I am in here.
inode table
node: { inum:48 node:48 }[]
leaf: attributes
file index
node: { offset:48 node:48 }[]
leaf: { block:48 version:10 }[]
free extents
use bitmaps instead
atime table
forget it
directory
straight up ext2 dirent blocks, linearly searched
Considerably simpler, no? And yet it should prove the concept of this
filesystem versioning method nicely.
The prototype will lift the btree leaf code for file indexes directly
from ddsnap, including the leaf format that I have come to hate, but
does the job. I will probably improve the leaf format shortly after
and backport that into ddsnap.
It seems kind of odd, but Ext2 dirent blocks may well end up as the
final directory leaf format for Tux3 because the prime requirement is
that dirents not move around (physical stability) which lets out all
the fancy in-leaf indexing schemes I know of. It would be kind of nice
if that little piece of Ext2 survived, because Tux3 pretty much tosses
out all the rest of it.
The inode table leaf format is new territory, somewhat described in the
Tux3 project announcement. The details are worth a post in their own
right.
The generic btree will probably get coded over the weekend, including
methods for the inode table and file index, all in userspace for the
time being.
Regards,
Daniel
_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3
More information about the Tux3
mailing list