[Tux3] Five kinds of btrees

Fri Jul 25 02:39:10 PDT 2008

Me thinking out loud, which I tend to do better when I know people will 
be reading it...

Tux3 has five different kinds of btrees at the moment, all to be 
accessed by the same generic btree code.  I thought I would list them 
now and consider the differences, as a prelude to writing the various 
methods needed to specialize to the different variants.

I tend to be obsessive about packing things into even binary sized 
chunks if I can, thinking that processors like to operate on data that 
way.  It probably does not matter a whole lot, but it satisfies me 
esthetically and if it gains a tiny amount of speed, that is nice.
Well, maybe it does actually matter.  A whole lot of cpu time is spent 
binary searching btree index blocks, and if the index entries line up 
on nice boundaries then the L1 cache hit rate goes up.

A more important consideration is for the btree index node entries to be
compact and achieve a high fanout.  Except for directory btrees (which 
are mapped into the logical blocks of files and can therefore get away 
with 24 bit logical block pointers) Tux3 index nodes have 16 bytes per 
entry, generally consisting of a 48 bit key, 48 bit address of a lower 
level index node or leaf, and free space hints to round things out to 
even numbers.

In two cases (file index and atime table) I could not think of anything 
useful to round out the twelve bytes of essential payload, so I call 
that unused and probably will eventually relax my obsession with binary 
sized data and make those entries 12 bytes long to improve the fanout 
by 25%.

Kinds of btrees, in pidgin C:

inode table
  node: { inum:48 node:48 freehint:32 }[] // 16 bytes/entry
  node: { inum:10 node:48 freehint:6 }[] // 8 bytes/entry variant
  leaf: attributes

file index
  node: { offset:48 node:48 unused:32 }[] // 16 bytes/entry... maybe 12?
  leaf: { block:48 count:6 version:10 }[] // 8 bytes/entry

free extents
  node: { block:48 node:48 freehint:32 }[] // 16 bytes/entry
  leaf: { block:48 count:16 }[] // 8 bytes/entry

atime table
  node: { inum:48 node:48 unused:32 }[] // 16 bytes/entry... maybe 12?
  leaf: { version:10 ablock:48 unused:6 }[] // 8 bytes/entry
  ablock: { date:32 }[] // 4 bytes

directory
  node: { hash:31 cont:1 node:32 }[] // 8 bytes
  leaf: { hash:31 dblock:32 }[] // 8 bytes
  dblock: ext2 dirents

For those who have not lived and breathed btrees (most of us) I should 
state the definition of a btree node: N block pointers separated by N-1 
keys.  There is one less key than block pointers because there is 
always a higher level btree block that supplies the key that separates 
two btree nodes.  Anyway, a vector of [pointer, key] pairs makes a 
btree index, with one of the keys being unused.  I sometimes fiendishly 
coopt that unused field for something else, even though this saves 
only .2% of the space in a block.  I will try to resist that temptation 
this time, and maybe even always fill in that unused field with a copy 
of the higher level key and use it for an internal cross check.

Free hints give an idea of the density and maximum size of the free 
space on the volume in the case of the free tree, or density in case of 
the inode table.  This allows me to avoid having a secondary structure 
like Ext3's inode bitmap in order to locate free inodes.  Instead we 
look in a part of the inode table with a suitable density, and search 
through the leaf block for the free inode we know is there.

Free space in PHTree directories is handled by an entirely different 
method described in the original Tux3 post (which I should eventually 
repost here with improvements).  The free hint idea cannot be used here 
because there are potentially multiple terminal index blocks referring 
to any given dirent block, and no way to find the referers other than 
the one we used to reach the block.  Fortunately, the non-sparse 
logical space of the directory file lends itself to mapping by a radix 
tree, which as I mentioned earlier can map freespace for about half a 
million dirents with a single 4K block, lazily updated.

There are two different kinds of index nodes for the inode btree: the 16 
byte kind and the 8 byte kind, the latter being useful deeper in a huge 
inode table where the btree indexes have cut the inode space up pretty 
small so that the low ten bits or so of the inum is a unique index.  
The free hint can also be smaller.  The thing is, there should end up 
being a lot more of the eight byte kind of inode index node because 
they are on the bushy end of the tree, so this should be a significant 
time and space saver and well worth the little bit of extra code to 
implement it.

The file index btree is almost as important to optimize as the inode 
table btree, because while there may not be many huge files, they are 
often things like databases and virtual machine root filesystems that 
require good performance.  So I try to keep the terminal index extents 
to 8 bytes including the version, which means that any gaps between 
extents for a particular version have to be filled in explicitly with 
extents of empty space.  This should be the rare case.  I don't know 
how well this is going to work out.  It would obviously be easier to 
put a logical offset in each versioned extent.  I will probably try to 
code it the 8 byte way and see if the problem yields easily.

The atime table is a unwanted stepchild of ancient unix.  It is hard to 
feel much love for it, but there it is, it has its own unique structure 
because the atimes have to be versioned.  There is not much point in 
packing atimes into the leaf blocks like attributes are packed into the 
inode table.  The leaves of the atime index tree are actually 
nonterminals because they reference blocks of atime dates.  Hmm, I 
suppose those could be versioned extents of atimes, which would make 
the atime index really small but then raises the tough question of how 
big to make the extent when somebody accesses an inode.  For now, no 
extents there.

I might as well specify the structures for the non-extent prototype 
while I am in here.

inode table
  node: { inum:48 node:48 }[]
  leaf: attributes

file index
  node: { offset:48 node:48 }[]
  leaf: { block:48 version:10 }[]

free extents
  use bitmaps instead

atime table
  forget it

directory
  straight up ext2 dirent blocks, linearly searched

Considerably simpler, no?  And yet it should prove the concept of this 
filesystem versioning method nicely.

The prototype will lift the btree leaf code for file indexes directly 
from ddsnap, including the leaf format that I have come to hate, but 
does the job.  I will probably improve the leaf format shortly after 
and backport that into ddsnap.

It seems kind of odd, but Ext2 dirent blocks may well end up as the 
final directory leaf format for Tux3 because the prime requirement is 
that dirents not move around (physical stability) which lets out all 
the fancy in-leaf indexing schemes I know of.  It would be kind of nice 
if that little piece of Ext2 survived, because Tux3 pretty much tosses 
out all the rest of it.

The inode table leaf format is new territory, somewhat described in the 
Tux3 project announcement.  The details are worth a post in their own 
right.

The generic btree will probably get coded over the weekend, including 
methods for the inode table and file index, all in userspace for the 
time being.

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3