[Tux3] Inode table inode expansion now working
Daniel Phillips
phillips at phunq.net
Sat Aug 23 04:09:50 PDT 2008
We have arrived at a milestone of sorts: Tux3 can now search the inode
table btree for a free inode given a target, creating new inode table
blocks as necessary in a fairly reasonable way. Not quite perfect yet
because I think that the current logic leaves a hole where some nasty
pattern of inode creates could result in a lot of blocks with only one
inode in them. I have in mind what to do about that and will add a
refinement tomorrow (mask the inode split boundaries down to boundaries
at some binary granularity so random splits tend to land at the same
place).
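Purely to illustrate that masking idea (this is not actual Tux3 code, and
the function and parameter names are made up), rounding a proposed split
key down to a binary boundary looks something like this:

#include <stdio.h>
#include <stdint.h>

/* Round a proposed split inode number down to a 2^bits boundary so that
   nearby random splits tend to land on the same boundary. */
static uint64_t mask_split(uint64_t proposed, unsigned bits)
{
	return proposed & ~(((uint64_t)1 << bits) - 1);
}

int main(void)
{
	/* Two nearby proposed splits collapse onto one boundary */
	printf("%llu\n", (unsigned long long)mask_split(1003, 6)); /* 960 */
	printf("%llu\n", (unsigned long long)mask_split(1017, 6)); /* 960 */
	return 0;
}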
It is pretty cool what we now have: a btree of variable sized inode
objects accessible at btree speed, whatever that means. Let's
quantify it.
When Ext3 accesses an inode on disk, it directly calculates the position
of the containing inode table block using arithmetic based on a
predetermined number of inode table blocks per allocation group. Tux3
instead gets there by probing through a few levels of a btree. The
branching factor of the btree is a little less than 256 (depends on the
cleverness of the splitting algorithm as Antonio Vargas and I touched
on briefly). Without immediate file data and without versioning, a
Tux3 inode on disk should take somewhat less than the 128 bytes Ext3
uses, a lot less than Ext4's 256-byte inodes, and smaller still compared
to ZFS's 512-byte "dnodes". Inode allocation policy will try to ensure
that most inode table blocks are nearly full, so call it 7 bits worth
of packing at the inode table leaves. Inode table index pointers are
16 bytes in the trivial, first cut design, and later may be 12 bytes
or 8 bytes. Anyway, at 16 bytes per pointer the branching factor is 8
bits (just under 256-way). Two index levels give 16 bits, and the roughly
7 bits of leaf branching makes 23 bits in all, less a little for slack
space in index nodes. Two index levels easily handle a million inodes. If
the root is kept pinned in memory, then we just have one extra index
lookup vs Ext3 at that number of inodes. If we keep the last inode
accessed pinned in memory then we eliminate even that one extra lookup
most of the time and run at essentially the same speed as Ext3, but
with way more flexibility. This is nice.
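To make that arithmetic concrete, here is a back-of-envelope calculation
assuming 4KB blocks (the block size is my assumption, not stated above):
16-byte index pointers give 256-way (8-bit) branching per index level,
and nearly full leaves give roughly 7 bits of inodes per leaf block:

#include <stdio.h>

int main(void)
{
	unsigned fanout = 4096 / 16;	/* 256-way index branching, 8 bits */
	unsigned per_leaf = 128;	/* ~7 bits of inodes in a full leaf */

	/* Two index levels over inode table leaves: 8 + 8 + 7 = 23 bits */
	unsigned long long reach = (unsigned long long)fanout * fanout * per_leaf;
	printf("two index levels reach about %llu inodes\n", reach);
	return 0;
}

That comes to a bit over eight million inodes before slack is subtracted,
comfortably beyond the million mentioned above.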
So what advantage does Tux3 gain from variable sized inodes? To start
with, Tux3 does not have any choice in the matter because of attribute
versioning: changes to a multiply snapshotted inode attribute require
multiple versions of the attribute to be stored in the inode table
block. But there are a couple of other big advantages. Tux3 will only
store attributes that are actually used. If, for example, an inode has
no extended attributes, then no space at all is taken for them.
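As a sketch of what "only store attributes that are used" can mean on
disk (an invented tagged encoding for illustration only, not the actual
Tux3 format), the size of an inode is simply the sum of the attributes
it carries:

#include <stdio.h>
#include <stddef.h>

enum attr_tag { ATTR_MODE = 1, ATTR_OWNER, ATTR_SIZE, ATTR_XATTR };

/* One-byte tag followed by a payload whose size depends on the tag */
static size_t attr_payload(enum attr_tag tag)
{
	switch (tag) {
	case ATTR_MODE:  return 2;	/* mode bits */
	case ATTR_OWNER: return 8;	/* uid + gid */
	case ATTR_SIZE:  return 8;	/* file size */
	case ATTR_XATTR: return 16;	/* reference to xattr storage */
	}
	return 0;
}

static size_t inode_bytes(const enum attr_tag *attrs, unsigned count)
{
	size_t bytes = 0;
	for (unsigned i = 0; i < count; i++)
		bytes += 1 + attr_payload(attrs[i]);
	return bytes;
}

int main(void)
{
	/* An inode with no extended attributes pays nothing for them */
	enum attr_tag plain[] = { ATTR_MODE, ATTR_OWNER, ATTR_SIZE };
	printf("plain inode: %zu bytes\n", inode_bytes(plain, 3));
	return 0;
}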
Shapor has suggested that there be per-directory default uid, gid and
mode attributes, so any file with exactly those attributes does not have
to represent ownership at all, but inherits it from the inode table
block it lives in. Allocation policy will be such that its neighbours
are likely to have identical ownership. This can only be an advantage
if inodes are variable sized, and then it appears to be a significant
win.
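A minimal sketch of how that inheritance could look at lookup time, with
hypothetical types and field names (not the actual Tux3 data structures):
if the decoded inode carries no owner attribute, the per-directory
defaults apply:

#include <stdio.h>
#include <stdint.h>

struct owner { uint32_t uid, gid; };

struct decoded_inode {
	int has_owner;		/* set only if an owner attribute was stored */
	struct owner owner;
};

static struct owner effective_owner(const struct decoded_inode *inode,
				    const struct owner *dir_default)
{
	return inode->has_owner ? inode->owner : *dir_default;
}

int main(void)
{
	struct owner dir_default = { 1000, 1000 };
	struct decoded_inode inode = { 0 };	/* no owner attribute stored */
	struct owner got = effective_owner(&inode, &dir_default);
	printf("uid %u gid %u (inherited)\n", (unsigned)got.uid, (unsigned)got.gid);
	return 0;
}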
Inodes not in use are represented by zero or two bytes, depending on
whether they separate in-use inodes or not. Again, this can only be
an advantage with variable sized inodes, and then it is a really big
advantage.
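Restating that space claim as arithmetic, under my assumption that each
run of free inodes sitting between in-use inodes costs a two-byte gap
marker while trailing free inodes cost nothing:

#include <stdio.h>

int main(void)
{
	unsigned in_use = 100;		/* in-use inodes in one leaf */
	unsigned interior_gaps = 10;	/* free runs between in-use inodes */
	unsigned avg_inode = 64;	/* "somewhat less than 128 bytes" */

	/* Unused inodes cost at most two bytes per run, often nothing */
	printf("leaf payload: %u bytes\n", in_use * avg_inode + interior_gaps * 2);
	return 0;
}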
In Tux3, there are no empty inode table blocks, only empty spaces in
the btree. So I actually think that in the end, Tux3 will handily
beat Ext3 in inode table lookup speed by needing to read fewer blocks,
and Ext3 is pretty much the champ in that regard.
Tomorrow I will wax poetic about inode attribute groups, which have
just risen to the top of the list of things to implement.
Regards,
Daniel