[Tux3] More about the free tree
Daniel Phillips
phillips at phunq.net
Sat Aug 16 02:37:51 PDT 2008
The biggest bitmap
What is the largest volume we can map with a bitmap? On a 32 bit
system, it is 2^32 page cache blocks * 2^12 bytes per bitmap block *
2^3 bits per byte * 2^12 bytes mapped per bit = 2^59 bytes. One bit
short of the promised exabyte maximum Tux3 volume size. The limitation
is the size of the page cache index on a 32 bit system; on a 64 bit
system there is no such limitation.
Even if a filesystem is created on a 64 bit system, it should still be
readable on a 32 bit system. For now we could simply avoid bitmap
allocation for the part of a volume beyond half an exabyte, falling
back to the free extent tree described in the previous free tree post.
Not that anybody is likely to test that in the near future!
Another fix is to allocate two kernel mappings, one for the lower half
of the bitmap table and the other for the upper half. Or we could be
lame and just declare that Tux3 is limited to half an exabyte on 32 bit
systems. There is plenty of time to solve this little issue, I just
thought I would mention it now for completeness.
More on bitmap recursion
Putting the bitmap into a file is kind of an exciting thing to do, not
necessarily in the good sense of exciting. Flushing the bitmap to disk
will occasionally cause allocations, either of physical blocks to hold
newly instantiated bitmap blocks (first bit set in a block that was
previously a hole in the bitmap file) or, more rarely, of nodes and
leaf blocks for the file index btree. So after flushing the bitmap to
disk
we may find that some buffered blocks of the bitmap are dirty again as
a result of allocations performed during the flush. Hmm. One could
possibly construct a case where arbitrarily many flushes are required
to get all of the bitmap buffers into a clean state.
Preallocating the entire bitmap to avoid this bizarre effect is not an
option if I intend to go through with my plan to combine a sparse
bitmap with an extent tree to get the best of both allocation methods.
So what I am going to do is think about this issue a little before
doing anything design wise or code wise.
When I first started talking about Tux3, Maciej (lurking here on the
list) waxed poetic about the virtues of using actual files to represent
filesystem objects that happen to resemble files, and I expressed a
general fear of strange recursions down that path. Well here we have
a nice example of one of those. And I am going to press on and deal
with it. So I guess Maciej wins that point... provided this
self-dirtying behaviour can be tamed elegantly.
Filesystem bootstrap
Creating a new Tux3 filesystem requires allocating a number of objects,
including objects involved in allocation. Another nice recursion
there: you have to allocate space for objects, but the block allocator
is not initialized yet. This is typically handled by laying out all
those initial allocations by hand, using ad hoc fully written out code
in a special mkfs.<filesystem> utility. Ddsnap does this, and though
we tried hard, that initial create code looks ugly and is unpleasant to
work on. I do not doubt that most filesystems have this issue.
Tux3 has a nice solution now: because of the nice separation between
cached buffer mappings and on disk representation, the allocator can be
initialized just by creating a new inode, which itself does not require
any block allocations. If you look at the main routine in inode.c you
see:
* Create the device, device buffer map and superblock descriptors
* Initialize the kernel buffer cache emulation
* Create the bitmap inode and store it in the superblock
* Block allocator is now functional
Initialization continues by allocating the root of the inode table, at
which point it is possible to create the root directory as a normal
file create. The test code then goes on to create a file in the root
directory, flush various maps to disk and so on. This is satisfactory
indeed, and is likely to strongly resemble the production code.
Allocation strategy
The allocator in balloc.c currently implements a simple, cyclic
strategy where the next allocation goal is always immediately after
the last one, with wraparound to the bottom of the volume. This is
a horribly deficient way to do things. For one thing, it could take
a very long time to find a free block if all blocks in a region happen
to be allocated or the filesystem is nearly full. For another, mixing
together deletes and creates causes extreme fragmentation with this
strategy. Not as bad as completely random allocation, but close to
it. A production version of Tux3 needs a reasonably sophisticated
allocation strategy in order to be at all usable with rotating media.
I am going to be returning to this question over and over again as
Tux3 matures. For today, Tux3 has enough allocation machinery to show
itself as a prototype, but not to survive head to head benchmarking in
good style.
Regards,
Daniel
_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3