[Tux3] Page cache emulation now up and running
Daniel Phillips
phillips at phunq.net
Sun Aug 10 04:14:13 PDT 2008
The tux3 code is starting to get a little more interesting now.
Another significant part of the kernel vfs machinery is now emulated:
the page cache. In userland the cache does not have to care about
pages, so it works with buffers instead, and rather than adding a
special index field we just reinterpret the block field as an index.
This is the way the
Linux kernel does it anyway (actually I dimly recall that change came
about in Linux as a response to the way I handled directory block
access in htree) but this is obscured by the confusion between block
and page units of granularity. The kernel uses a page cache with a
direct device mapping to implement the buffer cache, while my code
uses the buffer cache to implement the equivalent of the kernel page
cache. The latter is a much cleaner way to do things, but is not an
option for the kernel until we get around to generalizing page size.
Anyway, both in the kernel and in my emulation, the block interface is
just the venerable BSD standard getblk and bread, with writing done by
marking buffers dirty and flushing them out en masse periodically.
Easy to understand and get right.
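
Roughly, the emulated interface comes down to something like the sketch
below; the names and fields here are illustrative guesses, not the
exact tux3 code.

struct map;                      /* per-inode mapping: a hash of buffers */

struct buffer {
	struct map *map;         /* the mapping this buffer belongs to */
	unsigned state;          /* empty, clean, dirty, ... */
	unsigned size;           /* block size in bytes */
	long long block;         /* logical block, doubles as the cache index */
	void *data;
};

struct buffer *getblk(struct map *map, long long block); /* find or create */
struct buffer *bread(struct map *map, long long block);  /* getblk plus read on miss */
void set_buffer_dirty(struct buffer *buffer);            /* writing just dirties the buffer */
void brelse(struct buffer *buffer);                      /* drop a reference */
int flush_buffers(struct map *map);                      /* write out all dirty buffers */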
This page cache emulation effort is about making directory operations
happen. We want to scan sequentially through an entire directory file
looking for an entry or a place to create a new one. So we need a
notion of what sequential access means. One way is to walk an inode btree,
but that is not the way the kernel does things, and that would be a
problem when it comes time to port. What the kernel does is read each
block of a directory file into a page cache the first time somebody
accesses it, and after that, the cached block can be accessed very
efficiently without needing to read it again or having to go messing
around in the filesystem metadata. This is really, really powerful.
You do not want to go trying to reimplement that kind of caching at
the filesystem level, because you will just end up with a lot of code
that does not do nearly as good a job as the kernel page cache.
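
To make that concrete, a name lookup can just walk the directory block
by block through the cache. Here is an illustrative sketch built on the
buffer interface above; the struct inode fields and find_entry() are
made-up stand-ins for the real directory code.

/* Scan a directory through the block cache.  bread() touches the
 * filesystem metadata only on a cache miss; after that the block comes
 * straight out of the cache. */
void *dir_lookup(struct inode *dir, const char *name, unsigned len)
{
	unsigned blocksize = 1 << dir->blockbits;
	long long blocks = (dir->size + blocksize - 1) / blocksize;

	for (long long block = 0; block < blocks; block++) {
		struct buffer *buffer = bread(dir->map, block);
		if (!buffer)
			return NULL;            /* I/O error */
		void *entry = find_entry(buffer->data, blocksize, name, len);
		if (entry)
			return entry;           /* caller still holds the buffer */
		brelse(buffer);
	}
	return NULL;                            /* not found */
}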
The way the kernel gets a block into the page cache for the first time
is to call the filesystem's get_block method to do the logical to
physical mapping for each block backing a page cache page, or more
precisely, it invokes a filesystem method to read a page, which calls a
kernel library function that does a callback to the filesystem-supplied
get_block function. Pretty well all Linux filesystems use that
library function, so we might as well think of this as the vfs calling
the filesystem get_block method by a twisty path.
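
For concreteness, the callback in question is get_block, and a typical
readpage method of this era is nothing but a trampoline into a generic
library helper; ext2, for example, does essentially this:

/* From include/linux/fs.h: map logical block iblock of the inode onto a
 * physical block in bh_result, allocating it if create is set. */
typedef int (get_block_t)(struct inode *inode, sector_t iblock,
			struct buffer_head *bh_result, int create);

/* The filesystem's ->readpage just hands its get_block to the helper. */
static int ext2_readpage(struct file *file, struct page *page)
{
	return mpage_readpage(page, ext2_get_block);
}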
Tux3 is going to do things a little differently. Instead of using that
twisty library function, it will just go get the page that the vfs is
asking for, eliminating a whole mess of calls back and forth, and
in theory doing things somewhat more efficiently by being able to look
up all the blocks for a page at once instead of doing a separate
get_block for each one. In practice, Linux filesystem blocksize almost
always matches the hardware page size (or else performance will suck)
so there is only one get_block call per page. If we ever get around to
properly supporting huge pages then this will matter a lot. For now it
just feels clean.
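
As a sketch of the intent only (tux3_map_extent and tux3_read_blocks
are invented names, and the real code may end up looking quite
different), the idea is something like:

/* Map every block backing the page with a single probe of the file
 * btree, then read them, instead of one get_block call per block. */
static int tux3_readpage(struct file *file, struct page *page)
{
	struct inode *inode = page->mapping->host;
	unsigned blockbits = inode->i_blkbits;
	unsigned count = PAGE_CACHE_SIZE >> blockbits;
	sector_t start = (sector_t)page->index << (PAGE_CACHE_SHIFT - blockbits);
	sector_t physical[PAGE_CACHE_SIZE >> 9];   /* worst case: 512 byte blocks */
	int err;

	/* One btree probe maps all the blocks backing this page. */
	err = tux3_map_extent(inode, start, count, physical);
	if (err)
		return err;

	/* Submit the reads and unlock the page on completion (omitted). */
	return tux3_read_blocks(page, physical, count);
}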
Now I need tux3_readblock that gets called for any file cache miss, and
tux3_writeblock to flush dirty blocks to disk. These are nearly there.
In the case of readblock, this is just a rearrangement of the
responsibilities of filemap_readblock and the existing tuxread. The
filemap_readblock
method will probe the file btree to find the physical block before
calling diskread, and tuxread, instead of directly doing the btree
probe as it does now, will just call bread on the inode mapping.
Therefore, tuxread is about to get a whole lot more efficient because
only the first access hits the filesystem metadata. Just like in the
kernel.
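
In sketch form, with guessed signatures (btree_probe and the sb fields
stand in for the real metadata code, and the mapping is assumed to
point back at its inode):

/* Cache-miss handler: probe the file btree for the physical block, then
 * read it from disk. */
int filemap_readblock(struct buffer *buffer)
{
	struct inode *inode = buffer->map->inode;
	long long physical = btree_probe(inode, buffer->block);
	if (physical < 0)
		return -EIO;
	return diskread(inode->sb->devfd, buffer->data, buffer->size,
			physical << inode->sb->blockbits);
}

/* tuxread no longer touches metadata itself: bread() serves the block
 * from cache and calls filemap_readblock() only on a miss. */
int tuxread(struct inode *inode, long long block, void *data, unsigned len)
{
	struct buffer *buffer = bread(inode->map, block);
	if (!buffer)
		return -EIO;
	memcpy(data, buffer->data, len);
	brelse(buffer);
	return 0;
}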
The new behavior for tuxwrite will be even nicer: it is now just going
to do a getblk in the filemap hash, transfer data onto it and mark the
buffer dirty. No filesystem metadata will be touched until it is time
to flush dirty blocks to disk. This is "delayed allocation", which is
usually a big feature that gets added to a filesystem some time late in
its life, if ever, but it just comes for free with the tux3 approach.
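
Sketched out, again with guessed signatures:

/* No btree probe, no allocation up front: just get a cache buffer, copy
 * the data in and dirty it.  Blocks get allocated later, when the dirty
 * buffers are flushed out. */
int tuxwrite(struct inode *inode, long long block, const void *data, unsigned len)
{
	struct buffer *buffer = getblk(inode->map, block);
	if (!buffer)
		return -ENOMEM;
	memcpy(buffer->data, data, len);
	set_buffer_dirty(buffer);
	brelse(buffer);
	return 0;
}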
Regards,
Daniel