Block Handles, continued

Fri Aug 17 18:36:36 PDT 2012

Data Handle Functionality

Reference Counting

The lifetime of a cached block image is controlled by reference counting, provided by
getblk, bread and brelse for the traditional buffer cache, and in the Tux3 userspace
buffer library by blockget, blockread and blockput. Blockget (getblk) and blockread
(bread) increment a per-buffer refcount while blockput (brelse) decrements it. Only
a buffer with zero refcount may be evicted from cache.
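
As a minimal sketch of the per-buffer counting described above (not the actual Tux3
or kernel code), the rule can be modeled as follows. The struct layout and the names
buffer_get, buffer_put and buffer_evictable are hypothetical, chosen only for
illustration.

```c
#include <assert.h>

/* Hypothetical, simplified buffer: only the fields this sketch needs. */
struct buffer {
	unsigned long index;	/* block index cached by this buffer */
	unsigned count;		/* references currently held */
};

/* Take a reference, as blockget/blockread (getblk/bread) would. */
static struct buffer *buffer_get(struct buffer *buffer)
{
	buffer->count++;
	return buffer;
}

/* Drop a reference, as blockput (brelse) would. */
static void buffer_put(struct buffer *buffer)
{
	assert(buffer->count > 0);	/* catch unbalanced puts */
	buffer->count--;
}

/* Only a buffer with zero refcount may be evicted from cache. */
static int buffer_evictable(const struct buffer *buffer)
{
	return buffer->count == 0;
}
```

Every get must be balanced by exactly one put; a buffer that is never released stays
pinned in cache indefinitely.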

The new (proposed) block handles API does not provide reference counting of individual
blocks, because a block is not the fundamental cache unit. Rather, several blocks
share a page and all blocks cached on a given page must have zero reference count
before the page (and thus all the blocks cached on it) may be evicted from cache.

The new block handles API provides reference counting like the traditional buffer API,
but not of individual blocks. Instead, the reference count of the underlying cache page
is incremented and decremented. This saves a little memory and corresponds to the
reality of how blocks are grouped together in a page cache. This approach also removes
a layer of confusion. The Linux __brelse function has the commentary:

   "If all buffers against a page have zero reference count, are clean and unlocked,
   and if the page is clean and unlocked then try_to_free_buffers() may strip the
   buffers from the page in preparation for freeing it"

And to make matters worse, the semantics of try_to_free_buffers are difficult to
understand. The new block handles API hopes to bypass much of this unhelpful
complexity.

Block Indexing

The main purpose of the buffer API is to index a cached block given a block index,
which might be either a physical block on disk or a logical block of a file
(the difference between physical and logical cache mapping lies entirely in the
read/write methods supplied to transfer a cached block from or to disk). Therefore,
a getblk or bread takes a physical block index as a parameter and remembers it in
the buffer_head structure in order to flush blocks to disk. This model serves
reasonably well for primitive filesystems such as Ext2 that lack transactional
semantics and do not provide recovery guarantees, but does not work well for Tux3,
where flushing a dirty block to the same address it was read from would nearly
always amount to a recovery guarantee violation.

Where modern Linux filesystems do use the traditional buffer API, the functionality
of the physical block number embedded in the buffer_head is superseded by the page
cache index of the page to which the buffer is attached by means of a circular list.
So the traditional buffer->block_nr no longer does anything useful.

The new block handles API references a given cache block by a handle object that
contains the address of a page and the index number of one of the blocks on that page.
A block handle fills roughly the same role as a (buffer_head *) in the traditional
buffer API: a handle is returned by blockget and blockread instead of a pointer to
a buffer, and that handle may be used later to release the block or to access its
data, or to lock or unlock it.
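
To make the handle idea concrete, here is a minimal userspace sketch, assuming a
4 KiB page holding 1 KiB blocks. The cache_page struct is a hypothetical stand-in
for a real cache page, and handle_get, handle_put, handle_data and page_evictable
are illustrative names, not part of any existing API. The sketch shows a handle
naming a block by page and index, with the reference count kept on the page rather
than on the block.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed geometry for this sketch: 4 KiB pages, 1 KiB blocks. */
enum { PAGE_BYTES = 4096, BLOCK_BITS = 10 };
enum { BLOCKS_PER_PAGE = PAGE_BYTES >> BLOCK_BITS };

/* Hypothetical stand-in for a cache page, not the kernel's struct page. */
struct cache_page {
	unsigned count;			/* shared by every block on the page */
	unsigned char data[PAGE_BYTES];
};

/* A handle names one block: the page plus the block's index within it. */
struct handle {
	struct cache_page *page;
	unsigned i;
};

/* Return a handle for block i of the page, pinning the whole page. */
static struct handle handle_get(struct cache_page *page, unsigned i)
{
	assert(i < (unsigned)BLOCKS_PER_PAGE);
	page->count++;
	return (struct handle){ .page = page, .i = i };
}

/* Release the block; this only drops the page's reference count. */
static void handle_put(struct handle handle)
{
	assert(handle.page->count > 0);
	handle.page->count--;
}

/* Locate the block's data inside its page. */
static void *handle_data(struct handle handle)
{
	return handle.page->data + ((size_t)handle.i << BLOCK_BITS);
}

/* The page, and every block on it, is evictable only at zero count. */
static int page_evictable(const struct cache_page *page)
{
	return page->count == 0;
}
```

Because the handle is two words, it can be passed and returned by value, and no
per-block object needs to be allocated just to hold a reference.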

User Space Implementation

The block handles API relies on the existence of a page cache. Our current userspace
implementation of Tux3 does not in fact emulate a page cache in user space, nor
would there be much advantage to that except to make the userspace code more similar
to kernel code. It is for consideration whether we want to add an emulation of the
kernel page cache to the userspace code, or as seems more reasonable, continue to use
the existing buffer cache implementation but with modifications to bring it closer to
the new handle API. For example, we may wish to write:

    #ifdef __KERNEL__
        typedef struct handle { ... } handle;
    #else
        typedef struct buffer *handle;
    #endif

Kernel Implementation

For an initial kernel implementation I propose to place the entire handle state
array in the page.private field of struct page, so we have:

    struct handle { struct page *page; unsigned i; };

As discussed above, this limits the possible blocksize/pagesize combinations, but
still supports our common case usage. It simplifies the implementation by removing
the need to manage some external object to hold per block state. We can change
this design later with only local effect. So for now, I would like to err on the
side of simplicity and leave the extension to full generality for later. (This is
for comment.)
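
As a rough sketch of why keeping all per-block state in a single word limits the
blocksize/pagesize combinations, suppose each block needs two bits of state. The
two-bit width and the names state_get and state_set are assumptions made purely for
illustration; the proposal above does not fix a state encoding. An unsigned long
stands in for the page.private field.

```c
#include <assert.h>
#include <limits.h>

/* Assumed width of per-block state; the real encoding is undecided. */
enum { STATE_BITS = 2 };
enum { STATE_MASK = (1 << STATE_BITS) - 1 };

/* Most blocks per page whose state fits in one unsigned long. */
enum { MAX_BLOCKS = sizeof(unsigned long) * CHAR_BIT / STATE_BITS };

/* Read the state of block i from the packed word. */
static unsigned state_get(unsigned long word, unsigned i)
{
	assert(i < (unsigned)MAX_BLOCKS);
	return (word >> (i * STATE_BITS)) & STATE_MASK;
}

/* Return the packed word with the state of block i replaced. */
static unsigned long state_set(unsigned long word, unsigned i, unsigned state)
{
	unsigned shift = i * STATE_BITS;

	assert(i < (unsigned)MAX_BLOCKS && state <= (unsigned)STATE_MASK);
	word &= ~((unsigned long)STATE_MASK << shift);
	return word | ((unsigned long)state << shift);
}
```

Under these assumptions a 64-bit word covers at most 32 blocks per page, so a 4 KiB
page with 512-byte blocks (8 blocks) fits easily, while very large pages with very
small blocks would not.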