[Tux3] Design note: Buffer forking

Thu Jan 8 04:16:02 PST 2009

Buffer forking is a technique I proposed to allow dirty metadata blocks 
in cache to be further modified while a version that belongs to an 
earlier delta is still in flight to disk.  It is a kind of copy on 
write: the memory backing a buffer is swapped out for a copy that can 
be modified without affecting the original.

In contrast to block redirect which copies data from one buffer to 
another, buffer forking affects a single buffer.  We could say buffer 
forking splits a buffer in time, leaving a read-only copy in an earlier 
delta and a writable copy in the current delta.

Buffer forking is a front end operation, while block redirect is a back 
end operation.

Buffer forking was conceived as a method of avoiding data dependencies 
between cache layers that would cause pipeline stalls.  Even though our 
immediate plans are to run deltas one at a time, so that there is only 
a single cache layer, buffer forking is still useful as a means of 
dealing with the allocate-during-flush problem that comes up when 
flushing bitmaps to disk.

Buffer forking takes place inside the blockdirty(buffer) function.  The 
idea is simple:

    if the buffer is dirty in an earlier delta
       allocate a new buffer
       swap the buffer's backing memory with the new buffer
       remove the old buffer from the earlier delta list
       insert the new buffer on the earlier delta list
    set the buffer dirty in the current delta
    insert the buffer on the current delta list

Our immediate use case is flushing bitmaps.  For now, this will happen 
at delta transition time.  We have a list of bitmaps that are dirty in 
the current delta.  We increment the delta counter, so the current 
delta becomes the committing delta, and move the list.  Now we walk the 
commit list and allocate or reallocate physical addresses for each 
block (map_region).  If an allocation sets a bit in a bitmap block that 
is dirty in the committing delta, that block will be forked and become 
dirty in the current delta, leaving a read only copy on the commit 
delta list.  Finally, we walk the committing list again, initiating the 
write transfer.

What could be simpler?  Well, to be honest, a lot of things could be 
simpler.  But this is a pretty simple algorithm compared to trying to 
handle all the cases of allocate-in-allocate some other way.  Just for 
fun, try it!

Forking is easy to implement in userspace: each buffer has a pointer to 
memory for a block, and we just change the pointer.  In kernel, there 
are complications:

  * Multiple buffers are packed onto a single page of memory.

  * Multiple tasks may access a buffer asynchronously.

So we want to replace a page which may be shared by, say, four buffers, 
each of which may be clean, dirty, in process of being read or written, 
or not in use at all.  Is it even possible?  I think it does work, just 
barely.

When we want to fork a buffer, we start by bringing its underlying page 
uptodate (read_mapping_page).  This gets rid of the otherwise very 
nasty case of read IO that might be in progress on one or more other 
blocks on the same page.  After the read_mapping_page, each buffer on 
the page is either clean, dirty in the current delta, or dirty in some 
previous delta.  We have to look at the cases.

  Other dirty buffer on same page in current delta:

     Not possible, if that was the case this page would already have
     been forked

  Other dirty buffers on same page in earlier delta:

     After the page copy, all other dirty buffers are now dirty in the
     current delta and may be freely modified thereafter.  They were
     forked "for free".  They are either not being accessed, or being
     read, in which case it is ok for the reading code to continue to
     read the copy of the buffer on the old page.  If the accessing
     code then decides to write, the buffer is already forked.

  Clean buffers being read by filesystem code:

     The data being read is unchanged by the copy, as for a dirty
     buffers in earlier delta being accessed for read.

Fork algorithm with multiple blocks per page:

  Allocate a new page and put buffers on it
  Use read_mapping_page to bring the full page uptodate
  Take the page lock (protects the buffer list)
  Swap the pages (set b_page for each buffer, exchange page->private)
  Walk the two buffer lists together:
    If the old buffer is dirty in a previous delta
       Set it dirty in the current delta
       Move it from the earlier delta list to the current delta list
       Add the new buffer to the earlier delta list
  Release the page lock

Users of blocks that can fork have to take some extra care.  The address 
of the buffer data can change asynchronously, but not if the buffer is 
dirty in the current delta.  A user that reads buffer data must be 
aware that two calls to bufdata(buffer) may return different addresses.  
Usually there is only one call to bufdata(buffer) per blockread(buffer) 
so that is not a big change.  A user that first reads buffer data then 
writes to it will call blockdirty(buffer), and then it must get the 
data pointer from bufdata(buffer) again.  This requires changes in a 
few places in our code.

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3