[Tux3] Design note: Btree locking

Daniel Phillips phillips at phunq.net
Wed Nov 26 18:45:00 PST 2008

(From now on I will make a habit of design notes as such so we can 
easily find them later to write proper design documentation.)

SMP locking is needed right about now, since we are now in kernel, and 
we are more or less completely broken until we have it.  We will start 
with very crude, robust and easy to implement btree locking, then 
iteratively improve it as we go to reduce SMP contention.

   Style 0: No locking at all, as now.  Tux3 will work reliably only
      in a uniprocessor kernel (CONFIG_SMP=n).

   Style 1: rwsem per btree.  Simple and effective.  To read a btree
      block, take a read lock on the rwsem in the btree root struct,
      to change the block, take a write lock.  On SMP there will be
      significant contention under many loads, particularly on the inode
      table btree.

   Style 2: Crabbing.  On a btree probe, take the buffer lock on each
      successive block in the patch and release the block on the parent
      in the patch after getting the new lock.  To split, repeat the
      probe but do not release the parent locks.  An improvement on this
      is to remember the highest parent block that might be split from
      the first attempt and only continue to hold parent locks from that
      point down.  This fairly easy strategy will give a huge
      improvement in concurrency.  A further improvement is to use
      rwsemaphores so that other readers do not have to wait while
      reading a child block from disk.

   Style 3: Cursors.  This is a suggestion from the fertile mind of Matt
      Dillon, as interpreted by me.  Each btree_path (I would like to
      rename this as struct cursor) holds long term shared locks on its
      upper N levels.  We can therefore read those index blocks with no
      locking at all, and because each cursor level points at a cached
      block, we do not have to probe a radix tree to find the block. 
      Now the interesting part is what we do when a block pinned by a
      cursor has to be split or merged.  There are various workable
      stategies including RCU.  I will leave decisions on these details
      for later, since this is a little ways off.

By the time we get to style 3 we will be scalling well into the 
dozens-of-cpus zone.  Even style 2 will work pretty well on multi-core 
machines that people actually have.



Tux3 mailing list
Tux3 at tux3.org

More information about the Tux3 mailing list