[Tux3] Design note: Basic SMP locking

Mon Dec 29 04:24:19 PST 2008

First we write the code, then we do the design.

Well, generally it is a better idea to do things in the other order, but 
in the case of SMP locking I had a general idea what needed doing and 
wanted to see how some of the code looked before finishing the design 
note.  Then the trial patch made such a big stability improvement I did 
another one... then another one... until coverage was complete as far 
as I could see.  So now I can write about what was done, and we can 
think about what holes I left, if any.

Locking design is a slippery subject.  There is no way to prove the 
absence of race conditions, only their presence.  We have to think 
about every bit of data that Tux3 owns in kernel and make sure all of 
it is either unshared, read only or covered by some lock whenever it is 
accessed.

So we start by enumerating everything a Tux3 filesystem has in memory:

  - Superblock fields

  - Inode table btree in buffer cache

  - Cached inodes with attributes
      - Extended attribute cache (xcache)

  - File index btrees in buffer cache

  - Data blocks of special files in page cache:
      - Allocation bitmap
      - Directories
      - Atom table

That is it.  Pretty simple, really.  We will add a few more animals to 
this menagerie over time, for example a hash to accelerate xattr atom 
lookup and some bookkeeping for atomic merge, but the main structures 
have already arrived.

Now we go through each and ask "what, if anything, already serializes 
access to this object?"  If the answer is "nothing" or "not always" 
then we need to add serialization.  The following are taken care of for 
us:

  VFS superblock:

     - Everything in it is read-only or protected by the VFS
       with lock_super() (sb->s_lock mutex)

     - VFS superblock points to the Tux3 superblock, which contains
       fields not protected by the VFS superblock locking

  Directories:

      - Directory entry blocks are taken care of for us by VFS taking
        ->i_mutex of the directory inode for all namespace operations.

      - Directory file btrees are not protected, see below.

   Extended attributes

      - sys_removexattr and sys_setxattr take the inode's i_mutex

      - getxattr and listxattr have no smp locking

      - It would probably be better if VFS did not take any locks for
        the xattr-modifying operations either and leave that to the
        the generic xattr function, but oh well.

That is all the help we get for free.  Everything else needs some 
intervention.  Hirofumi ran some fsx-linux tests and found that Tux3 
did not last long before hitting one of its self-checking asserts, and 
that when this happened, tracing output from multiple different disk 
operations would be interleaved together in the log.  A sure sign of a 
race.  (In fact, symptoms of races are typically far stranger: not all 
kernel code is loaded with asserts or generates tons of tracing output 
like Tux3 does right now.)

This was not even with an SMP system: it was a single processor 
environment with preemption enabled.  Kernel preemption produces races 
somewhat like SMP, because a kernel task can be preempted between any 
two machine instructions.  This was the official opening of hunting 
season for Tux3 SMP races.

I started with file index btrees.  These are always accessed through our 
get_segs function, which the VFS block IO library calls every time it 
needs to find out the disk location a data block should be written to 
or read from.  So get_segs now takes a shared read lock on the root of 
a btree to do a pure lookup, or an exclusive write lock to modify the 
btree when it records newly assigned disk locations.  With just this 
one small modification checked in, fsx-linux started succeeding on some 
runs.  Still not stable, but a huge improvement.

Thus encouraged, I thought the remaining failures were most probably due 
to races in the inode table.  The inode table is also a btree, so the 
root already had a semaphore due to the data btree work.  Now, 
open_inode takes a read lock and make_inode takes a write lock.

Hirofumi mentioned that truncates where hitting asserts, so I added a 
write lock around the tree_chop.  With this, fsx-linux would run for an 
hour before hitting an assert.

I then took an excursion through extended attribute locking because I 
thought it would be quick, but it was not.  Adding the locking turned 
up some structural defects, and the API I had implemented for get_xattr 
was completely unworkable for SMP - a pointer to the inode's xattr 
cache was returned, and nothing would prevent the cache from changing 
before the caller copied the result to userspace.  This was changed to 
a copy into a buffer supplied by the caller, done inside the lock.  
This time I used a mutex, because even get_xattr can change the atom 
use count, which changes an on-disk table, so a shared read lock would 
not work.  Well, list_xattrs could be done under a read lock, but my 
goals at this point do not have much to do with optimizing xattr 
locking.  Later, we will optimize things so that most xattr operations 
just take spinlocks.

The mutex I used was one that just happened to be lying around - the 
atom table's i_mutex.

After that detour, I returned to what I knew was most likely the cause 
of the remaining races: block allocation.  It was pretty easy to wrap 
balloc and bfree in mutex locking using the allocation table's i_mutex, 
again just because it happened to be lying around unused.  Flushing the 
allocation table to disk can cause more allocations, and I introspected 
for some time about whether this could deadlock on trying to acquire 
the allocation table mutex recursively, and finally satisfied myself 
that the VFS would never flush data to disk while holding the 
allocation tables i_mutex.

With the block allocation properly serialized, fsx-linux would now run 
for a couple of hours, then hit a bug that was not a race.  Which 
Hirofumi fixed today incidentally, but the point is... SMP races seem 
to be pretty much gone now.

I never actually tested any of this SMP locking code.  That is hard to 
do.  I just reasoned about it, and Hirofumi exercised it.  Somebody 
really needs to invent a way to unit test SMP locking.

I then went back to compile notes for this post and look into some 
possible issues for completeness.  One question was the Tux3-specific 
superblock.  There are various fields there for various purposes.  All 
of them were either read-only, or covered by locking on some subsystem.  
For example, the allocation variables are covered by the allocation 
table mutex, because they are only changed inside the locks, and 
reading outside the locks is purely advisory.  There are some 
xattr-related fields in the superblock that are only accessed by code 
inside the xattr locks.

Better SMP locking...  My SMP locking is crude and inefficient, I admit 
it freely.  Particularly for the inode table btree and allocation 
bitmaps, which really need to support a lot of activity in parallel.  
Improving this promises to be fine sport over the next few months.  
There are no end of clever techniques that can be brought to bear.  For 
now, we just want it to work.

Side note: we ended up doing a big refactoring of the way Tux3 
initializes btrees, just to initialize our new semaphore in the btree 
root.  This cleanup was badly needed.

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3