[Tux3] Design note: Basic SMP locking
phillips at phunq.net
Mon Dec 29 04:24:19 PST 2008
First we write the code, then we do the design.
Well, generally it is a better idea to do things in the other order, but
in the case of SMP locking I had a general idea what needed doing and
wanted to see how some of the code looked before finishing the design
note. Then the trial patch made such a big stability improvement I did
another one... then another one... until coverage was complete as far
as I could see. So now I can write about what was done, and we can
think about what holes I left, if any.
Locking design is a slippery subject. There is no way to prove the
absence of race conditions, only their presence. We have to think
about every bit of data that Tux3 owns in kernel and make sure all of
it is either unshared, read only or covered by some lock whenever it is
So we start by enumerating everything a Tux3 filesystem has in memory:
- Superblock fields
- Inode table btree in buffer cache
- Cached inodes with attributes
- Extended attribute cache (xcache)
- File index btrees in buffer cache
- Data blocks of special files in page cache:
- Allocation bitmap
- Atom table
That is it. Pretty simple, really. We will add a few more animals to
this menagerie over time, for example a hash to accelerate xattr atom
lookup and some bookkeeping for atomic merge, but the main structures
have already arrived.
Now we go through each and ask "what, if anything, already serializes
access to this object?" If the answer is "nothing" or "not always"
then we need to add serialization. The following are taken care of for
- Everything in it is read-only or protected by the VFS
with lock_super() (sb->s_lock mutex)
- VFS superblock points to the Tux3 superblock, which contains
fields not protected by the VFS superblock locking
- Directory entry blocks are taken care of for us by VFS taking
->i_mutex of the directory inode for all namespace operations.
- Directory file btrees are not protected, see below.
- sys_removexattr and sys_setxattr take the inode's i_mutex
- getxattr and listxattr have no smp locking
- It would probably be better if VFS did not take any locks for
the xattr-modifying operations either and leave that to the
the generic xattr function, but oh well.
That is all the help we get for free. Everything else needs some
intervention. Hirofumi ran some fsx-linux tests and found that Tux3
did not last long before hitting one of its self-checking asserts, and
that when this happened, tracing output from multiple different disk
operations would be interleaved together in the log. A sure sign of a
race. (In fact, symptoms of races are typically far stranger: not all
kernel code is loaded with asserts or generates tons of tracing output
like Tux3 does right now.)
This was not even with an SMP system: it was a single processor
environment with preemption enabled. Kernel preemption produces races
somewhat like SMP, because a kernel task can be preempted between any
two machine instructions. This was the official opening of hunting
season for Tux3 SMP races.
I started with file index btrees. These are always accessed through our
get_segs function, which the VFS block IO library calls every time it
needs to find out the disk location a data block should be written to
or read from. So get_segs now takes a shared read lock on the root of
a btree to do a pure lookup, or an exclusive write lock to modify the
btree when it records newly assigned disk locations. With just this
one small modification checked in, fsx-linux started succeeding on some
runs. Still not stable, but a huge improvement.
Thus encouraged, I thought the remaining failures were most probably due
to races in the inode table. The inode table is also a btree, so the
root already had a semaphore due to the data btree work. Now,
open_inode takes a read lock and make_inode takes a write lock.
Hirofumi mentioned that truncates where hitting asserts, so I added a
write lock around the tree_chop. With this, fsx-linux would run for an
hour before hitting an assert.
I then took an excursion through extended attribute locking because I
thought it would be quick, but it was not. Adding the locking turned
up some structural defects, and the API I had implemented for get_xattr
was completely unworkable for SMP - a pointer to the inode's xattr
cache was returned, and nothing would prevent the cache from changing
before the caller copied the result to userspace. This was changed to
a copy into a buffer supplied by the caller, done inside the lock.
This time I used a mutex, because even get_xattr can change the atom
use count, which changes an on-disk table, so a shared read lock would
not work. Well, list_xattrs could be done under a read lock, but my
goals at this point do not have much to do with optimizing xattr
locking. Later, we will optimize things so that most xattr operations
just take spinlocks.
The mutex I used was one that just happened to be lying around - the
atom table's i_mutex.
After that detour, I returned to what I knew was most likely the cause
of the remaining races: block allocation. It was pretty easy to wrap
balloc and bfree in mutex locking using the allocation table's i_mutex,
again just because it happened to be lying around unused. Flushing the
allocation table to disk can cause more allocations, and I introspected
for some time about whether this could deadlock on trying to acquire
the allocation table mutex recursively, and finally satisfied myself
that the VFS would never flush data to disk while holding the
allocation tables i_mutex.
With the block allocation properly serialized, fsx-linux would now run
for a couple of hours, then hit a bug that was not a race. Which
Hirofumi fixed today incidentally, but the point is... SMP races seem
to be pretty much gone now.
I never actually tested any of this SMP locking code. That is hard to
do. I just reasoned about it, and Hirofumi exercised it. Somebody
really needs to invent a way to unit test SMP locking.
I then went back to compile notes for this post and look into some
possible issues for completeness. One question was the Tux3-specific
superblock. There are various fields there for various purposes. All
of them were either read-only, or covered by locking on some subsystem.
For example, the allocation variables are covered by the allocation
table mutex, because they are only changed inside the locks, and
reading outside the locks is purely advisory. There are some
xattr-related fields in the superblock that are only accessed by code
inside the xattr locks.
Better SMP locking... My SMP locking is crude and inefficient, I admit
it freely. Particularly for the inode table btree and allocation
bitmaps, which really need to support a lot of activity in parallel.
Improving this promises to be fine sport over the next few months.
There are no end of clever techniques that can be brought to bear. For
now, we just want it to work.
Side note: we ended up doing a big refactoring of the way Tux3
initializes btrees, just to initialize our new semaphore in the btree
root. This cleanup was badly needed.
Tux3 mailing list
Tux3 at tux3.org
More information about the Tux3