[Tux3] Design note: Btree locking
Daniel Phillips
phillips at phunq.net
Wed Nov 26 18:45:00 PST 2008
(From now on I will make a habit of writing design notes like this
one, so we can easily find them later when we write proper design
documentation.)
SMP locking is needed right about now: we are in kernel, and we are
more or less completely broken on SMP until we have it. We will start
with very crude, robust and easy to implement btree locking, then
iteratively improve it to reduce SMP contention.
Style 0: No locking at all, as now. Tux3 will work reliably only
in a uniprocessor kernel (CONFIG_SMP=n).
Style 1: rwsem per btree. Simple and effective. To read a btree
block, take a read lock on the rwsem in the btree root struct;
to change the block, take a write lock. On SMP there will be
significant contention under many loads, particularly on the inode
table btree. A rough sketch of this style follows.
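Here is a minimal sketch of style 1, assuming a lock embedded in the
btree struct. The field and helper names are illustrative, not
current Tux3 code:

    #include <linux/rwsem.h>
    #include <linux/types.h>

    struct btree {
            struct rw_semaphore lock;  /* guards the whole tree */
            u64 root;                  /* block number of root node */
            int depth;                 /* index levels, root included */
    };

    /* Every probe brackets its walk with the shared lock... */
    static inline void btree_read_lock(struct btree *btree)
    {
            down_read(&btree->lock);
    }

    static inline void btree_read_unlock(struct btree *btree)
    {
            up_read(&btree->lock);
    }

    /* ...and anything that changes the tree takes it exclusively. */
    static inline void btree_write_lock(struct btree *btree)
    {
            down_write(&btree->lock);
    }

    static inline void btree_write_unlock(struct btree *btree)
    {
            up_write(&btree->lock);
    }

The attraction is that this is almost impossible to get wrong; the
cost is that all writers to a btree serialize on one semaphore, and
readers stall behind them.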
Style 2: Crabbing. On a btree probe, take the buffer lock on each
successive block in the path, and release the lock on the parent
in the path after getting the new lock. To split, repeat the
probe but do not release the parent locks. An improvement on this
is to remember, from the first attempt, the highest parent block
that might be split, and only continue to hold parent locks from
that point down. This fairly easy strategy will give a huge
improvement in concurrency. A further improvement is to use rw
semaphores, so that other readers do not have to wait while a
child block is read in from disk. The basic crab is sketched below.
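To make the crab concrete, here is a hypothetical probe using kernel
buffer locks and the struct btree fields from the style 1 sketch;
sb_bread() stands in for however we actually map a btree block to a
buffer, and lookup_child() for the binary search within an index
block:

    #include <linux/buffer_head.h>

    u64 lookup_child(struct buffer_head *parent, u64 key);  /* elsewhere */

    /* Lock each child before unlocking its parent, so no writer can
       splice the path out from under us on the way down. */
    static struct buffer_head *probe_crab(struct super_block *sb,
            struct btree *btree, u64 key)
    {
            struct buffer_head *parent, *child;
            int level;

            parent = sb_bread(sb, btree->root);
            if (!parent)
                    return NULL;
            lock_buffer(parent);
            for (level = 0; level < btree->depth; level++) {
                    child = sb_bread(sb, lookup_child(parent, key));
                    if (!child) {
                            unlock_buffer(parent);
                            brelse(parent);
                            return NULL;
                    }
                    lock_buffer(child);       /* take the child... */
                    unlock_buffer(parent);    /* ...then crab off the parent */
                    brelse(parent);
                    parent = child;
            }
            return parent;  /* locked leaf; caller unlocks and releases */
    }

Note that lock_buffer() sleeps, so a reader paging in a cold child
block stalls everybody queued behind that buffer, which is exactly
what the rw semaphore refinement above is meant to fix.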
Style 3: Cursors. This is a suggestion from the fertile mind of Matt
Dillon, as interpreted by me. Each btree_path (I would like to
rename this struct cursor) holds long term shared locks on its
upper N levels. We can therefore read those index blocks with no
locking at all, and because each cursor level points at a cached
block, we do not have to probe a radix tree to find the block.
Now the interesting part is what we do when a block pinned by a
cursor has to be split or merged. There are various workable
strategies, including RCU. I will leave decisions on these details
for later, since this is a little ways off. A sketch of what such
a cursor might carry follows.
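As a rough idea of what such a cursor might carry (all names
hypothetical, and ignoring the split/merge question for now):

    #include <linux/buffer_head.h>

    #define MAX_LEVELS 8    /* deeper than any practical btree */

    struct cursor_level {
            struct buffer_head *buffer;  /* pinned, cached index block */
            int slot;                    /* current entry in that block */
    };

    struct cursor {
            struct btree *btree;
            int pinned;     /* upper levels held under long term shared locks */
            struct cursor_level path[MAX_LEVELS];
    };

Probing with a cursor then means walking path[] instead of relocking
and relooking up from the root; only when a probe leaves the pinned
region does any locking or radix tree lookup happen at all.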
By the time we get to style 3 we will be scaling well into the
dozens-of-cpus zone. Even style 2 will work pretty well on the
multi-core machines that people actually have.
Regards,
Daniel