[Tux3] Design note: Cache Layering Revisited
Daniel Phillips
phillips at phunq.net
Tue Dec 30 01:12:15 PST 2008
In an earlier design note I described the Tux3 cache layering model, and
a pipelining problem related to updating inode table blocks:
http://kerneltrap.org/mailarchive/tux3/2008/11/10/4051294
I proposed a method of deferring filesystem namespace updates until
delta transition, which promises not only to solve the update pipeline
problem, but also to improve the latency of buffered file operations
and reduce contention on the very busy i_mutex locks under
namespace-intensive loads. But this technique is new and untried, and
requires changes to the core kernel. Even though the dentry cache
change is quite small, I would prefer to be able to build as a module
without core kernel patches at this point. And a significant amount of
work could be required to implement this idea, which would distract us
from the immediate task of preparing Tux3 for review.
So we need a workaround. In my earlier note I suggested making
frontend operations wait for delta staging to complete at places where
the frontend would otherwise violate the update pipeline order. But
now I have noticed an easier and more efficient workaround that avoids
those waits, and thus avoids frontend stalls on delta staging.
The only reason a frontend inode create wants to update the inode
table block immediately is to avoid choosing the same inode number for
another create. Another way to accomplish this is to remove the
store_attrs call from make_inode and instead put a newly created inode
onto a defer list, which make_inode consults to avoid assigning the
same inode number twice (a rough sketch follows the list below):
- Frontend create just puts the inode on a list of inodes whose
itable update is deferred; make_inode checks that list when
choosing a new inode number.
- Frontend delete puts the inode on the same list; open_inode checks
the list to verify that the inode being loaded actually exists. This
is just a cross check, because the dirent should be gone at that
point. Leaving the inode attributes in the inode table until after
the delta transition just means that the inode number will not be
reused in the same delta.
- Delta staging applies the deferred inode table updates for creates
and deletes.
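
To make the shape of this concrete, here is a rough user-space sketch.
The names make_inode, open_inode and store_attrs above are the real
ones; everything in the sketch itself (defer_node, defer_inum,
inum_deferred and the stage_* stubs) is invented for illustration and
does not claim to match the actual Tux3 code:

#include <stdbool.h>
#include <stdlib.h>

typedef unsigned long long inum_t;

enum defer_op { DEFER_CREATE, DEFER_DELETE };

struct defer_node {
        inum_t inum;
        enum defer_op op;
        struct defer_node *next;
};

static struct defer_node *defer_list;   /* one list per delta */

/* Frontend create/delete: record the itable change instead of applying it */
static int defer_inum(inum_t inum, enum defer_op op)
{
        struct defer_node *node = malloc(sizeof *node);

        if (!node)
                return -1;
        node->inum = inum;
        node->op = op;
        node->next = defer_list;
        defer_list = node;
        return 0;
}

/* make_inode asks about DEFER_CREATE to avoid reusing an inode number;
   open_inode cross checks DEFER_DELETE to catch a stale itable entry */
static bool inum_deferred(inum_t inum, enum defer_op op)
{
        for (struct defer_node *node = defer_list; node; node = node->next)
                if (node->inum == inum && node->op == op)
                        return true;
        return false;
}

/* Stand-ins for the real itable update paths applied at delta staging */
static void stage_store_attrs(inum_t inum) { (void)inum; }
static void stage_purge_inum(inum_t inum) { (void)inum; }

/* Delta staging, the sole itable updater: apply deferred creates and
   deletes, then start the next delta with an empty list */
static void stage_deferred(void)
{
        struct defer_node *node = defer_list, *next;

        for (; node; node = next) {
                next = node->next;
                if (node->op == DEFER_CREATE)
                        stage_store_attrs(node->inum);
                else
                        stage_purge_inum(node->inum);
                free(node);
        }
        defer_list = NULL;
}

Since delta staging drains the list at each transition, it only ever
holds inodes created or deleted in the current delta.
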
Observations:
- A linear defer list has n^2 behavior: allocating N inodes requires
about N^2 / 2 list node compares, because the k-th allocation scans
the roughly k entries already on the list. Easily solved in various
ways (one possibility is sketched after this list). Even easier to
ignore this until we observe the CPU cost, then fix it.
- Delta staging becomes the only updater of inode table blocks. This
eliminates the update pipeline ordering violation. As a fringe
benefit, the exclusive write lock in make_inode becomes a shared
read lock, improving parallelism of frontend namespace operations.
- Delta staging is able to fork inode table blocks freely, because
these blocks are read-only after staging, and thus read-only in
earlier deltas (more on block forking in an upcoming design note).
- Compared to my earlier proposal to defer inode number assignment,
we assign the inode number immediately on create, but defer
recording it in the inode table. This means that complications
like making NFS handle generation and fstat wait on inode number
assignment are not needed.
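
For the record, one of those "various ways" could look like the
following: hash the deferred inode numbers into buckets so that the
make_inode check walks one short chain instead of the whole list.
Again, this is only an illustration, with made-up names and an
arbitrary bucket count, not a commitment to do it this way:

#include <stdbool.h>
#include <stdlib.h>

typedef unsigned long long inum_t;

#define DEFER_BUCKETS 256       /* power of two, chosen arbitrarily */

struct defer_entry {
        inum_t inum;
        struct defer_entry *next;
};

static struct defer_entry *defer_hash[DEFER_BUCKETS];

/* Cheap multiplicative hash on the inode number */
static unsigned defer_bucket(inum_t inum)
{
        return (inum * 0x9e3779b97f4a7c15ull >> 32) % DEFER_BUCKETS;
}

static int defer_add(inum_t inum)
{
        unsigned bucket = defer_bucket(inum);
        struct defer_entry *entry = malloc(sizeof *entry);

        if (!entry)
                return -1;
        entry->inum = inum;
        entry->next = defer_hash[bucket];
        defer_hash[bucket] = entry;
        return 0;
}

/* The make_inode check becomes an expected O(1) bucket walk */
static bool defer_contains(inum_t inum)
{
        for (struct defer_entry *entry = defer_hash[defer_bucket(inum)];
             entry; entry = entry->next)
                if (entry->inum == inum)
                        return true;
        return false;
}

Whether this is ever worth doing depends on that CPU cost actually
showing up, per the observation above.
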
In short, I am happy that this little piece fell into place in a way
that supports clean cache layer separation, without having to pioneer
the unexplored territory of deferred namespace operations.
Regards,
Daniel