[Tux3] Design note: Cache Layering Revisited
Daniel Phillips
phillips at phunq.net
Tue Dec 30 01:12:15 PST 2008
In an earlier design note I described the Tux3 cache layering model, and
a pipelining problem related to updating inode table blocks:
http://kerneltrap.org/mailarchive/tux3/2008/11/10/4051294
I proposed a method of deferring filesystem namespace updates until
delta transition, which promises not only to solve the update pipeline
problem, but also to improve the latency of buffered file operations
and reduce contention on the very busy i_mutex locks under
namespace-intensive loads. But this technique is new and untried, and
requires changes to the core kernel. Even though the dentry cache
change is quite small, I would prefer to be able to build as a module
without core kernel patches at this point. And a significant amount of
work could be required to implement this idea, which would distract us
from the immediate task of preparing Tux3 for review.
So we need a workaround. In my earlier note I suggested making
frontend operations wait for delta staging to complete at places where
the frontend would otherwise violate the update pipeline order. But
now I have noticed an easier and more efficient workaround that avoids
those waits, and thus avoids frontend stalls on delta staging.
The only reason a frontend inode create wants to update the inode
table block immediately is to avoid choosing the same inode number for
another create. Another way to accomplish this is to remove the
store_attrs call from make_inode and instead put a newly created inode
onto a defer list, which make_inode consults to avoid assigning the
same inode number twice (a rough sketch follows the list below):
- Frontend create just puts the inode on a list of inodes whose
itable update is deferred; make_inode checks that list when
choosing a new inode number.
- Frontend delete puts the inode on the same list; open_inode checks
the list to verify that the inode being loaded actually exists. This
is just a cross check, because the dirent should be gone at that
point. Leaving the inode attributes in the inode table until after
the delta transition just means that the inode number will not be
reused in the same delta.
- Delta staging applies the deferred inode table updates for creates
and deletes.
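
To make the shape of this concrete, here is a rough user-space sketch.
The names make_inode, open_inode and store_attrs above are the real
ones; everything in the sketch itself (defer_node, defer_inum,
inum_deferred and the stage_* stubs) is invented for illustration and
does not claim to match the actual Tux3 code:

#include <stdbool.h>
#include <stdlib.h>

typedef unsigned long long inum_t;

enum defer_op { DEFER_CREATE, DEFER_DELETE };

struct defer_node {
        inum_t inum;
        enum defer_op op;
        struct defer_node *next;
};

static struct defer_node *defer_list;   /* one list per delta */

/* Frontend create/delete: record the itable change instead of applying it */
static int defer_inum(inum_t inum, enum defer_op op)
{
        struct defer_node *node = malloc(sizeof *node);

        if (!node)
                return -1;
        node->inum = inum;
        node->op = op;
        node->next = defer_list;
        defer_list = node;
        return 0;
}

/* make_inode asks about DEFER_CREATE to avoid reusing an inode number;
   open_inode cross checks DEFER_DELETE to catch a stale itable entry */
static bool inum_deferred(inum_t inum, enum defer_op op)
{
        for (struct defer_node *node = defer_list; node; node = node->next)
                if (node->inum == inum && node->op == op)
                        return true;
        return false;
}

/* Stand-ins for the real itable update paths applied at delta staging */
static void stage_store_attrs(inum_t inum) { (void)inum; }
static void stage_purge_inum(inum_t inum) { (void)inum; }

/* Delta staging, the sole itable updater: apply deferred creates and
   deletes, then start the next delta with an empty list */
static void stage_deferred(void)
{
        struct defer_node *node = defer_list, *next;

        for (; node; node = next) {
                next = node->next;
                if (node->op == DEFER_CREATE)
                        stage_store_attrs(node->inum);
                else
                        stage_purge_inum(node->inum);
                free(node);
        }
        defer_list = NULL;
}

Since delta staging drains the list at each transition, it only ever
holds inodes created or deleted in the current delta.
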
Observations:
- A linear defer list has n^2 behavior: allocating N inodes requires
about N^2 / 2 list node compares, because the k-th allocation scans
the roughly k entries already on the list. Easily solved in various
ways (one possibility is sketched after this list). Even easier to
ignore this until we observe the CPU cost, then fix it.
- Delta staging becomes the only updater of inode table blocks. This
eliminates the update pipeline ordering violation. As a fringe
benefit, the exclusive write lock in make_inode becomes a shared
read lock, improving parallelism of frontend namespace operations.
- Delta staging is able to fork inode table blocks freely, because
these blocks are read-only after staging, and thus read-only in
earlier deltas (more on block forking in an upcoming design note).
- Compared to my earlier proposal to defer inode number assignment,
we assign the inode number immediately on create, but defer
recording it in the inode table. This means that complications
like making NFS handle generation and fstat wait on inode number
assignment are not needed.
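
For the record, one of those "various ways" could look like the
following: hash the deferred inode numbers into buckets so that the
make_inode check walks one short chain instead of the whole list.
Again, this is only an illustration, with made-up names and an
arbitrary bucket count, not a commitment to do it this way:

#include <stdbool.h>
#include <stdlib.h>

typedef unsigned long long inum_t;

#define DEFER_BUCKETS 256       /* power of two, chosen arbitrarily */

struct defer_entry {
        inum_t inum;
        struct defer_entry *next;
};

static struct defer_entry *defer_hash[DEFER_BUCKETS];

/* Cheap multiplicative hash on the inode number */
static unsigned defer_bucket(inum_t inum)
{
        return (inum * 0x9e3779b97f4a7c15ull >> 32) % DEFER_BUCKETS;
}

static int defer_add(inum_t inum)
{
        unsigned bucket = defer_bucket(inum);
        struct defer_entry *entry = malloc(sizeof *entry);

        if (!entry)
                return -1;
        entry->inum = inum;
        entry->next = defer_hash[bucket];
        defer_hash[bucket] = entry;
        return 0;
}

/* The make_inode check becomes an expected O(1) bucket walk */
static bool defer_contains(inum_t inum)
{
        for (struct defer_entry *entry = defer_hash[defer_bucket(inum)];
             entry; entry = entry->next)
                if (entry->inum == inum)
                        return true;
        return false;
}

Whether this is ever worth doing depends on that CPU cost actually
showing up, per the observation above.
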
In short, I am happy that this little piece fell into place in a way
that supports clean cache layer separation, without having to pioneer
the unexplored territory of deferred namespace operations.
Regards,
Daniel