[Tux3] Deferred Namespace Operations
phillips at phunq.net
Mon Nov 10 02:20:34 PST 2008
Cache layering is a central idea in the Tux3 atomic update model. The
cache "front end" consists of data blocks in inode page cache, inode
attributes in inode cache and file names in dentry cache. The
cache "back end" consists of cached file index blocks, inode table
blocks, and inode table index blocks. Applications directly modify the
front end cache via syscalls and memory maps, while the back end cache
is modified only by the filesystem during the process of encoding
changes in cache permanently to disk.
The great promise of such a layering is to allow "bumpless" operation.
So long as sufficient cache memory is available and any needed metadata
blocks have been read into cache, the front end does not need to wait
for the back end to complete its work. It just returns immediately to
its caller after updating a few VFS cache objects, without needing to
locate and update cached disk blocks as well.
In broad outline, the concept is simple, clean and compelling. In
practice, there are issues to overcome. First, some background.
Changes made to front end cache are batched into "deltas", where each
delta comprises all the changes required to represent some set of file
operations carried out by the front end, or equivalently, the changes
required to make the filesystem state of the previous delta represent
the cache state as of the new delta.
Each new delta goes through a "setup" step that selects and assigns disk
addresses for updated data blocks, modifies cached index blocks
accordingly, and creates log blocks to specify index block changes
logically. Following setup, the block images of a delta are
transferred to disk. Finally a delta commit block is written to
transition the Tux3 volume atomically from one consistent state to the
When the front end hands off a delta to the back end it may not further
modify any blocks of that delta until disk transfer has completed. The
concept of cache forking was introduced to avoid stalls where the front
end wishes to rewrite a data block already in flight. The in-flight
block is removed from the block cache and replaced by a copy that the
front end can freely modify.
Block forking eliminates front end waiting for file data blocks, and
because directory blocks are just data blocks, namespace operations
should yield to the same treatment. Unfortunately, this is not the
case because some namespace operations need to operate on inode table
blocks as well as directory entry blocks.
Creating a file involves the following steps:
1) Check the directory in which the file will be created to ensure
it does not already exist.
2) Find a suitable inode number that is not already in use.
3) Mark the corresponding inode as used by filling in at least one
4) Add a directory entry linking the filename to its inode number.
Step (3) changes an inode table block, which might already be part of an
in-flight delta. The inode table block could be forked to allow the
front end to modify it, except that delta setup can also change the
block. An easy way to avoid this collision would be to require front
end operations to wait for delta setup to complete, but this could
cause visible front end stalls, particularly if metadata has to be read
into cache during the setup. Reducing these stalls would require fine
grained synchronization, for which a suitable design is not immediately
An alternate, more aggressive approach is to move all responsibility for
updating inode table blocks to the back end. The back end would choose
inode numbers, not the front end, and because directory entries require
inode numbers, the back end would also take responsibility for creating
them. The back end would then also have to handle deletions, because
the front end cannot without, the kind of synchronization we wish to
avoid here. In effect, all namespace changes would be deferred. This
idea is analogous to deferred allocation for data blocks.
At first blush, this seems like a risky proposition in terms of
complexity. But on investigation the logistics do not seem to be too
daunting. The Linux dentry cache is a big help. Even if a physical
directory entry has not been created yet, a dentry cache entry makes it
appear to front end operations that the file exists. The dentry cache
entry also provides a convenient means of communicating names between
the front and back end. On a file create, for example, the front end
checks that the file does not exist, takes a reference count on the
dentry to prevent it from disappearing, then puts it on a list for
deferred creation. Similarly, on delete the front end takes a count
and puts the dentry on a list for deferred deletion.
To be sure, there are issues. Since a new file has no inode number
until delta setup gets around to choosing one, any operation that
exposes an inode number has to be delayed until one is chosen. The
only two operations I know of affected by this are the fstat syscall
(thanks to Hirofumi for pointing that out) and file handle generation
for NFS. NFS sync mounts have to wait on directory writeout anyway,
which hides the wait on delta setup. There will be some additional
latency with NFS async mounts, but these are seldom used as they are
unsafe. NFS v3 provides safe asynchronous writes via the COMMIT rpc,
where only the first write RPC would see additional latency, while the
COMMIT is required to wait for transfer to stable storage, so the
visible effect should be minimal.
Directory listing cannot practically take account of changes made in
cache that have not yet arrived in directory blocks, so to avoid
surprises such as creating a file and failing to see it in a directory
listing immediately after, listing a directory that has deferred
changes would need to force a delta transition and wait for setup
One has to consider carefully what will happen when a file is repeatedly
created and deleted before any phase transition takes place. On
deletion, the presence of a "negative dentry" shows that the file has
indeed been deleted. According to the thought experiments I have
performed, this just works.
The net complexity increase is probably fairly modest, and likely less
than what would be required to implement fine-grained synchronization
of inode table block updates, while promising superior results. With
deferred namespace operations, Tux3 could achieve performance similar
to RAMFS for loads where many small files are created and/or deleted
without exceeding cache capacity. For heavy, namespace intensive
loads, batching up namespace operations will boost performance
measurably. For example, a number of adjacent available inodes could
be located at the same time, and inode attributes for an entire block
entered at once. The common case of file create followed immediately
by write requires only one update of the inode table block instead of
two. A file that is created and deleted before a delta transition
takes place will not only never appear on disk, it will not even appear
in a cached disk block.
There are also attractive new optimization possibilities. We might
observe that a large number of files have been created in a new
directory, and therefore the allocation goal for the directory should
be relatively far from its parent so it has room to grow, or that a
newly created file has had a large amount of data written to it and
should therefore be allocated well clear of its siblings. Deferred
inode number assignment allows the inode to stay close to the file data
in these cases.
On balance, the rewards seem well worth the effort. But this is
definitely new, unexplored territory, with unforeseen issues likely to
arise. If the worst happens, we can always give up on deferred
namespace operations, fall back to the synchronization approach
described above, and improve it incrementally. But I do not think that
will be necessary. It was not long ago that deferred allocation seemed
like a risky proposition, and now every modern filesystem design
includes it. I expect that things will play out the same way with
deferred namespace operations.
 Note the terminology change: what I formerly called a "phase" is now
a "delta", the latter being existing terminology that accurately
describes the concept.
Tux3 mailing list
Tux3 at tux3.org
More information about the Tux3