[Tux3] Deferred Namespace Operations

Mon Nov 10 02:20:34 PST 2008

Cache layering is a central idea in the Tux3 atomic update model.  The 
cache "front end" consists of data blocks in inode page cache, inode 
attributes in inode cache and file names in dentry cache.  The 
cache "back end" consists of cached file index blocks, inode table 
blocks, and inode table index blocks.  Applications directly modify the 
front end cache via syscalls and memory maps, while the back end cache 
is modified only by the filesystem during the process of encoding 
changes in cache permanently to disk.

The great promise of such a layering is to allow "bumpless" operation.  
So long as sufficient cache memory is available and any needed metadata 
blocks have been read into cache, the front end does not need to wait 
for the back end to complete its work.  It just returns immediately to 
its caller after updating a few VFS cache objects, without needing to 
locate and update cached disk blocks as well.

In broad outline, the concept is simple, clean and compelling.  In 
practice, there are issues to overcome.  First, some background.

Changes made to front end cache are batched into "deltas"[1], where each 
delta comprises all the changes required to represent some set of file 
operations carried out by the front end, or equivalently, the changes 
required to make the filesystem state of the previous delta represent 
the cache state as of the new delta.

Each new delta goes through a "setup" step that selects and assigns disk 
addresses for updated data blocks, modifies cached index blocks 
accordingly, and creates log blocks to specify index block changes 
logically.  Following setup, the block images of a delta are 
transferred to disk.  Finally a delta commit block is written to 
transition the Tux3 volume atomically from one consistent state to the 
next.

When the front end hands off a delta to the back end it may not further 
modify any blocks of that delta until disk transfer has completed.  The 
concept of cache forking was introduced to avoid stalls where the front 
end wishes to rewrite a data block already in flight.  The in-flight 
block is removed from the block cache and replaced by a copy that the 
front end can freely modify.

Block forking eliminates front end waiting for file data blocks, and 
because directory blocks are just data blocks, namespace operations 
should yield to the same treatment.  Unfortunately, this is not the 
case because some namespace operations need to operate on inode table 
blocks as well as directory entry blocks.

Creating a file involves the following steps:

  1) Check the directory in which the file will be created to ensure
     it does not already exist.

  2) Find a suitable inode number that is not already in use.

  3) Mark the corresponding inode as used by filling in at least one
     inode attribute.

  4) Add a directory entry linking the filename to its inode number.

Step (3) changes an inode table block, which might already be part of an 
in-flight delta.  The inode table block could be forked to allow the 
front end to modify it, except that delta setup can also change the 
block.  An easy way to avoid this collision would be to require front 
end operations to wait for delta setup to complete, but this could 
cause visible front end stalls, particularly if metadata has to be read 
into cache during the setup.  Reducing these stalls would require fine 
grained synchronization, for which a suitable design is not immediately 
obvious.

An alternate, more aggressive approach is to move all responsibility for 
updating inode table blocks to the back end.  The back end would choose 
inode numbers, not the front end, and because directory entries require 
inode numbers, the back end would also take responsibility for creating 
them.  The back end would then also have to handle deletions, because 
the front end cannot without, the kind of synchronization we wish to 
avoid here.  In effect, all namespace changes would be deferred.  This 
idea is analogous to deferred allocation for data blocks.

At first blush, this seems like a risky proposition in terms of 
complexity.  But on investigation the logistics do not seem to be too 
daunting.  The Linux dentry cache is a big help.  Even if a physical 
directory entry has not been created yet, a dentry cache entry makes it 
appear to front end operations that the file exists.  The dentry cache 
entry also provides a convenient means of communicating names between 
the front and back end.  On a file create, for example, the front end 
checks that the file does not exist, takes a reference count on the 
dentry to prevent it from disappearing, then puts it on a list for 
deferred creation.  Similarly, on delete the front end takes a count 
and puts the dentry on a list for deferred deletion.

To be sure, there are issues.  Since a new file has no inode number 
until delta setup gets around to choosing one, any operation that 
exposes an inode number has to be delayed until one is chosen.  The 
only two operations I know of affected by this are the fstat syscall 
(thanks to Hirofumi for pointing that out) and file handle generation 
for NFS.  NFS sync mounts have to wait on directory writeout anyway, 
which hides the wait on delta setup.  There will be some additional 
latency with NFS async mounts, but these are seldom used as they are 
unsafe.  NFS v3 provides safe asynchronous writes via the COMMIT rpc, 
where only the first write RPC would see additional latency, while the 
COMMIT is required to wait for transfer to stable storage, so the 
visible effect should be minimal.

Directory listing cannot practically take account of changes made in 
cache that have not yet arrived in directory blocks, so to avoid 
surprises such as creating a file and failing to see it in a directory 
listing immediately after, listing a directory that has deferred 
changes would need to force a delta transition and wait for setup 
completion.

One has to consider carefully what will happen when a file is repeatedly 
created and deleted before any phase transition takes place.  On 
deletion, the presence of a "negative dentry" shows that the file has 
indeed been deleted.  According to the thought experiments I have 
performed, this just works.

The net complexity increase is probably fairly modest, and likely less 
than what would be required to implement fine-grained synchronization 
of inode table block updates, while promising superior results.  With 
deferred namespace operations, Tux3 could achieve performance similar 
to RAMFS for loads where many small files are created and/or deleted 
without exceeding cache capacity.  For heavy, namespace intensive 
loads, batching up namespace operations will boost performance 
measurably.  For example, a number of adjacent available inodes could 
be located at the same time, and inode attributes for an entire block 
entered at once.  The common case of file create followed immediately 
by write requires only one update of the inode table block instead of 
two.  A file that is created and deleted before a delta transition 
takes place will not only never appear on disk, it will not even appear 
in a cached disk block.

There are also attractive new optimization possibilities.  We might 
observe that a large number of files have been created in a new 
directory, and therefore the allocation goal for the directory should 
be relatively far from its parent so it has room to grow, or that a 
newly created file has had a large amount of data written to it and 
should therefore be allocated well clear of its siblings.  Deferred 
inode number assignment allows the inode to stay close to the file data 
in these cases.

On balance, the rewards seem well worth the effort.  But this is 
definitely new, unexplored territory, with unforeseen issues likely to 
arise.  If the worst happens, we can always give up on deferred 
namespace operations, fall back to the synchronization approach 
described above, and improve it incrementally.  But I do not think that 
will be necessary.  It was not long ago that deferred allocation seemed 
like a risky proposition, and now every modern filesystem design 
includes it.  I expect that things will play out the same way with 
deferred namespace operations.

[1] Note the terminology change: what I formerly called a "phase" is now 
a "delta", the latter being existing terminology that accurately 
describes the concept.

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3