[Tux3] Deferred delete, second attempt

Daniel Phillips phillips at phunq.net
Wed Dec 3 19:42:17 PST 2008

Hirofumi pointed out that the first version of my patch will fail if a 
file has a dentry open at the time of unlink.  The dcache handles this 
brutally: any dentry with an elevated use count at the time of unlink 
is forcibly removed from the dentry hash and otherwise remains attached 
to the still-open inode and its parent directory.  Eventually the open 
file will close, then the corresponding dput will reduce the dentry 
count to zero and detach the dentry from the inode and parent.

The problem for deferred delete is, we cannot remove the dentry from the 
cache hash like that because we need it to stay around as a negative 
dentry: without the negative dentry a real lookup on the filesystem at 
that point would erroneously report that the name still exists.  The 
user has just done an unlink, so that would be a surprise.

To fix this I introduced a new dentry flag, DCACHE_HIDDEN, to make the 
open dentry appear to be a negative dentry, even while it is still 
attached to an inode.  I added a d_negative(dentry) wrapper that tests 
for both the flag and the absence of an inode for the dentry, the 
latter being the traditional way of marking a dentry negative.  
(Actually, it might make sense to use a state bit for this instead of 
the inode test, but that is another story.)
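
As a rough illustration only (this is not the code from the attached patch), the wrapper described above can be modeled in userspace C.  The struct layouts are cut down to the two fields that matter, and the bit value chosen for DCACHE_HIDDEN is an assumption; the demo function and its name are likewise hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel structures, reduced to what is used here. */
struct inode {
	unsigned long i_ino;
};

/* Assumed bit value, for illustration; the real patch may use another bit. */
#define DCACHE_HIDDEN 0x0100

struct dentry {
	unsigned int d_flags;
	struct inode *d_inode;
};

/*
 * A dentry counts as negative if it has no inode (the traditional test)
 * or if it has been marked hidden while still attached to an inode.
 */
bool d_negative(const struct dentry *dentry)
{
	return dentry->d_inode == NULL ||
	       (dentry->d_flags & DCACHE_HIDDEN) != 0;
}

/* Hypothetical usage: an open file that gets unlinked and hidden. */
void d_negative_demo(void)
{
	struct inode ino = { .i_ino = 12 };
	struct dentry open_file = { .d_flags = 0, .d_inode = &ino };

	assert(!d_negative(&open_file));	/* positive before unlink */
	open_file.d_flags |= DCACHE_HIDDEN;	/* hide instead of unhash */
	assert(d_negative(&open_file));		/* now looks negative */
	assert(open_file.d_inode == &ino);	/* inode is still attached */
}
```

The point of the wrapper is that callers stop testing the inode pointer directly, so a hidden dentry and a traditional negative dentry become indistinguishable to lookup code.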

The new d_negative wrapper had to be laboriously applied to each place
dentry->d_inode is used as a logical value to determine whether a dentry
is negative.  There are a lot of places where the field is used for
other purposes, and a few where it is hard to tell for sure on a quick 
reading, which I have commented in the patch.

Only fs/namei.c appears to need to be updated with this wrapper.  
Dcache.c itself does not do lookups, and usage in other filesystems is 
confined to those filesystems.  The traditional way of doing things is 
unaffected.  I am not sure about the network filesystems nfs and cifs, 
which may be doing operations on the dentries of the underlying 
filesystem.  I haven't looked into that question yet.

Now, when an open file is unlinked, the ->hide method in ext2 sets the 
HIDDEN bit, takes a reference count to keep the dentry around until the 
deferred unlink takes place, and returns a flag indicating that the 
dentry had to be hidden, which is only necessary in the case that the 
name still exists in the underlying filesystem (DCACHE_BACKED from the 
first version of the patch).  If the deferred delete and the dcache 
itself hold the only references on the dentry, then it will be detached 
from the inode immediately and the inode is scheduled for deletion at 
that time (also deferred).  Otherwise, the dentry is brutally unhashed 
in the traditional way, and eventually will be d_deleted again by the 
flush routine, after the name has been removed from the underlying 
filesystem.  If the file is still open at that time, d_delete falls 
back to the traditional unhash, which now does no harm
because the dcache state matches the filesystem state, so it is no 
longer necessary to hold the dentry in cache.
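
The sequence above is easier to follow as a toy model.  The sketch below is one reading of the description, not the patch itself: the flag values, the signature of the hide method, the "hashed" field and all the toy_* names are assumptions made for illustration, and d_count here counts only holders other than the cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed bit values; the real patch may differ. */
#define DCACHE_HIDDEN 0x0100	/* looks negative, inode still attached */
#define DCACHE_BACKED 0x0200	/* name still exists in the filesystem */

struct dentry;

struct dentry_operations {
	int (*hide)(struct dentry *dentry);	/* hypothetical signature */
};

struct dentry {
	unsigned int d_flags;
	int d_count;		/* holders other than the cache itself */
	bool hashed;		/* visible to cache lookups */
	const struct dentry_operations *d_op;
};

/* Hide only if the name is still backed; report whether hiding happened. */
int toy_hide(struct dentry *dentry)
{
	if (!(dentry->d_flags & DCACHE_BACKED))
		return 0;
	dentry->d_flags |= DCACHE_HIDDEN;
	dentry->d_count++;	/* reference held by the deferred unlink */
	return 1;
}

/* Unlink of a dentry that is still in use. */
void toy_unlink_busy(struct dentry *dentry)
{
	if (dentry->d_op && dentry->d_op->hide && dentry->d_op->hide(dentry))
		return;			/* stays hashed, appears negative */
	dentry->hashed = false;		/* traditional unhash */
}

/* Flush: the name is removed from the filesystem, then the dentry is dropped. */
void toy_flush(struct dentry *dentry)
{
	if (!(dentry->d_flags & DCACHE_HIDDEN))
		return;
	dentry->d_flags &= ~(unsigned int)(DCACHE_HIDDEN | DCACHE_BACKED);
	dentry->d_count--;		/* drop the deferred unlink reference */
	if (dentry->d_count > 0)
		dentry->hashed = false;	/* still open: unhash is now harmless */
}

/* Hypothetical usage: unlink an open, backed file, then flush. */
void toy_hide_demo(void)
{
	static const struct dentry_operations ops = { .hide = toy_hide };
	struct dentry d = { .d_flags = DCACHE_BACKED, .d_count = 1,
			    .hashed = true, .d_op = &ops };

	toy_unlink_busy(&d);
	assert(d.hashed && (d.d_flags & DCACHE_HIDDEN) && d.d_count == 2);
	toy_flush(&d);
	assert(!d.hashed && !(d.d_flags & DCACHE_HIDDEN) && d.d_count == 1);
}
```

The model leaves out the inode side entirely (detaching the inode and scheduling its deferred deletion); it only shows why the hidden dentry must stay hashed until the flush and may be unhashed afterwards.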

An additional improvement in this patch is to have ext2's getdents 
method flush the directory first.  This method retrieves its results 
directly from the filesystem, so any pending changes in the dcache have 
to be flushed to the filesystem to get consistent results.
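
A small userspace model shows why the flush has to come first.  Everything here (struct toy_dir, the toy_* functions, the fixed-size arrays) is hypothetical and stands in for the real ext2 directory code, which this sketch does not reproduce.

```c
#include <assert.h>
#include <string.h>

#define MAX_NAMES 8

/* Toy directory: names on "disk" plus unlinks known only to the cache. */
struct toy_dir {
	const char *on_disk[MAX_NAMES];
	int n_on_disk;
	const char *pending_unlink[MAX_NAMES];
	int n_pending;
};

/* Record an unlink in the cache without touching the on-disk names. */
void toy_defer_unlink(struct toy_dir *dir, const char *name)
{
	if (dir->n_pending < MAX_NAMES)
		dir->pending_unlink[dir->n_pending++] = name;
}

/* Apply every pending unlink to the on-disk name list. */
void toy_sync_dir(struct toy_dir *dir)
{
	for (int i = 0; i < dir->n_pending; i++) {
		for (int j = 0; j < dir->n_on_disk; j++) {
			if (strcmp(dir->on_disk[j], dir->pending_unlink[i]) == 0) {
				/* Remove by moving the last name into the gap. */
				dir->on_disk[j] = dir->on_disk[--dir->n_on_disk];
				break;
			}
		}
	}
	dir->n_pending = 0;
}

/* Directory listing reads the disk only, so it flushes the cache first. */
int toy_readdir(struct toy_dir *dir, const char **out, int max)
{
	int n = 0;

	toy_sync_dir(dir);
	for (int i = 0; i < dir->n_on_disk && n < max; i++)
		out[n++] = dir->on_disk[i];
	return n;
}

/* Hypothetical usage: "rm foo" followed by a directory listing. */
void toy_readdir_demo(void)
{
	struct toy_dir dir = { .on_disk = { "foo", "lost+found" }, .n_on_disk = 2 };
	const char *names[MAX_NAMES];
	int n;

	toy_defer_unlink(&dir, "foo");
	assert(dir.n_on_disk == 2);	/* the disk still lists foo */
	n = toy_readdir(&dir, names, MAX_NAMES);
	assert(n == 1 && strcmp(names[0], "lost+found") == 0);
}
```

Without the toy_sync_dir call at the top of toy_readdir, the listing would still include the name that was just removed.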

At this point, only two changes have been made to the dentry cache:

  1) A new d_ops.hide method is added.

  2) A new DCACHE_HIDDEN flag marks a dentry as negative while
     still attached to an inode.

Actually, these changes are very small in comparison to the new feature 
supported.  The dcache is now able to function as a writeback cache, 
holding a consistent future state of the filesystem, whereas up till 
now it has functioned as a read-only cache, whose contents exactly match 
the filesystem outside its spinlocks.

So far, this patch only addresses deferred delete.  To be useful for 
Tux3, deferred create and rename have to be implemented as well, 
otherwise the contention on inode table blocks between front end 
operations and back end flushing cannot be eliminated.  I think the 
delete side was probably the hardest part and the only place where the 
dcache needed an extension, but we shall see.

As I mentioned earlier, we will be able to compile with or without this 
dcache extension.  Without the deferred namespace operations we will 
use a somewhat less efficient update method that wraps the delta 
staging operation in a rw semaphore.  When we build as a module, it 
will typically be without the dcache patch.

The following two traces show the sequence of events for a deferred 
unlink.  In the second case, I simulated an open file via a small 
change to d_delete.  The effect is to initiate the deferred inode 
delete a little later.

root@usermode:~# rm /mnt/foo
>>> defer unlink: 0986e9f8/1 48 "foo"
>>> hide dentry: 0986e9f8/1 48 "foo"
>>> defer inode delete: 0988c0b0/0 0
root@usermode:~# ls /mnt
>>> ext2_sync_dir 0984dd30 "/"
>>> dentry: 0986e9f8/1 68 "foo"
>>> drop hidden dentry: 0986e9f8/1 48 "foo"
>>> deferred unlink: 0986e9f8/1 8 "foo"
>>> ext2_sync_dir 0984dd30 "/"
>>> dentry: 0986e9f8/0 8 "foo"
d  dir  lost+found
root@usermode:~# umount /mnt
>>> delete deferred inode: 0988c0b0/0 0
>>> ext2_delete_inode

root@usermode:~# rm /mnt/foo
>>> defer unlink: 0985365c/1 48 "foo"
>>> hide dentry: 0985365c/1 48 "foo"
root@usermode:~# ls /mnt/foo
root@usermode:~# ls /mnt
>>> ext2_sync_dir 09851cac "/"
>>> dentry: 0985365c/1 68 "foo"
>>> drop hidden dentry: 0985365c/1 48 "foo"
>>> deferred unlink: 0985365c/1 8 "foo"
>>> defer inode delete: 0988ae0c/0 0
>>> ext2_sync_dir 09851cac "/"
>>> dentry: 0985365c/0 8 "foo"
d  dir  lost+found
root@usermode:~# umount /mnt
>>> delete deferred inode: 0988ae0c/0 0
>>> ext2_delete_inode
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ext2.defer.unlink.patch
Type: text/x-diff
Size: 15751 bytes
Desc: not available
URL: <http://phunq.net/pipermail/tux3/attachments/20081203/bb90a9c3/attachment.patch>