[Tux3] Deferred delete, second attempt
Daniel Phillips
phillips at phunq.net
Wed Dec 3 19:42:17 PST 2008
Hirofumi pointed out that the first version of my patch will fail if a
file has a dentry open at the time of unlink. The dcache handles this
brutally: any dentry with an elevated use count at the time of unlink
is forcibly removed from the dentry hash and otherwise remains attached
to the still-open inode and its parent directory. Eventually the open
file will close, then the corresponding dput will reduce the dentry
count to zero and detach the dentry from the inode and parent
directory.
The problem for deferred delete is, we cannot remove the dentry from the
cache hash like that because we need it to stay around as a negative
dentry: without the negative dentry a real lookup on the filesystem at
that point would erroneously report that the name still exists. The
user has just done an unlink, so that would be a surprise.
To fix this I introduced a new dentry flag, DCACHE_HIDDEN, to make the
open dentry appear to be a negative dentry, even while it is still
attached to an inode. I added a d_negative(dentry) wrapper that tests
for both the flag and the absence of an inode for the dentry, the
latter being the traditional way of marking a dentry negative.
(Actually, it might make sense to use a state bit for this instead of
the inode test, but that is another story.)
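As a rough illustration of the wrapper described above, here is a minimal userspace sketch. The struct definitions are toy stand-ins for the kernel's, and the bit value chosen for DCACHE_HIDDEN is an assumption made only so the example compiles; the real definition lives in the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the fields used here. */
struct inode {
    unsigned long i_ino;
};

struct dentry {
    unsigned int d_flags;
    struct inode *d_inode;  /* NULL is the traditional "negative" marker */
};

/* Hypothetical bit value, chosen for illustration only. */
#define DCACHE_HIDDEN 0x0100

/*
 * A dentry counts as negative if it has no inode attached, or if it
 * has been hidden by an unlink while still attached to an open inode.
 */
static inline bool d_negative(const struct dentry *dentry)
{
    assert(dentry != NULL);
    return !dentry->d_inode || (dentry->d_flags & DCACHE_HIDDEN);
}
```

A lookup path would then test d_negative(dentry) where it previously tested !dentry->d_inode.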
The new d_negative wrapper had to be laboriously applied to each place
where dentry->d_inode is used as a logical value to determine whether a dentry
is negative. There are a lot of places where the field is used for
other purposes, and a few where it is hard to tell for sure on a quick
reading, which I have commented in the patch.
Only fs/namei.c appears to need to be updated with this wrapper.
The dcache code itself (fs/dcache.c) does not do lookups, and usage in
other filesystems is confined to those filesystems. The traditional way
of doing things is unaffected. I am not sure about the network
filesystems nfs and cifs, which may be doing operations on the dentries
of the underlying filesystem. I haven't looked into that question yet.
Now, when an open file is unlinked, the ->hide method in ext2 sets the
HIDDEN bit, takes a reference to keep the dentry around until the
deferred unlink takes place, and returns a flag indicating that the
dentry had to be hidden. Hiding is only necessary in the case that the
name still exists in the underlying filesystem (DCACHE_BACKED from the
first version of the patch). If the deferred delete and the dcache
itself hold the only references on the dentry, then it will be detached
from the inode immediately and the inode is scheduled for deletion at
that time (also deferred). Otherwise, the dentry is brutally unhashed
in the traditional way, and eventually will be d_deleted again by the
flush routine, after the name has been removed from the underlying
filesystem. If the file is still open at that time, d_delete falls
back to the traditional unhash, which now does no harm
because the dcache state matches the filesystem state, so it is no
longer necessary to hold the dentry in cache.
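The hide step and the later flush can be modeled in a few lines of userspace C. This is only a sketch of the reference and flag bookkeeping described above, not the code in the patch: the toy_ function names, the struct layout and both bit values are assumptions, and clearing the hidden bit at flush time is a simplification of this model. The unhash fallback for still-open files is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the fields used here. */
struct inode {
    unsigned long i_ino;
};

struct dentry {
    int d_count;            /* reference count */
    unsigned int d_flags;
    struct inode *d_inode;
};

/* Hypothetical bit values, chosen for illustration only. */
#define DCACHE_BACKED 0x0080  /* name still exists in the filesystem */
#define DCACHE_HIDDEN 0x0100  /* treat as negative despite the inode */

/*
 * Model of a ->hide method. Returns true if the dentry had to be
 * hidden, which is only the case while the name is still on disk.
 */
static bool toy_hide(struct dentry *dentry)
{
    assert(dentry != NULL);
    if (!(dentry->d_flags & DCACHE_BACKED))
        return false;       /* nothing on disk contradicts the unlink */
    dentry->d_flags |= DCACHE_HIDDEN;
    dentry->d_count++;      /* held until the deferred unlink runs */
    return true;
}

/*
 * Model of the flush side: the name is removed from the filesystem,
 * so the dentry no longer needs to be hidden or pinned.
 */
static void toy_flush_unlink(struct dentry *dentry)
{
    assert(dentry != NULL);
    dentry->d_flags &= ~(DCACHE_BACKED | DCACHE_HIDDEN);
    dentry->d_count--;      /* drop the reference taken by toy_hide */
}
```

Once the flush has run, the cache state matches the filesystem state, which is why the traditional unhash is harmless from that point on.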
An additional improvement in this patch is to have ext2's getdents
method flush the directory first. This method retrieves its results
directly from the filesystem, so any pending changes in the dcache have
to be flushed to the filesystem to get consistent results.
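The flush-before-read ordering can be shown with a toy directory that keeps its names in one array and its deferred unlinks in another. Everything here (struct toy_dir and the toy_ functions) is hypothetical and stands in for the real ext2 directory code; toy_sync_dir only plays the role that the ext2_sync_dir lines play in the traces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAMES 8

/* Toy directory: names on "disk" plus unlinks still held in the cache. */
struct toy_dir {
    const char *disk[MAX_NAMES];     /* names present in the filesystem */
    size_t ndisk;
    const char *pending[MAX_NAMES];  /* deferred unlinks, not yet applied */
    size_t npending;
};

/* Record an unlink in the cache without touching the on-disk names. */
static void toy_defer_unlink(struct toy_dir *dir, const char *name)
{
    assert(dir->npending < MAX_NAMES);
    dir->pending[dir->npending++] = name;
}

/* Apply every deferred unlink to the on-disk name list. */
static void toy_sync_dir(struct toy_dir *dir)
{
    for (size_t p = 0; p < dir->npending; p++) {
        size_t out = 0;
        for (size_t i = 0; i < dir->ndisk; i++)
            if (strcmp(dir->disk[i], dir->pending[p]) != 0)
                dir->disk[out++] = dir->disk[i];
        dir->ndisk = out;
    }
    dir->npending = 0;
}

/*
 * Directory read: results come straight from the on-disk list, so the
 * pending changes are flushed first. Returns the number of names copied.
 */
static size_t toy_readdir(struct toy_dir *dir, const char **out, size_t max)
{
    toy_sync_dir(dir);
    size_t n = dir->ndisk < max ? dir->ndisk : max;
    for (size_t i = 0; i < n; i++)
        out[i] = dir->disk[i];
    return n;
}
```

Without the toy_sync_dir call at the top of toy_readdir, a listing taken after a deferred unlink would still include the removed name.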
At this point, only two changes have been made to the dentry cache
itself:
1) A new d_ops.hide method is added.
2) A new DCACHE_HIDDEN flag marks a dentry as negative while
still attached to an inode.
Actually, these changes are very small in comparison to the new feature
supported. The dcache is now able to function as a writeback cache,
holding a consistent future state of the filesystem, whereas up till
now it has functioned as a read-only cache, whose contents exactly match
the filesystem outside its spinlocks.
So far, this patch only addresses deferred delete. To be useful for
Tux3, deferred create and rename have to be implemented as well,
otherwise the contention on inode table blocks between front end
operations and back end flushing cannot be eliminated. I think the
delete side was probably the hardest part and the only place where the
dcache needed an extension, but we shall see.
As I mentioned earlier, we will be able to compile with or without this
dcache extension. Without the deferred namespace operations we will
use a somewhat less efficient update method that wraps the delta
staging operation in a rw semaphore. When we build as a module, it
will typically be without the dcache patch.
The following two traces show the sequence of events for a deferred
unlink. In the second case, I simulated an open file via a small
change to d_delete. The effect is to initiate the deferred inode
delete a little later.
root@usermode:~# rm /mnt/foo
>>> defer unlink: 0986e9f8/1 48 "foo"
>>> hide dentry: 0986e9f8/1 48 "foo"
>>> defer inode delete: 0988c0b0/0 0
root@usermode:~# ls /mnt
>>> ext2_sync_dir 0984dd30 "/"
>>> dentry: 0986e9f8/1 68 "foo"
>>> drop hidden dentry: 0986e9f8/1 48 "foo"
>>> deferred unlink: 0986e9f8/1 8 "foo"
>>> ext2_sync_dir 0984dd30 "/"
>>> dentry: 0986e9f8/0 8 "foo"
d dir lost+found
root@usermode:~# umount /mnt
>>> delete deferred inode: 0988c0b0/0 0
>>> ext2_delete_inode
root@usermode:~# rm /mnt/foo
>>> defer unlink: 0985365c/1 48 "foo"
>>> hide dentry: 0985365c/1 48 "foo"
root@usermode:~# ls /mnt/foo
/mnt/foo
root@usermode:~# ls /mnt
>>> ext2_sync_dir 09851cac "/"
>>> dentry: 0985365c/1 68 "foo"
>>> drop hidden dentry: 0985365c/1 48 "foo"
>>> deferred unlink: 0985365c/1 8 "foo"
>>> defer inode delete: 0988ae0c/0 0
>>> ext2_sync_dir 09851cac "/"
>>> dentry: 0985365c/0 8 "foo"
d dir lost+found
root@usermode:~# umount /mnt
>>> delete deferred inode: 0988ae0c/0 0
>>> ext2_delete_inode
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ext2.defer.unlink.patch
Type: text/x-diff
Size: 15751 bytes
Desc: not available
URL: <http://phunq.net/pipermail/tux3/attachments/20081203/bb90a9c3/attachment.patch>