[Tux3] Deferred namespace operations, change return type of fs create method
snitzer at gmail.com
Mon Dec 8 16:38:17 PST 2008
On Mon, Dec 8, 2008 at 5:20 PM, Daniel Phillips <phillips at phunq.net> wrote:
> On Monday 08 December 2008 13:02, Mike Snitzer wrote:
>> 2008/12/8 Daniel Phillips <phillips at phunq.net>:
>> > This updated patch implements an instantiate variant that takes care of
>> > the orphan dirent problem (unlinked while open) by implementing a
>> > variant of d_instantiate that unhashes the orphan and returns a clone of
>> > the open dirent in the rare case that somebody creates an entry of the
>> > same name before the orphan closes:
>> Not to hijack this thread with a general tux3 design question related
>> to orphaned inodes but:
>> In reviewing http://userweb.kernel.org/~hirofumi/tux3/doc/design.html
>> I saw that forward logging should enable:
>> "logging orphan inodes that are unlinked while open, so they can be
>> deleted on replay after a crash."
>> "One traditional nasty case that becomes really nice with logical
>> forward logging is truncate of a gigantic file. We just need to commit
>> a logical update like ['resize', inum, 0] then the inode data truncate
>> can proceed as convenient. Another is orphan inode handling where an
>> open file has been completely unlinked, in which case we log the
>> logical change ['free', inum] then proceed with the actual delete when
>> the file is closed or when the log is replayed after a surprise
>> So putting my distributed filesystem hat on: One unfortunate aspect of
>> ext3 is that orphaned inode processing after a crash blindly deletes
>> all inodes with i_nlink==0. This is a problem if a remote client
>> application still has the orphaned inode open but the filesystem was
>> unmounted (either forcibly in the case of a Linux crash; or cleanly if
>> write access to the fs was revoked on a given server, e.g. filesystem
>> ownership migrated to another server). It is a problem because the
>> new owning server will re-mount the fs and the conventional orphaned
>> inode processing will clean up the orphaned inodes out from underneath
>> the remote client application, thereby breaking the application.
>> So my question is, how might tux3 be trained to _not_ clean up orphaned
>> inodes on re-mount like conventional Linux filesystems? Could a
>> re-mount filter be added that would trap and then somehow reschedule
>> tux3's deferred delete of orphan inodes? This would leave a window of
>> time for an exposed hook to be called (by an upper layer) to
>> reconstitute a reference on each orphaned inode that is still open.
> Something like the NFS silly rename problem. There, the client avoids
> deleting an open file by renaming it instead, which creates a cleanup
> problem.
> Something more elegant ought to be possible.
> If the dirent is gone, leaving an orphaned inode, and the filesystem
> has been convinced not to delete the orphan on restart, how would you
> re-open the file? Open by inode number from within kernel?
Well, in a distributed filesystem the server-side may not even have
the notion of open or closed; the client is concerned with such
But yes, some mechanism to read the orphaned inode off the disk into
memory. E.g. iget5_locked() (Linux gives you enough rope to defeat
i_nlink==0), combined with a call to read_inode() (ext3_read_inode()
became ext3_iget()). Unfortunately, to read orphaned inodes with ext3
that requires clearing the EXT3_ORPHAN_FS flag in the super_block's
It is all quite ugly; and maybe a corner-case that tux3 doesn't
need/want to worry about?
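
To make the re-mount filter idea a bit more concrete, here is a rough
userspace sketch in C. It is not tux3 code and does not touch any kernel
API: the record layout, replay_log(), orphan_filter_t and the open_set
helper are all hypothetical names made up for illustration, loosely
modelled on the ['resize', inum, 0] and ['free', inum] records described
in the design doc. The only point is that replay asks an upper-layer hook
before acting on a 'free' record, and sets the inode aside instead of
deleting it when the hook claims it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical logical log records, after ['resize', inum, size] and
 * ['free', inum] in the design doc. Not the real tux3 log format. */
enum log_op { LOG_RESIZE, LOG_FREE };

struct log_rec {
	enum log_op op;
	uint64_t inum;
	uint64_t size;	/* only meaningful for LOG_RESIZE */
};

/* Hypothetical upper-layer hook: returns true if some remote client
 * still holds the orphaned inode open. */
typedef bool (*orphan_filter_t)(uint64_t inum, void *ctx);

struct replay_result {
	size_t freed;		/* orphans that replay would delete */
	size_t deferred;	/* orphans set aside for later */
};

/*
 * Walk the log and decide the fate of each 'free' record. Claimed
 * inodes are copied into deferred[], which must have room for n
 * entries. A NULL filter gives the conventional "delete everything"
 * behaviour. Other record types are skipped in this sketch.
 */
static struct replay_result replay_log(const struct log_rec *log, size_t n,
				       orphan_filter_t still_open, void *ctx,
				       uint64_t *deferred)
{
	struct replay_result res = { 0, 0 };

	assert(log != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (log[i].op != LOG_FREE)
			continue;
		if (still_open && still_open(log[i].inum, ctx))
			deferred[res.deferred++] = log[i].inum;
		else
			res.freed++;	/* a real fs would delete here */
	}
	return res;
}

/* Example filter: a flat list of inode numbers the upper layer knows
 * to be open. Purely illustrative. */
struct open_set {
	const uint64_t *inums;
	size_t count;
};

static bool in_open_set(uint64_t inum, void *ctx)
{
	const struct open_set *set = ctx;

	for (size_t i = 0; i < set->count; i++)
		if (set->inums[i] == inum)
			return true;
	return false;
}
```

An upper layer would pass in_open_set (or something smarter) as the
filter and later re-issue the delete for each deferred inum once the
last remote reference goes away. How those references would be
reconstituted in the kernel is exactly the open question above.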