[Tux3] Patch: Preliminary attempt at nospace handling

Daniel Phillips phillips at phunq.net
Mon Mar 2 16:48:54 PST 2009


Hi,

I was not really expecting to see people testing out of space handling in Tux3 at this early stage, but Marcin apparently failed to get that memo, so he went ahead and demonstrated directory corruption by hitting nospace in his mass file creation test.  To avoid further embarrassment of this nature, I put together a preliminary patch to fail gracefully in the file create instead of just stumbling forward and running out of disk blocks deep in metadata update.

The classic way of dealing with this messy issue is as follows: at the beginning of each atomic filesystem change, one overestimates the worst case number of blocks that will be required to store the change, including all associated metadata blocks, and reserves that many "credits" against the remaining free blocks on the filesystem.  If the remaining free blocks count lies below some safety margin depending on the user privilege and our own confidence that the reservation code is correct, we bail out of the change with ENOSPC, which bubbles up to the application.  If remaining free blocks are sufficient, on the other hand, we press on with the change, counting the number of new blocks actually allocated.  When the change is done (but not necessarily recorded on disk) any unused credit, that is, reservation in excess of actual blocks consumed, is returned to the credit pool.
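To make the classic scheme concrete, here is a minimal userspace sketch of it. It is not Tux3 code: struct space_pool, struct change and the classic_* functions are hypothetical names invented for illustration, and a single margin stands in for the privilege-dependent margin described above.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical accounting state, not the Tux3 superblock layout. */
struct space_pool {
	int64_t freeblocks; /* blocks actually free on the volume */
	int64_t reserved;   /* credits held by changes in flight */
	int64_t margin;     /* safety margin that must stay unreserved */
};

struct change {
	int64_t credits; /* worst case estimate reserved at begin */
	int64_t used;    /* blocks actually allocated so far */
};

/* Reserve the worst case up front, or fail before anything is modified. */
static int classic_change_begin(struct space_pool *pool, struct change *change,
				int64_t worst_case)
{
	if (pool->freeblocks - pool->reserved - worst_case < pool->margin)
		return -ENOSPC;
	pool->reserved += worst_case;
	change->credits = worst_case;
	change->used = 0;
	return 0;
}

/* Convert one reserved credit into one really allocated block. */
static int classic_alloc_block(struct space_pool *pool, struct change *change)
{
	if (change->used >= change->credits)
		return -ENOSPC; /* the estimate was not a true worst case */
	change->used++;
	pool->freeblocks--;
	pool->reserved--;
	return 0;
}

/* Return the reservation in excess of blocks consumed to the pool. */
static void classic_change_end(struct space_pool *pool, struct change *change)
{
	pool->reserved -= change->credits - change->used;
	change->credits = 0;
	change->used = 0;
}
```

The bookkeeping in classic_alloc_block and classic_change_end is exactly the part that becomes awkward once allocation is deferred past the end of the change.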

Well, I was too lazy to implement all the accounting code needed to keep track of actual allocations, and in some cases the actual allocation does not occur inside the change anyway, but is deferred for a later flush operation.  We already have this situation with delayed write allocation and we will soon have more of that with delayed bitmap block flushing.  So rather than do that big, fragile hack right now, I tried a more creative approach.  I make a gross overestimate of the number of blocks required to store the transaction as described above, but do not attempt to return unused credits to the credit pool.  Instead, I just let Tux3 "hit the wall" and detect insufficient unreserved free blocks while there is still plenty of actual free space on the volume.  At this point, I force a flush to disk and reset credits to zero, then repeat the reservation check against actual volume free space, now that all earlier reservations have been translated into a concrete on-disk representation.  If this second check fails, it is a real out of space condition and ENOSPC is returned to the application.
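The flush and recheck idea can be sketched in a few lines of userspace C. This is a simulation under stated assumptions, not the code in the attached patch: struct volume, flush_volume and change_consume are hypothetical, and the "flush" merely folds simulated dirty blocks into the free count where the real one writes cache to disk.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of the volume; field names are illustrative only. */
struct volume {
	int64_t freeblocks; /* free blocks as of the last flush */
	int64_t dirty;      /* blocks consumed since then, not yet on disk */
	int64_t credits;    /* overestimates piled up since the last flush */
	int64_t margin;     /* safety margin */
	unsigned flushes;   /* number of flushes, for observation only */
};

/* Stand-in for the real flush: make consumption concrete, drop credits. */
static void flush_volume(struct volume *vol)
{
	vol->freeblocks -= vol->dirty;
	vol->dirty = 0;
	vol->credits = 0;
	vol->flushes++;
}

/* Cheap arithmetic check; the flush is taken only on hitting the wall. */
static int change_begin(struct volume *vol, int64_t estimate)
{
	if (vol->freeblocks - vol->credits - estimate < vol->margin) {
		flush_volume(vol);
		/* Second check, now against actual free space. */
		if (vol->freeblocks - estimate < vol->margin)
			return -ENOSPC;
	}
	vol->credits += estimate; /* never handed back; only a flush resets it */
	return 0;
}

/* Record blocks a change really used (deferred allocation, simulated). */
static void change_consume(struct volume *vol, int64_t blocks)
{
	vol->dirty += blocks;
}
```

With a gross estimate of 10 blocks per change and 1 block actually consumed, this model triggers its first flush only after several changes, then flushes more and more often as free space shrinks, and finally returns -ENOSPC once fewer than estimate plus margin blocks really remain.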

Most of the time, the second check does not fail.  However, as the volume gets close to full, the frequency of the flush/recheck fallback cycle increases.  When there are just a handful of blocks left, every change_begin causes a full flush, which does not take very long, because not much cache has been dirtied since the previous flush.

In practice, this appears to work very well.  Normal operation is scarcely affected, because all I have done is add some efficient arithmetic to the change_begin.  We don't take any locks for the check, because the check is conservative.

During a flush triggered by exceeding the credit limit, recursive calls to change_begin do take place, but we need to avoid recursive flushes.  This is handled (hackishly) by negating the sb->margin variable (our safety margin for reservation calculations).  Then any reservation check will succeed if it sees that negative flag, on the theory that the operation is part of a flush and space for it has already been reserved.  Well, this is not actually correct, because new filesystem operations could arrive during the flush, and we want those just to wait instead of bypassing the credit check.  So there is some cleanup work to do, but in practice this works, and fixes Marcin's observed directory corruption.  We need to refine this synchronization strategy to be less of a hack.
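The negated margin trick might look roughly like the sketch below. Only the idea of negating the margin comes from the description above; struct sb_sim, its fields and both functions are made up for illustration, and the nested call stands in for whatever metadata change the real flush issues. Note that the flag only works for a nonzero margin, and that this single-threaded model cannot show the race with new operations arriving during the flush.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the superblock fields involved. */
struct sb_sim {
	int64_t freeblocks; /* actual free blocks */
	int64_t credits;    /* credits reserved since the last flush */
	int64_t margin;     /* safety margin; negative means "flush running" */
	unsigned flushes;   /* completed flushes */
	unsigned nested;    /* change_begin calls that bypassed the check */
};

static int sim_change_begin(struct sb_sim *sb, int64_t estimate);

static void sim_flush(struct sb_sim *sb)
{
	sb->margin = -sb->margin; /* raise the flag (margin must be nonzero) */
	sim_change_begin(sb, 1);  /* a change issued by the flush itself */
	sb->credits = 0;
	sb->margin = -sb->margin; /* lower the flag again */
	sb->flushes++;
}

static int sim_change_begin(struct sb_sim *sb, int64_t estimate)
{
	if (sb->margin < 0) {
		/* Inside a flush: assume space was reserved already. */
		sb->nested++;
		return 0;
	}
	if (sb->freeblocks - sb->credits - estimate < sb->margin) {
		sim_flush(sb);
		if (sb->freeblocks - estimate < sb->margin)
			return -ENOSPC;
	}
	sb->credits += estimate;
	return 0;
}
```

A cleaner version would presumably keep a separate flush-in-progress state and make unrelated callers wait on it instead of letting them through.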

One thing that needs to be cleaned up: right now, map_region just fails if it fails to allocate an extent exactly the size it requested from balloc, even though there may be smaller runs of free blocks adding up to the required amount still available on the volume.  We need to implement a fallback to attempt a fragmented allocation if no extent of the ideal size is available, or does not lie within an acceptable distance from the allocation goal.  This is a slightly messy requirement of the extent allocation strategy that is not optional, otherwise fragmentation can cause spurious out of space errors when there is still sufficient space available.  This is probably not even rare when somebody runs a lot of filesystem activity against a nearly full volume, an all too common usage pattern.  Untarring a kernel tree that starts to fail on out of space partway through, generating thousands of attempts to store files in rapid succession, some of them succeeding, is a common scenario that can very quickly find flaws in a broken out of space handling strategy.
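As a rough illustration of the fallback, the sketch below first looks for a single run of the requested size and otherwise stitches the request together from smaller free runs. It is a toy model with one byte per block, not the real map_region or balloc interface: struct extent, find_run and alloc_region are hypothetical, and the allocation goal and distance test are left out entirely.

```c
#include <errno.h>
#include <stddef.h>

struct extent {
	size_t start, count;
};

/* Find the first free run of at least count blocks (map[i] != 0 is in use). */
static int find_run(const unsigned char *map, size_t size, size_t count,
		    struct extent *out)
{
	size_t run = 0;

	for (size_t i = 0; i < size; i++) {
		run = map[i] ? 0 : run + 1;
		if (run == count) {
			out->start = i + 1 - count;
			out->count = count;
			return 1;
		}
	}
	return 0;
}

/*
 * Allocate count blocks as one extent if possible, otherwise as up to
 * max_ext fragments.  Returns the number of extents used, or -ENOSPC
 * with the map left untouched.
 */
static int alloc_region(unsigned char *map, size_t size, size_t count,
			struct extent *ext, size_t max_ext)
{
	size_t n = 0, got = 0;

	if (count == 0)
		return 0;
	if (max_ext > 0 && find_run(map, size, count, &ext[0])) {
		n = 1;
	} else {
		/* Fallback: gather free blocks in address order, merging neighbors. */
		for (size_t i = 0; i < size && got < count; i++) {
			if (map[i])
				continue;
			if (n > 0 && ext[n - 1].start + ext[n - 1].count == i) {
				ext[n - 1].count++;
			} else {
				if (n == max_ext)
					return -ENOSPC;
				ext[n].start = i;
				ext[n].count = 1;
				n++;
			}
			got++;
		}
		if (got < count)
			return -ENOSPC; /* genuinely out of space */
	}
	/* Commit: mark every chosen block as in use. */
	for (size_t e = 0; e < n; e++)
		for (size_t i = 0; i < ext[e].count; i++)
			map[ext[e].start + i] = 1;
	return (int)n;
}
```

With this shape, a request fails only when the free blocks really do not add up (or the caller's extent array is too small), which is the property needed to avoid spurious out of space errors on a fragmented volume.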

I used a really crude technique to flush the cache.  For some reason, none of the VFS machinery for flushing inodes is exposed to modules.  Except for freeze_bdev, which incorporates a flush, among various other odd things.  So I do a freeze_bdev to flush the volume, which locks out further writes to the filesystem as a side effect (but not further metadata operations as far as I can see, maybe I did not look carefully enough).  This is pretty gross, and I am not at all sure that it cannot deadlock.  Our flush really should be a delta commit, not this heavy handed thing that tries to handle all filesystems uniformly without really knowing what it is doing.  However it does seem to work pretty well, and lets us see how nospace handling needs to work, even before the delta commit mechanism is fully in place.

It is pretty cool the way the system behaves as it gets close to full.  Flushes are triggered with accelerating frequency, consuming less free space on each cycle.  Finally, we hit true out of space and the application screeches to a halt with an error message, all in good order.  Then, as far as I can see, there is no filesystem corruption of the kind Marcin was able to trigger easily on an unpatched system.  We do need to confirm this by putting together some sort of preliminary fsck, but things are looking pretty good.  I think we can evolve this off the wall idea into a solid out of space handling mechanism that is light, tight and efficient.

Regards,

Daniel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hack.nospace.patch
Type: text/x-diff
Size: 12856 bytes
Desc: not available
URL: <http://phunq.net/pipermail/tux3/attachments/20090302/2b078471/attachment.patch>