[Tux3] On avoiding deadlock when allocating space (and handling quota)

Daniel Phillips phillips at phunq.net
Sun Aug 3 23:17:04 PDT 2008


Hi Antonio,

Sometimes the most interesting mails take the most time to respond to.

On Saturday 02 August 2008 03:45, Antonio Vargas wrote:
> Hi Daniel, your first mention of managing to do deadlock-free quotas
> reminded me of the method I used on my userspace prototype of tux2.
> 
> When implementing the physical structure of the metatree, the metaroot
> replicas contained the first 16 or 32 inodes, which were reserved for
> internal use. So, under that scheme, inode 1 had the free bitmap as the
> metatree, inode 2 had a list of blocks ready to use, and inode 3 had the
> rest of the inode table as its datablocks.
> 
> So, at mkfs time, not only were the blocks used for metadata marked as
> used in the free bitmap, but some blocks not yet holding actual data were
> also marked as used in the bitmap and added to inode 2 as a list of
> blocks ready to be allocated without having to touch the free bitmaps.
> 
> Then, when starting an operation which could potentially need many new
> blocks due to COW of the metadata trees, I would first of all ask for an
> upper bound of blocks to be reserved. These blocks would be set aside for
> this operation and taken out of the in-memory ready-to-alloc list, and
> any COW needed later would simply grab a block from that
> operation-private list in memory.
> 
> When an operation finished, any blocks which had been reserved would be
> returned to the ready-to-allocate list. When a phase finished, it would
> first initiate an internal-only operation to replenish that
> ready-to-allocate list from the bitmaps, using the previously existing
> ready-to-allocate blocks to avoid needing to modify the free bitmap
> while it was being updated to get more blocks out of it.
> 
> Of course this meant that the maximum length of a phase was capped by
> the number of ready-to-alloc blocks we always kept reserved in the free
> bitmap. But it also meant that deciding when to write a phase to disk
> was easy: when the number of in-memory ready-to-allocate blocks dropped
> low enough that a worst-case COW pass over the bitmap tree could no
> longer be guaranteed.
> 
> I think these same methods could be adapted to work with extents for tux3,
> and could enable a deadlock-free implementation of many operations without
> having to do much explicit work in each of them.
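
Just to be sure I am reading your scheme the way you intended, here is
roughly how I picture the reservation machinery in C.  Every name and
structure below is invented for the example, none of it is real tux2 or
Tux3 code, and it assumes reservations are taken and released under a
single lock:

  #include <errno.h>
  #include <stdint.h>
  #include <string.h>

  typedef uint64_t block_t;

  /* In-memory image of your inode 2: blocks already marked used in the
   * free bitmap, waiting to be handed out without touching the bitmap. */
  struct ready_list {
          block_t *blocks;
          unsigned count;
          unsigned low_water;     /* worst-case blocks to COW the bitmap tree */
  };

  /* Blocks set aside for one filesystem operation. */
  struct reservation {
          block_t *blocks;        /* caller provides upper_bound slots */
          unsigned count;
  };

  /* Pull a worst-case upper bound of blocks out of the ready list before
   * starting an operation.  Refusing to dip below low_water is also the
   * signal that it is time to commit the phase and replenish. */
  static int reserve_blocks(struct ready_list *ready, struct reservation *res,
                            unsigned upper_bound)
  {
          if (ready->count < upper_bound + ready->low_water)
                  return -ENOSPC;         /* commit phase, replenish, retry */
          ready->count -= upper_bound;
          memcpy(res->blocks, ready->blocks + ready->count,
                 upper_bound * sizeof(block_t));
          res->count = upper_bound;
          return 0;
  }

  /* COW inside the operation just takes from the private reservation. */
  static block_t alloc_from_reservation(struct reservation *res)
  {
          return res->blocks[--res->count];
  }

  /* Whatever was not used goes back when the operation ends. */
  static void release_reservation(struct ready_list *ready,
                                  struct reservation *res)
  {
          memcpy(ready->blocks + ready->count, res->blocks,
                 res->count * sizeof(block_t));
          ready->count += res->count;
          res->count = 0;
  }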

I have a similar idea in mind.  It is desirable to avoid depending on
specific pre-allocated blocks because that limits the choice of where
a block may be allocated, potentially causing increased fragmentation.
Instead, we intentionally leave allocation a little loose, so that we
do not try to use up every last free block in a given target region
before deciding we really should target some other region for the file
we are trying to write.  With maybe 1% of blocks free randomly all
over the filesystem, when we do need a few of them to make a commit we
will likely be able to find something pretty close to the related data.
This will work particularly well for update-in-place kinds of commits,
where the extra data allocation can be released once the log rollup has
progressed past that log transaction.
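
To sketch what I mean by leaving allocation a little loose: the target
region choice might look something like the following, where the 1%
floor is just the ballpark figure above and the names are made up, not
settled Tux3 policy.

  #include <stdbool.h>
  #include <stdint.h>

  struct region {
          uint64_t total_blocks;
          uint64_t free_blocks;
  };

  /* Keep roughly 1% of each region free for commit-time allocations
   * instead of squeezing out its last free blocks. */
  static bool region_too_tight(const struct region *r)
  {
          return r->free_blocks * 100 < r->total_blocks;
  }

  /* Walk regions starting at the goal (near the file's existing data),
   * skipping any region that is already too tight, so a few free blocks
   * stay scattered all over the volume. */
  static int choose_target_region(const struct region *regions,
                                  unsigned nregions, unsigned goal)
  {
          for (unsigned i = 0; i < nregions; i++) {
                  unsigned r = (goal + i) % nregions;
                  if (!region_too_tight(&regions[r]))
                          return (int)r;
          }
          return -1;      /* volume really is nearly full */
  }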

Another nice tool we have in the toolchest this time round is that we
can logically log an allocation instead of immediately updating the
free map.
A log transaction implies a temporary allocation, and we will ensure
that the blocks used by the log transaction are not given out by the
allocator until the transaction has been completely retired.
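
In rough terms I imagine something like the following, where the record
layout and names are made up for illustration and are not the actual
Tux3 log format:

  #include <stdint.h>
  #include <string.h>

  typedef uint64_t block_t;

  enum { LOG_ALLOC = 1 };         /* invented record type for the example */

  /* A logical record standing in for a free map update: "this extent is
   * now in use".  The free map itself is only rewritten at rollup time. */
  struct log_alloc_rec {
          uint8_t  type;
          uint8_t  count;         /* extent length in blocks */
          block_t  start;         /* first block of the extent */
  } __attribute__((packed));

  /* Append an allocation record to the log block being built for the
   * current transaction.  Returns bytes consumed, or 0 if full. */
  static unsigned log_alloc(uint8_t *logbuf, unsigned pos, unsigned size,
                            block_t start, uint8_t count)
  {
          struct log_alloc_rec rec = {
                  .type = LOG_ALLOC, .count = count, .start = start,
          };
          if (pos + sizeof rec > size)
                  return 0;
          memcpy(logbuf + pos, &rec, sizeof rec);
          return sizeof rec;
  }

The allocator then has to treat every block named by an unretired
record, and the blocks holding the log transaction itself, as
unavailable until the rollup catches up, which is just the rule
described above.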

Tux3 has phases too, where a phase is the set of blocks dirtied by
some group of filesystem transactions, which will all be committed to
disk in one atomic transaction.  Two things determine how big a phase
can be: the amount of cache memory available and the amount of free
space on the volume.  As the volume approaches full, the size of a
phase will be reduced.  I am not sure yet how to know how much cache is
available, but that should also affect maximum phase size.  I think that
will just be a magic number to start; later we can make it a tunable,
and one day figure out a sensible interface whereby various tasks
competing for resources can somehow establish a consensus on how much
cache memory each should be trying to use.  A research project.
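
For the simple version, the cap might start out as something like this,
where the constant is just the magic number I mentioned, picked
arbitrarily, and none of the names are real Tux3 code:

  #include <stdint.h>

  /* Arbitrary starting budget for dirty blocks held in cache;
   * eventually a tunable. */
  enum { PHASE_CACHE_BUDGET = 1 << 16 };

  /* Shrink the phase as the volume fills, so enough free space always
   * remains to commit the phase we are currently building. */
  static uint64_t max_phase_blocks(uint64_t free_blocks)
  {
          uint64_t space_cap = free_blocks / 4;   /* slack for COW and log blocks */
          return space_cap < PHASE_CACHE_BUDGET ? space_cap : PHASE_CACHE_BUDGET;
  }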

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3


