[Tux3] Hi, Where can I find the description about phase tree algorithm.
phillips at phunq.net
Sun Aug 3 22:42:49 PDT 2008
On Sunday 03 August 2008 21:06, Liu Hui wrote:
> Thanks for quick response but the link is invalid :(
>From the wayback machine:
And just so we don't lose it again:
The phase tree algorithm
The 'phase tree' algorithm implements a procedure called 'atomic
updating' whereby a filesystem in the process of being modified moves
atomically from one consistent state to the next, modified state
instantly, with no intermediate states at which its internal
structures are inconsistent.
The advantage of atomic updating is that even if processing is
interrupted and the contents of memory are lost a consistent
filesystem will always remain recorded on disk, even if it was in the
process of being updated at the time of interruption. Atomic updating
thus permits the use of write-behind caching without risk of damage to
a filesystem in the event the contents of the cache are lost.
Definition of 'consistent':
Being consistent means that a filesystem has exactly the structure
expected given the filesystem operations that were performed on it.
In particular, all and only the blocks referenced in the filesystem
tree are marked as allocated in the tree's allocation map, and all
internal bookkeeping values such as free block and inode counts are
The phase tree algorithm requires that a filesystem be structured
strictly as a tree so that its entire state is represented by a
single block called the 'metaroot'.
Example filesystem structure:
Two subtrees descend from the metaroot: an inode metafile and a block
allocation metafile. Each metafile consists of a tree of index
blocks with data blocks at the leaves. Each leaf block of the inode
metafile contains some number of inodes, and each leaf block of the
allocation bitmap metafile contains a bitmap indicating the allocation
status of corresponding data blocks. From each inode descends a tree
of index blocks with data blocks at its leaves.
Normally, three filesystem trees exist simultaneously, each with its
own metaroot. One is recorded on disk with a complete, consistent
tree descending from it. A consistent second tree, the 'recording'
tree, in the process of being recorded to disk, descends from a in
metaroot memory, and some of its blocks are in dirty buffers. A third
tree is in the process of being accessed and updated by filesystem
operations, with its metaroot also in memory. This is called the
'branching' tree and is not quite consistent: some of the blocks that
are free in it are not marked as free in its block allocation maps but
instead are held on a 'deferred free' list, or 'defree' list.
Temporary inconsistency may also exist in the branching tree at
intermediate states of file update operations.
During the process of updating the branching tree no alocated block in
the recording tree is ever modified. The process by which this is
accomplished is called 'branching' (see below). The buffered
images of blocks shared by the recorded and branching tree may be
modified, but these buffers are prevented from being written to disk,
so that the recorded tree is not modified.
At some point the recording tree will be fully recorded on disk and
its metaroot can be written to disk so that it replaces the metaroot
of the recorded tree. This causes the filesystem to move atomically
between states, as desired. At this point, the recording tree becomes
the recorded tree, and the branching trees metaroot is copied to become
the new recording tree. This event is called a 'phase transistion' and
the period of time between two such events is called a 'phase'.
The replacement of one metaroot by another is accomplished by
incrementing a counter recorded in the metaroot each time a metaroot is
saved to disk and saving the metaroot to one of a predetermined set of
locations on disk, chosen so that it is not the same as the previously
recorded metaroot. In the event that filesystem processing is
terminated normally the most recently recorded metaroot is copied to
every one of the set of locations. In both cases, each recorded
metaroot carries a checksum to verify that its contents are valid,
since it is not impossible that filesystem processing be interrupted
partway through the writing of a metaroot.
When filesystem processing is initiated, all the recorded metaroots
with valid checksums are compared, and if they are the same then it is
known that filesystem processing terminated normally, otherwise it must
have been interrupted, and this can be reported. If some but not all
of the metaroots are the same then processing is known to have been
interrupted during the process of normal termination. In any case,
the metaroot with the highest counter value (using modular arithmetic)
is taken to be the current metaroot.
Definition: The defree list
A list of blocks that have been freed or branched, but can't be
reallocated until after the next phase transition, because they are
allocated to a file in the previous phase.
The phase tree algorithm can now be described in terms of actions
performed as parts of normal filesystem operations:
To free a block:
Add the block to the defree list if its buffer is clean, otherwise
branch the appropriate bitmap block and toggle the appropriate bit
in the branched bitmap block buffered image.
To branch a block:
Find the block's buffer or if it doesn't have one, create one and
read in the block. If the buffer found is dirty, don't do anything.
Otherwise, Allocate a new block and copy the old block's data to the
new block. Free the old block.
To allocate a block:
Branch the bitmap block that records the allocated status of the
block (2) and mark the block allocated in the new bitmap block. Branch
the metaindex block that points at the old copy of the bitmap and
update the new version to point to the new bitmap block. Continue
this process up to the new metaroot.
To update an inode:
Branch the block that contains the inode and update the copy of the
inode in the new block. Recursively branch the metaindex block that
pointed at the old inode block, and so on, back up to the new
To update a block:
Branch the block and copy in the new data. Recursively branch the
index block that pointed at the old block and update the pointer on
it to point at the new block. Continue this process up to the inode.
To read a block:
Start at the metaroot and follow pointers through metaindex blocks to
find the appropriate inode. Then follow pointers through the
inode's index blocks to find the appropriate data block and copy out
To end a phase and begin the next one:
(*) Block filesystem updating. Branch all allocation bitmap blocks
that map blocks listed in the defree list. Save the new metaroot to a
temporary location. Mark all defreed blocks free in the branched
bitmap. Clear the defree list. Make a list of all dirty buffers,
marking them clean at the same time, and locking them in memory.
Unblock filesystem updating (*) Write all the locked buffers to disk
and unlock them, then replace the old metaroot with the saved one.
Between the two points marked (*) filesystem updating is blocked while
a number of operation are carried out. All these operations are
performed in memory and are efficient. Other than when buffer cache
memory becomes full, the phase tree algorithm does not ever require
filesystem updating to be blocked waiting for disk I/O to complete.
Tux3 mailing list
Tux3 at tux3.org
More information about the Tux3