[Tux3] Improved prototype of btree node redirect
Daniel Phillips
phillips at phunq.net
Fri Jan 23 02:03:57 PST 2009
The cursor_redirect function redirects a single btree block to a new
physical location and updates the parent pointer, entirely in cache.
It is the heart of Tux3 atomic update.
All btree redirects involve physically mapped buffers in the physical
volume map (volmap). The block is always buffered in a btree cursor,
as part of some btree update. It is being redirected because it is
mapped to a btree block on disk that must not be changed because it is
part of the stable disk image and may have promises logged against it.
A btree block is redirected the first time it is dirtied, and then may
be redirtied any number of times without being redirected, until it is
flushed in a rollup cycle and becomes clean again. Until the
redirected block is flushed, the old physical block may not be freed,
which is handled by the deferred free mechanism. At rollup, all
deferred frees are released into the allocation map. (This is a
separate deferred free list from the per-delta deferred free list,
which handles freed data extents.)
On redirect, a redirect entry is logged giving the old and new physical
locations for the block. All promises made against the block reference
the new physical location. When the redirect entry is encountered on
replay, the old block is loaded into the new volume map position, and
promises are applied to it there.
If the redirected block is a not btree root (level > 0) then it is
updated in the index node buffered one level lower in the cursor, and a
promise is logged to update the parent block on disk. Otherwise, the
parent pointer is in an inode table block or is the root of the inode
table. The inode table root just requires a log entry which will be
discovered on replay.
Changing a tree root in an inode table block is more involved. It is
easy to update the root in the cached inode, but not easy to locate the
associated inode table block, if it even exists yet. A log entry is
made to locate and update the inode table entry at replay time, using
the inode number as a logical key. This requires an additional replay
pass, a slight additional complication at replay time that is better
than probing the btree to update the entry, which would be an
uncomfortable recursion coming from inside a low level btree update.
int cursor_redirect(struct cursor *cursor, unsigned level)
{
struct btree *btree = cursor->btree;
struct sb *sb = btree->sb;
struct buffer_head *buffer = cursor->path[level].buffer;
assert(buffer_clean(buffer));
/* Redirect Block */
block_t oldblock = bufindex(buffer), newblock;
int err = balloc(sb, 1, &newblock);
if (err)
return err;
struct buffer_head *clone = blockget(mapping(sb->volmap), newblock);
if (!clone)
return -ENOMEM; // ERR_PTR me!!!
memcpy(bufdata(clone), bufdata(buffer), bufsize(clone));
mark_buffer_dirty(clone);
cursor->path[level].buffer = clone;
cursor->path[level].next += bufdata(clone) - bufdata(buffer);
log_redirect(sb, oldblock, newblock);
defree(sb, oldblock, 1);
brelse(buffer);
/* Update Parent */
if (level) {
struct index_entry *entry = cursor->path[level - 1].next - 1;
block_t parent = bufindex(cursor->path[level - 1].buffer);
assert(oldblock == from_be_u64(entry->block));
entry->block = to_be_u64(newblock);
log_update(sb, newblock, parent, from_be_u64(entry->key));
return 0;
}
if (btree != &sb->itable) {
struct inode *inode = container_of(btree, struct inode, btree);
assert(oldblock == btree->root.block);
btree->root.block = newblock;
log_droot(sb, newblock, oldblock, inode->inum);
return 0;
}
assert(oldblock == from_be_u64(sb->super.iroot));
log_iroot(sb, newblock, oldblock);
return 0;
}
_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
More information about the Tux3
mailing list