[Tux3] Improved prototype of btree node redirect

Daniel Phillips phillips at phunq.net
Fri Jan 23 02:03:57 PST 2009


The cursor_redirect function redirects a single btree block to a new 
physical location and updates the parent pointer, entirely in cache.
It is the heart of Tux3 atomic update.

All btree redirects involve physically mapped buffers in the physical 
volume map (volmap).  The block is always buffered in a btree cursor, 
as part of some btree update.  It is being redirected because it is 
mapped to a btree block on disk that must not be changed because it is 
part of the stable disk image and may have promises logged against it.

A btree block is redirected the first time it is dirtied, and then may 
be redirtied any number of times without being redirected, until it is 
flushed in a rollup cycle and becomes clean again.  Until the 
redirected block is flushed, the old physical block may not be freed, 
which is handled by the deferred free mechanism.  At rollup, all 
deferred frees are released into the allocation map.  (This is a 
separate deferred free list from the per-delta deferred free list, 
which handles freed data extents.)

On redirect, a redirect entry is logged giving the old and new physical 
locations for the block.  All promises made against the block reference 
the new physical location.  When the redirect entry is encountered on 
replay, the old block is loaded into the new volume map position, and 
promises are applied to it there.

If the redirected block is a not btree root (level > 0) then it is 
updated in the index node buffered one level lower in the cursor, and a 
promise is logged to update the parent block on disk.  Otherwise, the 
parent pointer is in an inode table block or is the root of the inode 
table.  The inode table root just requires a log entry which will be 
discovered on replay.

Changing a tree root in an inode table block is more involved.  It is 
easy to update the root in the cached inode, but not easy to locate the 
associated inode table block, if it even exists yet.  A log entry is 
made to locate and update the inode table entry at replay time, using 
the inode number as a logical key.  This requires an additional replay 
pass, a slight additional complication at replay time that is better 
than probing the btree to update the entry, which would be an 
uncomfortable recursion coming from inside a low level btree update.

int cursor_redirect(struct cursor *cursor, unsigned level)
{
	struct btree *btree = cursor->btree;
	struct sb *sb = btree->sb;
	struct buffer_head *buffer = cursor->path[level].buffer;
	assert(buffer_clean(buffer));

	/* Redirect Block */

	block_t oldblock = bufindex(buffer), newblock;
	int err = balloc(sb, 1, &newblock);
	if (err)
		return err;
	struct buffer_head *clone = blockget(mapping(sb->volmap), newblock);
	if (!clone)
		return -ENOMEM; // ERR_PTR me!!!
	memcpy(bufdata(clone), bufdata(buffer), bufsize(clone));
	mark_buffer_dirty(clone);
	cursor->path[level].buffer = clone;
	cursor->path[level].next += bufdata(clone) - bufdata(buffer);
	log_redirect(sb, oldblock, newblock);
	defree(sb, oldblock, 1);
	brelse(buffer);

	/* Update Parent */

	if (level) {
		struct index_entry *entry = cursor->path[level - 1].next - 1;
		block_t parent = bufindex(cursor->path[level - 1].buffer);
		assert(oldblock == from_be_u64(entry->block));
		entry->block = to_be_u64(newblock);
		log_update(sb, newblock, parent, from_be_u64(entry->key));
		return 0;
	}

	if (btree != &sb->itable) {
		struct inode *inode = container_of(btree, struct inode, btree);
		assert(oldblock == btree->root.block);
		btree->root.block = newblock;
		log_droot(sb, newblock, oldblock, inode->inum);
		return 0;
	}

	assert(oldblock == from_be_u64(sb->super.iroot));
	log_iroot(sb, newblock, oldblock);
	return 0;
}

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3



More information about the Tux3 mailing list