[Tux3] Patch: Recursive redirect prototype with unit test

Daniel Phillips phillips at phunq.net
Mon Jan 26 00:51:36 PST 2009


Cursor_redirect improvements:

  * Use new_block instead of recoding the new_block functionality.

  * Remove the level parameter: always start at the btree leaf and
    redirect clean blocks until a dirty block is found.

cursor_redirect is used every place we make a btree leaf dirty:

  * tree_expand (called by store_attrs)
  * purge_inum (called by tux3_delete_inode)
  * map_region (for create != 0)
  * tree_chop (called by tux3_truncate)

I think that may be all of the call points: one create and one delete
operation for each of inode table and data index.

Cursor_redirect is a simple function, used in a simple way: at each
top-level call site where we know we will alter a btree leaf,
cursor_redirect dirties, redirects and (not yet implemented for kernel)
keeps a list of dirty btree blocks.  The only other place redirect is
needed is for data extents in filemap.c, which is simpler because the
mapping is logical, so buffers do not have to be relocated.

The next tricky bit of atomic commit to take care of is metadata block
flushing: dleaf and ileaf blocks will be flushed as part of the delta
they were dirtied in, while index nodes will be flushed on the longer
rollup cycle (this is a change of plan that avoids, for the time being,
the complexity of replaying edit operations against leaf nodes).  So
redirected leaf nodes should go on a different list than redirected
index nodes.

Where a data btree root is redirected, the new root is updated in the
cached inode, which becomes dirty.  All dirty inodes will be flushed
in each delta.  Flushing an inode is just a call to store_attrs, which
dirties and possibly redirects an inode table block.  To put this in
perspective, the first data write to a file will dirty/redirect a path
to the root of the file index tree and dirty the inode.  The first
inode that is flushed at delta staging time will dirty/redirect a path
to the root of the inode table tree.  Additional file writes within the
delta will incur little additional redirect overhead.

Redirecting happens in cache.  Only btree leaf nodes, data extents and
log blocks are written to disk, per delta, while bitmap blocks and
btree index blocks are written out lazily.

It is now possible to say something about performance.  First, the
smallest possible delta: just an inode attribute change.

  1 ileaf
  1 log block
  1 update super

Total, 3 blocks.  Very good.  A more interesting case: create and write
a small file:

  1 dirent block
  1 data block
  1 ileaf
  1 dleaf for the dirent
  1 dleaf for the file
  1 log block
  1 update super

Total, 7 blocks.  Not bad.  The two dleaf writes and the ileaf write
can be optimized away, getting it down to four blocks, and the update
super can be avoided with forward logging, for three blocks.  That would
be really good.

Another interesting case: write a thousand small files.

  1000 data blocks
  1000 dleaf blocks
  16 dirent blocks
  6 ileaf blocks
  5 log blocks
  1 update super

Total, 2028 blocks.  Not as good: the data payload is only 1000 of
those blocks, so we achieve less than half of raw disk write speed.  We
need to get rid of the dleaf writes, each of which carries only a
single pointer.  This will be done by adding a direct data pointer
attribute to the ileaf, which will nearly double performance for this
case.

Seeks come into the equation, of course.  Our big win is being able to
avoid writing allocation maps and inode table index blocks per delta,
which will reduce seeks on write considerably without increasing seeks
on read.  Having some kind of allocation policy will also help; until
we have one, it will be easy to generate loads that perform horribly.
But we should also be able to show pretty good performance on a few
important loads right away, without optimizing anything.  For example,
untar of lots of small files should work pretty well, even with the
unnecessary dleaf writes.

diff -r a4a4085c0c07 user/balloc-dummy.c
--- a/user/balloc-dummy.c	Mon Jan 26 12:55:03 2009 +0900
+++ b/user/balloc-dummy.c	Sun Jan 25 23:15:09 2009 -0800
@@ -1,12 +1,13 @@ int balloc(struct sb *sb, unsigned block
 int balloc(struct sb *sb, unsigned blocks, block_t *block)
 {
-	trace("-> %Lx\n", (L)sb->nextalloc);
-	*block = sb->nextalloc += blocks;
+	*block = sb->nextalloc;
+	sb->nextalloc += blocks;
+	trace("-> %Lx/%x", (L)*block, blocks);
 	return 0;
 }
 
 int bfree(struct sb *sb, block_t block, unsigned blocks)
 {
-	trace("<- %Lx, count %x\n", (L)block, blocks);
+	trace("<- %Lx/%x", (L)block, blocks);
 	return 0;
 }
diff -r a4a4085c0c07 user/btree.c
--- a/user/btree.c	Mon Jan 26 12:55:03 2009 +0900
+++ b/user/btree.c	Sun Jan 25 23:15:09 2009 -0800
@@ -12,10 +12,11 @@
 #include "tux3.h"	/* include user/tux3.h, not user/kernel/tux3.h */
 
 #ifndef trace
-#define trace trace_off
+#define trace trace_on
 #endif
 
 #include "balloc-dummy.c"
+#include "kernel/log.c"
 #include "kernel/btree.c"
 
 struct uleaf { u32 magic, count; struct uentry { u16 key, val; } entries[]; };
@@ -102,7 +103,7 @@ void *uleaf_resize(struct btree *btree, 
 		goto out;
 	if (uleaf_free(btree, leaf) < one)
 		return NULL;
-	printf("expand leaf at 0x%x by %i\n", at, one);
+	trace_off("expand leaf at 0x%x by %i", at, one);
 	vecmove(leaf->entries + at + one, leaf->entries + at, leaf->count++ - at);
 out:
 	return leaf->entries + at;
@@ -146,6 +147,7 @@ static void tree_expand_test(struct curs
 	struct uentry *entry = tree_expand(btree, key, 1, cursor);
 	*entry = (struct uentry){ .key = key, .val = key + 0x100 };
 	mark_buffer_dirty(cursor_leafbuf(cursor));
+cursor_redirect(cursor);
 	block_t block = bufindex(cursor_leafbuf(cursor));
 	release_cursor(cursor);
 
@@ -156,11 +158,17 @@ static void tree_expand_test(struct curs
 	release_cursor(cursor);
 }
 
+int errio(struct buffer_head *buffer, int write)
+{
+	return -EINVAL;
+}
+
 int main(int argc, char *argv[])
 {
 	struct dev *dev = &(struct dev){ .bits = 6 };
 	struct sb *sb = &(struct sb){ INIT_SB(dev), };
 	sb->volmap = rapid_open_inode(sb, NULL, 0);
+	sb->logmap = rapid_open_inode(sb, errio, 0);
 	init_buffers(dev, 1 << 20, 0);
 	sb->entries_per_node = (sb->blocksize - offsetof(struct bnode, entries)) / sizeof(struct index_entry);
 	printf("entries_per_node = %i\n", sb->entries_per_node);
@@ -181,12 +189,13 @@ int main(int argc, char *argv[])
 	int until_new_depth = sb->entries_per_node * btree.entries_per_leaf + 1;
 	for (int key = 0; key < until_new_depth; key++)
 		tree_expand_test(cursor, key);
-	show_tree_range(&btree, 0, -1);
+	show_tree(&btree);
+exit(0);
 	tree_chop(&btree, &(struct delete_info){ .key = 0 }, 0);
 
 	for (int key = until_new_depth * 100; key >= 0; key -= 100)
 		tree_expand_test(cursor, key);
-	show_tree_range(&btree, 0, -1);
+	show_tree(&btree);
 	free_cursor(cursor);
 	tree_chop(&btree, &(struct delete_info){ .key = 0 }, 0);
 
diff -r a4a4085c0c07 user/commit.c
--- a/user/commit.c	Mon Jan 26 12:55:03 2009 +0900
+++ b/user/commit.c	Sun Jan 25 23:15:09 2009 -0800
@@ -13,54 +13,6 @@
 #endif
 
 #include "inode.c"
-
-int cursor_redirect(struct cursor *cursor, unsigned level)
-{
-	struct btree *btree = cursor->btree;
-	struct sb *sb = btree->sb;
-	struct buffer_head *buffer = cursor->path[level].buffer;
-	assert(buffer_clean(buffer));
-
-	/* Redirect Block */
-
-	block_t oldblock = bufindex(buffer), newblock;
-	int err = balloc(sb, 1, &newblock);
-	if (err)
-		return err;
-	struct buffer_head *clone = blockget(mapping(sb->volmap), newblock);
-	if (!clone)
-		return -ENOMEM; // ERR_PTR me!!!
-	memcpy(bufdata(clone), bufdata(buffer), bufsize(clone));
-	mark_buffer_dirty(clone);
-	cursor->path[level].buffer = clone;
-	cursor->path[level].next += bufdata(clone) - bufdata(buffer);
-	log_redirect(sb, oldblock, newblock);
-	defer_free(&sb->defree, oldblock, 1);
-	brelse(buffer);
-
-	/* Update Parent */
-
-	if (level) {
-		struct index_entry *entry = cursor->path[level - 1].next - 1;
-		block_t parent = bufindex(cursor->path[level - 1].buffer);
-		assert(oldblock == from_be_u64(entry->block));
-		entry->block = to_be_u64(newblock);
-		log_update(sb, newblock, parent, from_be_u64(entry->key));
-		return 0;
-	}
-
-	if (btree != itable_btree(sb)) {
-		struct inode *inode = btree_inode(btree);
-		assert(oldblock == btree->root.block);
-		btree->root.block = newblock;
-		log_droot(sb, newblock, oldblock, inode->inum);
-		return 0;
-	}
-
-	assert(oldblock == from_be_u64(sb->super.iroot));
-	log_iroot(sb, newblock, oldblock);
-	return 0;
-}
 
 void replay(struct sb *sb)
 {
diff -r a4a4085c0c07 user/kernel/btree.c
--- a/user/kernel/btree.c	Mon Jan 26 12:55:03 2009 +0900
+++ b/user/kernel/btree.c	Sun Jan 25 23:15:09 2009 -0800
@@ -315,6 +315,55 @@ void show_tree(struct btree *btree)
 void show_tree(struct btree *btree)
 {
 	show_tree_range(btree, 0, -1);
+}
+
+int cursor_redirect(struct cursor *cursor)
+{
+	struct btree *btree = cursor->btree;
+	unsigned level = btree->root.depth;
+	struct sb *sb = btree->sb;
+	while (1) {
+		struct buffer_head *buffer = cursor->path[level].buffer;
+//		if (buffer_dirty(buffer))
+//			return 0;
+
+		struct buffer_head *clone = new_block(btree);
+		if (IS_ERR(clone))
+			return PTR_ERR(clone);
+		block_t oldblock = bufindex(buffer), newblock = bufindex(clone);
+		trace("redirect block %Lx to %Lx", oldblock, newblock);
+		memcpy(bufdata(clone), bufdata(buffer), bufsize(clone));
+		cursor->path[level].buffer = clone;
+		cursor->path[level].next += bufdata(clone) - bufdata(buffer);
+		brelse(buffer);
+		log_redirect(sb, oldblock, newblock);
+		defer_free(&sb->defree, oldblock, 1);
+
+		if (!level--) {
+			trace("redirect root");
+			if (btree != itable_btree(sb)) {
+				assert(oldblock == btree->root.block);
+				btree->root.block = newblock;
+				log_droot(sb, newblock, oldblock, tux_inode(btree_inode(btree))->inum);
+				return 0;
+			}
+
+			assert(oldblock == from_be_u64(sb->super.iroot));
+			log_iroot(sb, newblock, oldblock);
+			return 0;
+		}
+
+		trace("update parent");
+		struct index_entry *entry = cursor->path[level].next - 1;
+		block_t parent = bufindex(cursor->path[level].buffer);
+		assert(oldblock == from_be_u64(entry->block));
+		entry->block = to_be_u64(newblock);
+		log_update(sb, newblock, parent, from_be_u64(entry->key));
+		/*
+		 * Note: this may change a clean buffer which is then copied
+		 * and discarded, which seems a little strange but does no harm.
+		 */
+	}
 }
 
 /* Deletion */
@@ -663,8 +712,8 @@ int new_btree(struct btree *btree, struc
 	struct buffer_head *leafbuf = new_leaf(btree);
 	if (IS_ERR(rootbuf) || IS_ERR(leafbuf))
 		goto eek;
-	trace("root at %Lx\n", (L)bufindex(rootbuf));
-	trace("leaf at %Lx\n", (L)bufindex(leafbuf));
+	trace("root at %Lx", (L)bufindex(rootbuf));
+	trace("leaf at %Lx", (L)bufindex(leafbuf));
 	struct bnode *rootnode = bufdata(rootbuf);
 	rootnode->entries[0].block = to_be_u64(bufindex(leafbuf));
 	rootnode->count = to_be_u32(1);
diff -r a4a4085c0c07 user/kernel/tux3.h
--- a/user/kernel/tux3.h	Mon Jan 26 12:55:03 2009 +0900
+++ b/user/kernel/tux3.h	Sun Jan 25 23:15:09 2009 -0800
@@ -731,6 +731,11 @@ static inline int bufcount(struct buffer
 	return atomic_read(&buffer->b_count);
 }
 
+static inline int buffer_clean(struct buffer_head *buffer)
+{
+	return !buffer_dirty(buffer) || buffer_uptodate(buffer);
+}
+
 /* btree.c */
 struct buffer_head *cursor_leafbuf(struct cursor *cursor);
 void release_cursor(struct cursor *cursor);
@@ -827,6 +832,9 @@ unsigned encode_xsize(struct inode *inod
 /* log.c */
 void log_alloc(struct sb *sb, block_t block, unsigned count, unsigned alloc);
 void log_update(struct sb *sb, block_t child, block_t parent, tuxkey_t key);
+void log_redirect(struct sb *sb, block_t newblock, block_t oldblock);
+void log_droot(struct sb *sb, block_t newroot, block_t oldroot, tuxkey_t key);
+void log_iroot(struct sb *sb, block_t newroot, block_t oldroot);
 int defer_free(struct stash *defree, block_t block, unsigned count);
 int retire_frees(struct sb *sb, struct stash *defree);
 void destroy_defree(struct stash *defree);

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3