[Tux3] Draft kernel version of buffer fork

Daniel Phillips phillips at phunq.net
Tue Jan 13 21:12:25 PST 2009


For comment.  This compiles and seems to make sense.   I have not tried
it at all.  There are a few stubs to fill in below: I need to borrow a
couple of bits from the buffer state flags for the delta state, and set
up an array of delta dirty lists.
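
Roughly, the stubs below could be filled in along these lines.  This is
only a sketch, not tested: BH_Delta0 and the delta_dirty[] array in
struct sb are invented names, and it assumes the two borrowed bits are
enough to hold a delta number masked with DELTA_MASK.

// Two buffer state bits borrowed for the buffer's delta number
#define BH_Delta0 BH_PrivateStart

static inline unsigned bufdelta(struct buffer_head *buffer)
{
	return (buffer->b_state >> BH_Delta0) & DELTA_MASK;
}

static inline void set_bufdelta(struct buffer_head *buffer, unsigned delta)
{
	// Assumes the page lock serializes delta updates on this buffer
	buffer->b_state &= ~((unsigned long)DELTA_MASK << BH_Delta0);
	buffer->b_state |= (unsigned long)(delta & DELTA_MASK) << BH_Delta0;
}

// One dirty list per in-flight delta; relies on a local sb in scope,
// matching the one-argument stub below
#define delta_list(delta) (&sb->delta_dirty[(delta) & DELTA_MASK])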

The underlying page of the buffer being forked is brought uptodate and
locked using the mapping ->readpage method.  The reason for this is to
be sure there are no asynchronous reads in progress for other blocks on
the page, which would leave their data on the old copy of the page
instead of the new page we will substitute into the page cache.

In the case of reading from the volume, ->readpage does not know which
blocks are free or mapped to files, so it may read a few extra blocks
into cache, which should not matter because there is no extra seeking,
and sometimes it will act as a kind of physical readahead by bringing
several blocks into cache at once.  Also, the read_mapping_page call can be
skipped when block size is the same as page size, the common case.  And
forking should be relatively rare for physically mapped metadata.
Logging will confine most physical metadata updates to rollups, between
which a number of deltas will complete; each completing delta cleans its
dirty blocks and eliminates the need to fork.  At that point, forking
will just be
handling rare corner cases.
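
The blocksize check is not in the draft below; if added, it could look
roughly like this around the readpage step in fork_buffer (a sketch
only, using the same locals):

	lock_page(oldpage);
	if (blocksize < PAGE_CACHE_SIZE) {
		// Other blocks share the page, so force it uptodate to be
		// sure no async read for a sibling block is still in flight
		while (!PageUptodate(oldpage)) {
			unlock_page(oldpage);
			struct page *page = read_mapping_page(mapping, oldpage->index, NULL);
			if (IS_ERR(page))
				return PTR_ERR(page);
			page_cache_release(page);
			lock_page(oldpage);
		}
	}
	// else: the dirty block covers the whole page, so its data is
	// already valid and no read can race with the fork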

Parallel writes to other blocks on the same page are allowed.  Before
writing to buffer data, a user must call blockdirty, which will fork
all dirty blocks on the same page, under the page lock.  While waiting
for the page lock, another task could fork the page, bringing our
buffer dirty state into the current delta and allowing an early exit.
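
For context, one plausible shape for the caller side follows.  This
blockdirty is only a sketch of the contract described above, not the
real function; apart from fork_buffer and the tuxnode dirty list it
leans on the stub names from the draft.

// Sketch of the blockdirty contract; the details are guesses
static int blockdirty(struct buffer_head *buffer, unsigned newdelta)
{
	struct inode *inode = buffer->b_page->mapping->host;

	if (buffer_dirty(buffer)) {
		if (bufdelta(buffer) == newdelta)
			return 0;	// already dirty in the current delta
		// Dirty in an earlier delta: fork every dirty block on the
		// page.  fork_buffer takes the page lock itself and exits
		// early if another task forked the page while we waited,
		// leaving this buffer dirty in the current delta.
		return fork_buffer(buffer);
	}
	set_buffer_dirty(buffer);
	set_bufdelta(buffer, newdelta);
	// Current-delta dirty buffers go on the per-inode dirty list,
	// matching what fork_buffer below does for the forked buffer
	list_move_tail(&buffer->b_assoc_buffers, &tux_inode(inode)->dirty);
	return 0;
}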

Next, a page for the fork is allocated and buffers are placed on it.  The
two per-page buffer lists are then walked together.  Any dirty buffer
must be dirty in a previous delta, otherwise the buffer being forked
would already belong to the current delta, so we assert this.  The old
buffer, now pointing at the copy of the page, moves from the earlier
delta dirty buffer list it was on to the inode dirty list, for later
writeout, and the new buffer, pointing at the original page, takes its
place on the delta dirty list.  Spinlocking needs to be added to
protect both the per-inode and delta dirty lists.
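
A placeholder sketch of that locking, around the list moves in the loop
below; dirty_lock is an invented name for whatever spinlock ends up
covering both lists, and as the next paragraph notes it guards only the
list links, not the state of the buffers:

			// Sketch only: protect the dirty list links
			spin_lock(&sb->dirty_lock);
			list_move_tail(&oldbuf->b_assoc_buffers, inode_dirty_list);
			set_bufdelta(oldbuf, newdelta);
			list_move_tail(&newbuf->b_assoc_buffers, delta_list(olddelta));
			set_bufdelta(newbuf, olddelta);
			spin_unlock(&sb->dirty_lock);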

The list spinlocking will just protect the integrity of the list
itself, not the state of buffers on it.  Higher level synchronization
is required for that.  For example, in our immediate writeout mode of
atomic commit, when the delta_lock is held during delta transition,
there will be no more forking caused by filesystem operations, because
no operations are in progress.  This gets more complex when we allow
more than one delta in the update pipeline, and I am not sure of all
the details yet.

For now we have a much simpler situation: at delta transition,
forking stops except for the synchronous forking in the bitmap flush.
This is not much different from the single tasking situation in user
space, so it should not be too hard to get this to work and use the
same technique in kernel.  Over time we can consider the issues that
come up with more asynchronous usage.

This should work either in our new volume mapping or a file cache
mapping, such as a directory.

#define bufdelta(x) 0
#define set_bufdelta(x, y) 0
#define delta_list(x) NULL

int fork_buffer(struct buffer_head *buffer)
{
	struct page *oldpage = buffer->b_page;
	struct address_space *mapping = oldpage->mapping;
	struct inode *inode = mapping->host;
	struct sb *sb = tux_sb(inode->i_sb);
	tuxnode_t *tuxnode = tux_inode(inode);
	unsigned newdelta = sb->delta & DELTA_MASK;
	struct list_head *inode_dirty_list = &tuxnode->dirty;
	unsigned blocksize = sb->blocksize;

	// Use read_mapping_page to bring the full page uptodate
	// Take the page lock (protects the buffer list)
	lock_page(oldpage);
	while (!PageUptodate(oldpage)) {
		unlock_page(oldpage);
		struct page *page = read_mapping_page(mapping, oldpage->index, NULL);
		if (IS_ERR(page))
			return PTR_ERR(page);
		page_cache_release(page);	// drop the read's extra page reference
		lock_page(oldpage);
	}

	// The fork happened while waiting for the page lock?
	if (bufdelta(buffer) == newdelta) {
		unlock_page(oldpage);
		return 0;
	}

	// Allocate a new page and put buffers on it
	struct page *newpage = alloc_page(GFP_KERNEL);
	if (!newpage) {
		unlock_page(oldpage);
		return -ENOMEM;
	}
	create_empty_buffers(newpage, blocksize, 0);

	// Copy page data
	memcpy(page_address(newpage), page_address(oldpage), PAGE_CACHE_SIZE);

	// Walk the two buffer lists together
	struct buffer_head *oldbuf = (void *)oldpage->private, *oldlist = oldbuf;
	struct buffer_head *newbuf = (void *)newpage->private;
	do {
		// Carry over only the uptodate and dirty bits
		newbuf->b_state = oldbuf->b_state & ((1 << BH_Uptodate) | (1 << BH_Dirty));
		// New buffer takes the original page; old buffer moves to the
		// copy, which is substituted into the page cache below
		newbuf->b_page = oldpage;
		oldbuf->b_page = newpage;
		if (buffer_dirty(oldbuf)) {
			unsigned olddelta = bufdelta(oldbuf);
			assert(olddelta != newdelta);

			// Set old buffer dirty in the current delta
			list_move_tail(&oldbuf->b_assoc_buffers, inode_dirty_list);
			set_bufdelta(oldbuf, newdelta);

			// Add new buffer to the earlier delta list
			list_move_tail(&newbuf->b_assoc_buffers, delta_list(olddelta));
			set_bufdelta(newbuf, olddelta);
		}
		oldbuf = oldbuf->b_this_page;
		newbuf = newbuf->b_this_page;
	} while (oldbuf != oldlist);

	// Swap the page buffer lists
	oldpage->private = newpage->private;
	newpage->private = (unsigned long)oldlist;

	// The copy takes the old page's place in the cache, so it needs
	// the same index and mapping, and is uptodate by construction
	newpage->index = oldpage->index;
	newpage->mapping = mapping;
	SetPageUptodate(newpage);

	// Replace page in radix tree
	spin_lock_irq(&mapping->tree_lock);
	void **slot = radix_tree_lookup_slot(&mapping->page_tree, oldpage->index);
	radix_tree_replace_slot(slot, newpage);
	spin_unlock_irq(&mapping->tree_lock);
	get_page(newpage);
	put_page(oldpage);
	unlock_page(oldpage);
	return 0;
}
