[Tux3] Patch: Filesystem change brackets

Daniel Phillips phillips at phunq.net
Mon Jan 5 15:38:25 PST 2009


This patch introduces change brackets: begin_change and end_change,
similar to Ext3's journal_start/journal_stop.  A pair of brackets
groups together all backing store updates needed to bring the
filesystem to a consistent state as expected for a given
operation.

An end_change checks whether it is time to make a delta transition,
and if so, waits for all concurrent changes to run to completion,
then does the transition, including committing the current filesystem
state to disk.  A begin_change ensures that no commit will occur before
the matching end_change is reached.

This mechanism is described in greater detail here, with skeleton
code:

   http://kerneltrap.org/mailarchive/tux3/2008/12/30/4546684
   Design note: Atomic filesystem changes

(Note the correction in the followup post.)
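
As a rough illustration of the bracket semantics described above, here
is a minimal single-threaded sketch.  It is not the Tux3 code: the
struct sb_sketch type, its fields and the *_sketch function names are
all hypothetical, and the real mechanism has to wait for concurrent
changes, which this model replaces by letting the last end_change out
perform the transition.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the superblock; only what the sketch needs. */
struct sb_sketch {
	unsigned active;       /* changes currently inside a bracket */
	bool want_transition;  /* set when it is time for a delta transition */
	unsigned delta;        /* number of the current delta */
	unsigned commits;      /* deltas committed so far */
};

/* Placeholder for the real commit: just count it. */
static void commit_delta_sketch(struct sb_sketch *sb)
{
	sb->commits++;
}

/* Entering a bracket holds off any commit until the matching end. */
static void begin_change_sketch(struct sb_sketch *sb)
{
	sb->active++;
}

/*
 * Leaving a bracket: if a transition is due and no other change is
 * still in flight, commit the current delta and start the next one.
 */
static void end_change_sketch(struct sb_sketch *sb)
{
	assert(sb->active > 0);
	sb->active--;
	if (sb->want_transition && sb->active == 0) {
		commit_delta_sketch(sb);
		sb->delta++;
		sb->want_transition = false;
	}
}
```

With two changes open and a transition requested, the first
end_change_sketch does nothing further and the second one performs the
commit, which is the property the brackets are meant to provide.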

For now the commit_delta operation in end_change will use a crude
synchronous approach, including the steps:

  - Flush dirty inode table blocks
  - Flush dirty dleaf blocks
  - Flush dirty bitmap blocks
  - Flush log blocks
  - Wait for all transfers to complete
  - Update superblock pointer to new delta commit block

This is just to give an idea of what happens in end_change, which should
help in understanding what is going on in the patch below.
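
The ordering of the crude synchronous commit steps listed above can be
sketched as follows.  This only records the order of the steps; the
enum values, struct commit_trace and the function names are
hypothetical and no real I/O is done.  The point it illustrates is
that the superblock pointer is updated last, after all transfers have
completed.

```c
#include <assert.h>

/* Hypothetical names for the steps listed above, in commit order. */
enum commit_step {
	FLUSH_ITABLE,   /* dirty inode table blocks */
	FLUSH_DLEAF,    /* dirty dleaf blocks */
	FLUSH_BITMAP,   /* dirty bitmap blocks */
	FLUSH_LOG,      /* log blocks */
	WAIT_IO,        /* wait for all transfers to complete */
	UPDATE_SUPER,   /* point superblock at new delta commit block */
	STEP_COUNT
};

/* Records which steps ran, in order, instead of doing real I/O. */
struct commit_trace {
	enum commit_step done[STEP_COUNT];
	int count;
};

static int run_step(struct commit_trace *trace, enum commit_step step)
{
	if (trace->count >= STEP_COUNT)
		return -1;
	trace->done[trace->count++] = step;
	return 0;
}

/* Run every step in order, stopping at the first failure. */
static int commit_delta_steps(struct commit_trace *trace)
{
	static const enum commit_step order[] = {
		FLUSH_ITABLE, FLUSH_DLEAF, FLUSH_BITMAP,
		FLUSH_LOG, WAIT_IO, UPDATE_SUPER,
	};
	unsigned i;

	for (i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
		int err = run_step(trace, order[i]);
		if (err)
			return err;
	}
	return 0;
}
```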

Filesystem changes fall into three categories:

  - Name changes
      - Various operations in kernel/namei.c

  - Data changes
      - File truncate in inode.c
      - File write in filemap.c (wrong!!!)

  - Attribute changes
      - Setattr except truncate
      - Extended attributes

Name changes require grouping together the directory change and
updates to the inode table in the same delta.  Data changes require
grouping together the data write, index update and inode attribute
changes including ->i_size in the same delta.  Attribute changes
require grouping together the inode table and atom table changes.

Notes:

  * Operation brackets as currently conceived do not nest.  I am not
    sure whether they must nest; this needs pondering.

  * Delta staging and delta commit are big operations that take
    place under ->i_mutex locks, possibly other locks.  We need to
    think carefully about locks that staging and commit take.

  * map_region is not the right place to wrap data writes; this has
    to be done in a whole flock of higher level places.

  * The begin_change operation does not yet check that enough space is
    available for the worst case metadata space required by an
    operation.  It needs to do this, to avoid commit failing on ENOSPC
    for metadata.

  * Setattr is not done.  The VFS does truncates through setattr,
    creating a horrible little tangle.  Truncate is handled at a
    lower level, and we can leave other setattr attribute updates for
    later.

  * Truncate... this patch assumes it is always possible to do the
    truncate in one commit.  Is it?  Anyway it would be much better if
    all truncate did was log an i_size change, letting the truncate
    proceed incrementally after that.
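
The note above about checking worst case metadata space in
begin_change could look something like the following sketch.  This is
only one possible shape, under the assumption that each operation can
state a worst case block count up front; struct space_sketch and both
function names are hypothetical, and unlike the begin_change in this
patch the sketch returns an error code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical free space accounting, in blocks. */
struct space_sketch {
	unsigned long free_blocks;  /* blocks not yet allocated */
	unsigned long reserved;     /* blocks promised to open brackets */
};

/*
 * Reserve the worst case up front, so that a later commit cannot
 * run out of space for metadata.  Fails with -ENOSPC otherwise.
 */
static int begin_change_reserve(struct space_sketch *sp,
				unsigned long worst_case)
{
	if (sp->free_blocks - sp->reserved < worst_case)
		return -ENOSPC;
	sp->reserved += worst_case;
	return 0;
}

/* Drop the reservation and charge only what was actually used. */
static void end_change_release(struct space_sketch *sp,
			       unsigned long worst_case,
			       unsigned long used)
{
	assert(used <= worst_case);
	assert(worst_case <= sp->reserved);
	sp->reserved -= worst_case;
	sp->free_blocks -= used;
}
```

An operation that cannot get its reservation fails before it has
changed anything, rather than at commit time.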

Currently, begin_change and end_change are just no-ops, which is good
because the usage in map_region below is wrong and will not work: in
map_region, the page for which buffers are being mapped has not been
committed to disk yet, so if a delta commit occurs then, there will be
a small window when some data will be referenced before it has arrived
on disk.  Correct usage requires wrapping the write operations at a
higher level, which I will attempt in a later patch.  The purpose of
the current patch is to check the usage in the more straightforward
cases of name and xattr operations, and to check that all file
operations are covered.
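
For the name operations, the usage pattern in the patch is to open the
bracket once and route every exit through a single label so that the
bracket is always closed.  A minimal sketch of that pattern follows;
the depth counter, the *_sketch names and the error arguments are
hypothetical and stand in for the real inode and dirent calls.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical: number of brackets currently open. */
static int bracket_depth;

static void begin_change_sketch(void) { bracket_depth++; }
static void end_change_sketch(void) { bracket_depth--; }

/*
 * Hypothetical name operation: create an inode, then add a directory
 * entry.  The two arguments simulate the result of each step.
 */
static int name_op_sketch(int create_err, int dirent_err)
{
	int err;

	begin_change_sketch();
	err = create_err;
	if (err)
		goto out;
	/* On failure the inode creation would be undone here, still
	 * inside the bracket, before falling through to the exit. */
	err = dirent_err;
out:
	/* Single exit: the bracket is closed on success and failure. */
	end_change_sketch();
	return err;
}
```

Whatever the outcome, bracket_depth is back to zero when
name_op_sketch returns.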

This is a call for eyeballs!  This is the skeleton of atomic commit,
and it needs to be pondered carefully.

diff -r a24f282a8451 user/kernel/filemap.c
--- a/user/kernel/filemap.c	Mon Jan 05 03:05:20 2009 -0800
+++ b/user/kernel/filemap.c	Mon Jan 05 13:44:07 2009 -0800
@@ -20,6 +20,8 @@ void show_segs(struct seg map[], unsigne
 
 static int map_region(struct inode *inode, block_t start, unsigned count, struct seg map[], unsigned max_segs, int create)
 {
+	struct sb *sb = tux_sb(inode->i_sb);
+	begin_change(sb);
 	struct cursor *cursor = alloc_cursor(&tux_inode(inode)->btree, 1); /* allows for depth increase */
 	if (!cursor)
 		return -ENOMEM;
@@ -33,7 +35,6 @@ static int map_region(struct inode *inod
 	block_t limit = start + count;
 	trace("--- index %Lx, limit %Lx ---", (L)start, (L)limit);
 	struct btree *btree = cursor->btree;
-	struct sb *sb = btree->sb;
 	int err, segs = 0;
 
 	if (!btree->root.depth)
@@ -194,6 +195,7 @@ out_unlock:
 	else
 		up_read(&cursor->btree->lock);
 	free_cursor(cursor);
+	end_change(sb);
 	return segs;
 }
 
diff -r a24f282a8451 user/kernel/inode.c
--- a/user/kernel/inode.c	Mon Jan 05 03:05:20 2009 -0800
+++ b/user/kernel/inode.c	Mon Jan 05 13:44:07 2009 -0800
@@ -288,11 +288,13 @@ static void tux3_truncate(struct inode *
 	/* FIXME: must fix expand size */
 	WARN_ON(inode->i_size);
 	block_truncate_page(inode->i_mapping, inode->i_size, tux3_get_block);
+	begin_change(sb);
 	err = tree_chop(&tux_inode(inode)->btree, &del_info, 0);
 	inode->i_blocks = ((inode->i_size + sb->blockmask)
 			   & ~(loff_t)sb->blockmask) >> 9;
 	inode->i_mtime = inode->i_ctime = gettime();
 	mark_inode_dirty(inode);
+	end_change(sb);
 }
 
 void tux3_delete_inode(struct inode *inode)
diff -r a24f282a8451 user/kernel/namei.c
--- a/user/kernel/namei.c	Mon Jan 05 03:05:20 2009 -0800
+++ b/user/kernel/namei.c	Mon Jan 05 13:44:07 2009 -0800
@@ -60,15 +60,21 @@ static int tux3_mknod(struct inode *dir,
 //	if (!huge_valid_dev(rdev))
 //		return -EINVAL;
 
+	begin_change(tux_sb(dir->i_sb));
 	inode = tux_create_inode(dir, mode, rdev);
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		err = tux_add_dirent(dir, dentry, inode);
-		if (!err)
-			return 0;
+		if (!err) {
+			if ((inode->i_mode & S_IFMT) == S_IFDIR)
+				inode_inc_link_count(dir);
+			goto out;
+		}
 		inode_dec_link_count(inode);
 		iput(inode);
 	}
+out:
+	end_change(tux_sb(dir->i_sb));
 	return err;
 }
 
@@ -79,13 +85,9 @@ static int tux3_create(struct inode *dir
 
 static int tux3_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 {
-	int err;
 	if (dir->i_nlink >= TUX_LINK_MAX)
 		return -EMLINK;
-	err = tux3_mknod(dir, dentry, S_IFDIR | mode, 0);
-	if (!err)
-		inode_inc_link_count(dir);
-	return err;
+	return tux3_mknod(dir, dentry, S_IFDIR | mode, 0);
 }
 
 static int tux3_link(struct dentry *old_dentry, struct inode *dir,
@@ -97,6 +99,7 @@ static int tux3_link(struct dentry *old_
 	if (inode->i_nlink >= TUX_LINK_MAX)
 		return -EMLINK;
 
+	begin_change(tux_sb(inode->i_sb));
 	inode->i_ctime = gettime();
 	inode_inc_link_count(inode);
 	atomic_inc(&inode->i_count);
@@ -105,6 +108,7 @@ static int tux3_link(struct dentry *old_
 		inode_dec_link_count(inode);
 		iput(inode);
 	}
+	end_change(tux_sb(inode->i_sb));
 	return err;
 }
 
@@ -114,6 +118,7 @@ static int tux3_symlink(struct inode *di
 	struct inode *inode;
 	int err;
 
+	begin_change(tux_sb(dir->i_sb));
 	inode = tux_create_inode(dir, S_IFLNK | S_IRWXUGO, 0);
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
@@ -121,22 +126,26 @@ static int tux3_symlink(struct inode *di
 		if (!err) {
 			err = tux_add_dirent(dir, dentry, inode);
 			if (!err)
-				return 0;
+				goto out;
 		}
 		inode_dec_link_count(inode);
 		iput(inode);
 	}
+out:
+	end_change(tux_sb(dir->i_sb));
 	return err;
 }
 
 static int tux3_unlink(struct inode *dir, struct dentry *dentry)
 {
 	struct inode *inode = dentry->d_inode;
+	begin_change(tux_sb(inode->i_sb));
 	int err = tux_del_dirent(dir, dentry);
 	if (!err) {
 		inode->i_ctime = dir->i_ctime;
 		inode_dec_link_count(inode);
 	}
+	end_change(tux_sb(inode->i_sb));
 	return err;
 }
 
@@ -147,6 +156,7 @@ static int tux3_rmdir(struct inode *dir,
 
 	err = tux_dir_is_empty(inode);
 	if (!err) {
+		begin_change(tux_sb(inode->i_sb));
 		err = tux_del_dirent(dir, dentry);
 		if (!err) {
 			inode->i_ctime = dir->i_ctime;
@@ -155,6 +165,7 @@ static int tux3_rmdir(struct inode *dir,
 			mark_inode_dirty(inode);
 			inode_dec_link_count(dir);
 		}
+		end_change(tux_sb(inode->i_sb));
 	}
 	return err;
 }
@@ -176,6 +187,7 @@ static int tux3_rename(struct inode *old
 	/* FIXME: is this needed? */
 	BUG_ON(from_be_u64(old_entry->inum) != tux_inode(old_inode)->inum);
 
+	begin_change(tux_sb(old_inode->i_sb));
 	if (new_inode) {
 		int old_is_dir = S_ISDIR(old_inode->i_mode);
 		if (old_is_dir) {
@@ -225,9 +237,11 @@ static int tux3_rename(struct inode *old
 	if (!err && new_subdir)
 		inode_dec_link_count(old_dir);
 
+	end_change(tux_sb(old_inode->i_sb));
 	return err;
 
 error:
+	end_change(tux_sb(old_inode->i_sb));
 	brelse(old_buffer);
 	return err;
 }
diff -r a24f282a8451 user/kernel/tux3.h
--- a/user/kernel/tux3.h	Mon Jan 05 03:05:20 2009 -0800
+++ b/user/kernel/tux3.h	Mon Jan 05 13:44:07 2009 -0800
@@ -726,4 +726,6 @@ static inline struct inode *buffer_inode
 }
 #endif /* !__KERNEL__ */
 
+static inline void begin_change(struct sb *sb) { };
+static inline void end_change(struct sb *sb) { };
 #endif
diff -r a24f282a8451 user/kernel/xattr.c
--- a/user/kernel/xattr.c	Mon Jan 05 03:05:20 2009 -0800
+++ b/user/kernel/xattr.c	Mon Jan 05 13:44:07 2009 -0800
@@ -377,9 +377,11 @@ int set_xattr(struct inode *inode, const
 {
 	struct inode *atable = tux_sb(inode->i_sb)->atable;
 	mutex_lock(&atable->i_mutex);
+	begin_change(tux_sb(inode->i_sb));
 	atom_t atom = make_atom(atable, name, len);
 	int err = (atom == -1) ? -EINVAL :
 		xcache_update(inode, atom, data, size, flags);
+	end_change(tux_sb(inode->i_sb));
 	mutex_unlock(&atable->i_mutex);
 	return err;
 }
@@ -389,6 +391,7 @@ int del_xattr(struct inode *inode, const
 	int err = 0;
 	struct inode *atable = tux_sb(inode->i_sb)->atable;
 	mutex_lock(&atable->i_mutex);
+	begin_change(tux_sb(inode->i_sb));
 	atom_t atom = find_atom(atable, name, len);
 	if (atom == -1) {
 		err = -ENOATTR;
@@ -404,6 +407,7 @@ int del_xattr(struct inode *inode, const
 	if (used)
 		use_atom(atable, atom, -used);
 out:
+	end_change(tux_sb(inode->i_sb));
 	mutex_unlock(&atable->i_mutex);
 	return err;
 }
