[Tux3] Bug? Atom refcounting redux

Daniel Phillips phillips at phunq.net
Mon Dec 1 20:56:10 PST 2008


On Monday 01 December 2008 18:38, Jim Avera wrote:
> I was randomly reading
> http://tux3.org/pipermail/tux3/2008-September/000186.html
> for pleasure and noticed a possible latent corruption bug in Daniel
> Phillips's post of the atom-refcount-update procedure (below).
> 
> If an i/o error occurs while reading the block containing the upper-16
> bits of refcount, the procedure nevertheless updates the low-16 bits of
> refcount on disk, and then returns an EIO error.   The fix probably
> requires bread-ing both blocks into separate buffers before modifying
> either (only one block if the upper-16 remain zero, of course).
> 
> I don't know anything about tux3, so perhaps some higher-level mechanism
> un-does the incorrect update of the low-16 refcount after the EIO is
> returned.  But if not, a flaky disk (or network link to a disk) might
> result in the refcount being silently reduced by 65535.

Exactly.  If we get EIO on a buffer update we will not commit the
transaction containing the atom update at all, and we will put the volume
into read-only mode just like Ext3 does.  That is our first line of
defense anyway; we will refine it as we go.  Ultimately, we want to be
able to "power through" a nasty like this and keep going, picking up the
pieces
as we can.  If that means walking all the inode table blocks in the
system to recount the xattr atoms, then so be it.  And we can run for
quite some time without tracking atom counts at all, which is nice for
this particular case.  We would map the refcount somewhere else (to a
different physical device hopefully) and continue, counting only the
_new_ reference counts, while gathering up the pre-existing ones from
the atom table.  (To be precise, we track only new updates to the
inodes we have already scanned during this recovery.)
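The recount idea above could be sketched roughly as follows.  This is a
toy model under loud assumptions, not Tux3 code: every name here
(struct recount, recount_step, set_inode_atom) is hypothetical, each inode
references at most one atom, and a plain array stands in for the inode
table.

```c
#include <stdint.h>

/* Hypothetical toy model, not Tux3 code: one atom reference per inode. */
enum { NINODES = 4, NATOMS = 2, NO_ATOM = -1 };

struct recount {
    int inode_atom[NINODES];   /* stands in for the inode table */
    uint32_t count[NATOMS];    /* replacement refcounts being rebuilt */
    int cursor;                /* inodes below this are already scanned */
};

/* Scan one more inode, adding its pre-existing reference.
 * Returns 0 once every inode has been scanned, 1 otherwise. */
static int recount_step(struct recount *r)
{
    if (r->cursor >= NINODES)
        return 0;
    int atom = r->inode_atom[r->cursor++];
    if (atom != NO_ATOM)
        r->count[atom]++;
    return 1;
}

/* Change which atom an inode references while the scan is in progress. */
static void set_inode_atom(struct recount *r, int inum, int atom)
{
    int old = r->inode_atom[inum];

    r->inode_atom[inum] = atom;
    if (inum >= r->cursor)
        return;                /* not scanned yet: the scan will count it */
    if (old != NO_ATOM)
        r->count[old]--;       /* scanned already: track the update now */
    if (atom != NO_ATOM)
        r->count[atom]++;
}
```

Updates to inodes the scan has not reached need no bookkeeping because
the scan will see their final state; only updates behind the cursor
adjust the new counts, which is why the volume can keep running while the
recount proceeds.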

> For reference, here is Daniel's code from the post (without endianness
> stuff):

Thanks for the sharp eyes :-)

It's not actually a bug, for the reason above, but I had not explained
that before, I think, so until I did it was a bug.
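To make that reasoning concrete, here is a minimal sketch, assuming a toy
in-memory model in which one refcount is split into a low and a high
16-bit half living in two separate blocks.  None of these names
(struct volume, read_block, atom_ref_add, end_transaction) come from Tux3;
they are hypothetical, and delta is assumed to be +1 or -1.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy model, not Tux3 code. */
enum { LOW, HIGH, NBLOCKS };

struct volume {
    uint16_t disk[NBLOCKS];    /* committed halves of one refcount */
    uint16_t cache[NBLOCKS];   /* buffered copies, possibly dirty */
    int dirty[NBLOCKS];
    int fail_read[NBLOCKS];    /* set to simulate EIO on that block */
    int readonly;
};

/* Return the buffer for a block, or NULL on a simulated read error. */
static uint16_t *read_block(struct volume *v, int b)
{
    if (v->fail_read[b])
        return NULL;
    if (!v->dirty[b])
        v->cache[b] = v->disk[b];
    return &v->cache[b];
}

/* Update the low half first, then the high half on carry or borrow.
 * Assumes delta is +1 or -1. */
static int atom_ref_add(struct volume *v, int delta)
{
    uint16_t *low, *high;
    int32_t sum;

    if (v->readonly)
        return -EROFS;
    if (!(low = read_block(v, LOW)))
        return -EIO;
    sum = *low + delta;
    *low = (uint16_t)sum;      /* dirtied before the high half is read */
    v->dirty[LOW] = 1;
    if (sum >= 0 && sum <= 0xFFFF)
        return 0;              /* no carry or borrow */
    if (!(high = read_block(v, HIGH)))
        return -EIO;           /* low half is dirty but not on disk */
    *high = (uint16_t)(*high + (sum < 0 ? -1 : 1));
    v->dirty[HIGH] = 1;
    return 0;
}

/* On error, drop the dirty buffers and go read-only; else write them. */
static int end_transaction(struct volume *v, int err)
{
    if (err) {
        memset(v->dirty, 0, sizeof v->dirty);
        v->readonly = 1;
        return err;
    }
    for (int b = 0; b < NBLOCKS; b++) {
        if (v->dirty[b]) {
            v->disk[b] = v->cache[b];
            v->dirty[b] = 0;
        }
    }
    return 0;
}
```

In this model the half-finished update only ever exists in the buffer
cache: because the failed transaction is never committed, the on-disk
refcount keeps its old value and the volume refuses further updates.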

Daniel



