[Tux3] Patch : Data Deduplication in Userspace
Christensen Stefan
stefan.christensen at uit.no
Wed Feb 25 23:39:18 PST 2009
> -----Original Message-----
> From: Philipp Marek [mailto:philipp.marek at emerion.com]
> Sent: Thursday, February 26, 2009 8:23 AM
> On Wednesday, 25 February 2009, Chinmay Kamat wrote:
> > We had thought of using a smaller hash function. However, the following
> > issue arises: the hash value of a block being written to disk is
> > calculated and compared in the tree index. If a match is found, we are
> > never sure whether the blocks are identical or it's a hash collision.
> > So we need to do a byte-by-byte comparison of the two blocks: the
> > current block being written and the block pointed to by the matching
> > tree entry. This would mean doing a disk read for the block pointed to
> > by the tree entry. So each detection of a duplicate block will have an
> > overhead of one block read.
If deduplication is done at times of reduced IO, I don't see the
problem with doing an extra read for verification. Any hash function can
produce collisions, and loss of data is an absolute no-no for filesystems.
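
The verify-on-match step described above can be sketched roughly as
follows. This is only an illustration, not Tux3 code: the class and
method names are hypothetical, a Python list stands in for the disk,
a dict stands in for the tree index, and CRC32 is just an example of a
small, weak hash.

```python
import zlib


class DedupStore:
    """Toy block store: weak hash lookup plus byte-by-byte verification."""

    def __init__(self, hash_fn=zlib.crc32):
        self.hash_fn = hash_fn   # small/weak hash; collisions are expected
        self.blocks = []         # simulated disk: block number -> bytes
        self.index = {}          # hash value -> list of block numbers
        self.verify_reads = 0    # extra reads spent on verification

    def read_block(self, blockno):
        return self.blocks[blockno]

    def write_block(self, data):
        """Return the block number that now holds `data`."""
        key = self.hash_fn(data)
        for candidate in self.index.get(key, []):
            # A hash match alone proves nothing, so read the stored
            # block back and compare the full contents.
            self.verify_reads += 1
            if self.read_block(candidate) == data:
                return candidate          # true duplicate: share the block
        # No match, or only collisions: store a new block.
        self.blocks.append(data)
        blockno = len(self.blocks) - 1
        self.index.setdefault(key, []).append(blockno)
        return blockno
```

Each hash match costs one extra read, but a collision can never cause
two different blocks to be merged.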
>
> I thought about that last night, and came to a similar idea as Michael:
>
> On Wednesday, 25 February 2009, Michael Keulkeul wrote:
> > If someone asked me, I would answer that verification is necessary and
> > using a weaker, smaller hash is fine.
> > Storing the block with its hash, marked "non deduplicated", is fine;
> > just dedup it later with a background process when the filesystem has
> > some idle IOPS to spend on it, and mark it "deduplicated" when done.
> > I have the feeling that the tux3 design is well suited to such a
> > thing (multiple trees, just add one to the forest).
>
> How about not doing deduplication on *every* block, but only for
> specially marked files?
This might help with the problem of doing extra IO to verify that blocks
are identical.
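
Michael's suggestion quoted above (write the block immediately, marked
"non deduplicated", and let a background process dedup it when there is
idle IO) might look roughly like this. Again only a sketch under
assumptions: all names are hypothetical, dicts stand in for the disk
and the index, the `budget` parameter stands in for "idle IOPS", and
overwrites and reference counting are deliberately left out.

```python
import zlib
from collections import deque


class DeferredDedupStore:
    """Toy store: fast writes now, verified deduplication later."""

    def __init__(self):
        self.blocks = {}        # simulated disk: block number -> bytes
        self.refs = {}          # logical id -> block number
        self.pending = deque()  # logical ids still "non deduplicated"
        self.index = {}         # weak hash -> deduplicated block numbers
        self.next_block = 0

    def write(self, logical_id, data):
        # Fast path: no index lookup and no verification read.
        if logical_id in self.refs:
            raise ValueError("overwrites are not modelled in this sketch")
        blockno = self.next_block
        self.next_block += 1
        self.blocks[blockno] = data
        self.refs[logical_id] = blockno
        self.pending.append(logical_id)

    def dedup_pass(self, budget):
        """Process up to `budget` pending blocks; return blocks freed."""
        freed = 0
        while self.pending and budget > 0:
            logical_id = self.pending.popleft()
            budget -= 1
            blockno = self.refs[logical_id]
            data = self.blocks[blockno]
            key = zlib.crc32(data)
            for candidate in self.index.get(key, []):
                # Verify byte by byte before sharing the block.
                if self.blocks[candidate] == data:
                    self.refs[logical_id] = candidate
                    del self.blocks[blockno]
                    freed += 1
                    break
            else:
                # No verified duplicate: this block becomes a dedup target.
                self.index.setdefault(key, []).append(blockno)
        return freed
```

Restricting `write` so that only specially marked files enter `pending`
would be one way to limit the verification IO further.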
On a side note regarding security: there should be no way for any user
other than root to see whether a disk block is duplicated. Otherwise
information can leak to people who are not allowed to read another
user's files.
--
Stefan
_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3