[Tux3] Patch : Data Deduplication in Userspace

Christensen Stefan stefan.christensen at uit.no
Wed Feb 25 23:39:18 PST 2009


> -----Original Message-----
> From: Philipp Marek [mailto:philipp.marek at emerion.com] 
> Sent: Thursday, February 26, 2009 8:23 AM
> On Wednesday, 25 February 2009, Chinmay Kamat wrote:
> > We had thought of using a smaller hash function. However, the
> > following issue arises: the hash value of a block being written to
> > disk is calculated and compared in the tree index. If a match is
> > found, we are never sure whether the blocks are identical or it's a
> > hash collision. So we need to do a byte-by-byte comparison of the two
> > blocks: the current block being written and the block pointed to by
> > the matching tree entry. This would mean doing a disk read to fetch
> > the block pointed to by the tree entry. So each detection of a
> > duplicate block will have an overhead of a block read.

If deduplication is done at times of reduced IO, I don't see the
problem with doing an extra read for verification. Any hash function will
have collisions, and loss of data is an absolute no-no for filesystems.
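
To make the trade-off concrete, here is a minimal sketch of "weak hash plus
verification". It is only an illustration, not Tux3 code: the DedupStore
class, the in-memory list standing in for disk blocks, and the use of CRC32
as the small hash are all assumptions of mine.

```python
import zlib

class DedupStore:
    """Toy in-memory block store: weak hash index plus byte-for-byte check."""

    def __init__(self, hash_fn=zlib.crc32):
        self.hash_fn = hash_fn   # small, collision-prone hash (assumption: CRC32)
        self.blocks = []         # stands in for on-disk blocks; index = block number
        self.index = {}          # weak hash -> block numbers sharing that hash
        self.verify_reads = 0    # extra reads spent on verification

    def write(self, data: bytes) -> int:
        """Store data and return the block number that now holds it."""
        h = self.hash_fn(data)
        for blkno in self.index.get(h, []):
            self.verify_reads += 1           # the extra block read per candidate
            if self.blocks[blkno] == data:   # byte comparison rules out a collision
                return blkno                 # true duplicate: reuse existing block
        # No match, or only hash collisions: store a new block.
        self.blocks.append(data)
        blkno = len(self.blocks) - 1
        self.index.setdefault(h, []).append(blkno)
        return blkno
```

With a deliberately bad hash such as `DedupStore(hash_fn=lambda d: 0)` every
block collides, yet distinct blocks are still stored separately; the only
cost is more verification reads.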

> 
> I thought about that last night, and came to a similar idea as
> Michael:
>
> On Wednesday, 25 February 2009, Michael Keulkeul wrote:
> > If someone asked me, I would answer that verification is necessary
> > and using a weaker and smaller hash is fine.
> > Storing the block with its hash, marked "non deduplicated", is fine;
> > just dedup it later with a background process when the filesystem
> > has some idle IOPS to spend on it, and mark it "deduplicated" when
> > done.
> > I have the feeling that the tux3 design is neat for doing such a
> > thing (multiple trees, just add one to the forest).
>
> How about not doing deduplication on *every* block, but only for
> specially marked files?

This might reduce the problem of the extra IO needed to verify that
blocks are identical.
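
The deferred scheme Michael describes (store the block with its hash, mark it
"non deduplicated", merge it later when IO is idle) could be sketched roughly
as follows. Again this is a toy under stated assumptions, not Tux3 code: the
Block and LazyDedupVolume names, the dictionary-based file map, and CRC32 as
the weak hash are all hypothetical, and there is no reference counting or
handling of overwritten blocks.

```python
import zlib
from dataclasses import dataclass

@dataclass
class Block:
    data: bytes
    weak_hash: int
    deduplicated: bool = False   # False means "non deduplicated" (still pending)

class LazyDedupVolume:
    def __init__(self):
        self.blocks = {}     # physical block number -> Block
        self.file_map = {}   # (file name, logical index) -> physical block number
        self.next_blkno = 0

    def write(self, name, idx, data):
        """Fast path: store block and hash; no index lookup, no verify read."""
        blkno = self.next_blkno
        self.next_blkno += 1
        self.blocks[blkno] = Block(data, zlib.crc32(data))
        self.file_map[(name, idx)] = blkno
        return blkno

    def background_pass(self):
        """Run when IO is idle: verify pending blocks and merge true duplicates.

        Returns the number of physical blocks freed.
        """
        canonical = {}   # weak hash -> block numbers already marked "deduplicated"
        for blkno in sorted(self.blocks):
            blk = self.blocks[blkno]
            if blk.deduplicated:
                canonical.setdefault(blk.weak_hash, []).append(blkno)
        pending = [n for n in sorted(self.blocks) if not self.blocks[n].deduplicated]
        freed = 0
        for blkno in pending:
            blk = self.blocks[blkno]
            match = None
            for cand in canonical.get(blk.weak_hash, []):
                if self.blocks[cand].data == blk.data:   # verification read
                    match = cand
                    break
            if match is None:
                blk.deduplicated = True                  # unique so far: keep it
                canonical.setdefault(blk.weak_hash, []).append(blkno)
            else:
                for key, target in self.file_map.items():
                    if target == blkno:
                        self.file_map[key] = match       # repoint to the kept copy
                del self.blocks[blkno]
                freed += 1
        return freed
```

Restricting deduplication to specially marked files would amount to having
the background pass skip blocks whose file does not carry the mark.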


On a side note regarding security: there should be no way for any user
other than root to see whether a disk block is duplicated. Otherwise
information can leak to people who are not allowed to read another
user's files.

--
Stefan

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3


