[Tux3] Patch : Data Deduplication in Userspace

Philipp Marek philipp.marek at emerion.com
Wed Feb 25 23:22:57 PST 2009

On Wednesday, 25 February 2009, Chinmay Kamat wrote:
> We had thought of using a smaller hash function. However, the following
> issue arises: the hash value of a block being written to disk is
> calculated and compared in the tree index. If a match is found, we are
> never sure whether the blocks are identical or it's a hash collision. So
> we need to do a byte-by-byte comparison of the two blocks: the current
> block being written and the block pointed to by the matching tree entry.
> This means doing a disk read for the block pointed to by the tree entry.
> So each detection of a duplicate block carries the overhead of a block
> read.

I thought about that last night, and came to an idea similar to Michael's:

On Wednesday, 25 February 2009, Michael Keulkeul wrote:
> If someone asked me, I would answer that verification is necessary and
> that using a weaker, smaller hash is fine.
> Store the block with its hash, marked "non-deduplicated"; then dedup it
> later with a background process when the filesystem has some idle iops
> to spend on it, and mark it "deduplicated" when done.
> I have the feeling that the Tux3 design is well suited to such a thing
> (multiple trees; just add one to the forest).
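
Concretely, I'd picture that background pass roughly like this (again only a
sketch, reusing the invented helpers from above):

    /* Idle-time pass over blocks still marked "non-deduplicated". */
    enum dedup_state { DEDUP_PENDING, DEDUP_DONE };

    extern int  next_pending(block_t *out);  /* next PENDING block   */
    extern int  fs_idle(void);               /* spare iops available? */
    extern void repoint_to(block_t from, block_t to); /* share + free */
    extern void set_state(block_t blk, enum dedup_state st);

    void dedup_background_pass(void)
    {
            block_t blk, match;
            uint8_t buf[BLOCK_SIZE], other[BLOCK_SIZE];

            while (fs_idle() && next_pending(&blk)) {
                    read_block(blk, buf);
                    uint32_t hash = weak_hash(buf, BLOCK_SIZE);

                    if (index_lookup(hash, &match) && match != blk) {
                            /* Verify before sharing; on a collision the
                             * block simply stays unshared. */
                            read_block(match, other);
                            if (memcmp(buf, other, BLOCK_SIZE) == 0)
                                    repoint_to(blk, match);
                    } else {
                            index_insert(hash, blk);
                    }
                    set_state(blk, DEDUP_DONE);
            }
    }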

How about not doing deduplication on *every* block, but only for specially 
marked files?

I.e. the whole root filesystem would not be marked; but if I create a big file
that is to be used as a loopback mount, and I want to keep several versions of
it, I could mark it for deduplication (this reminds me of copy-on-write
semantics), which would cause it to be indexed in some way.

If I need a copy, I just mark the file to be written as "to be deduplicated",
and the copying process finds the duplicate blocks.
(Or maybe there's some special "cowcp" for Tux3, since a real copy only needs 
the extents refcounted?)
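
For the marking itself, something as simple as an xattr might do; setxattr(2)
is real, but the attribute name here is invented for illustration:

    #include <sys/xattr.h>
    #include <stdio.h>

    /* Opt a file into deduplication.  "user.tux3.dedup" is a made-up
     * name; Tux3 defines no such attribute today. */
    int mark_for_dedup(const char *path)
    {
            if (setxattr(path, "user.tux3.dedup", "1", 1, 0) < 0) {
                    perror("setxattr");
                    return -1;
            }
            return 0;
    }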

Then there would be far fewer hash values computed, therefore fewer
collisions, and there would be *much* less read traffic.

Well, maybe the SSD people would create a new command: "compare this data
against blocks A, B, C and D; if there's a match, return the block number;
else write it into block E."
As an SSD has no seek time, this should be (on average) faster than reading
the candidate blocks back and comparing them host-side.

[ A similar command would be good for RAID sets and online verification, BTW ]
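
Spelled out, such a command could look like this (purely hypothetical; as far
as I know nothing of this multi-candidate form exists in any disk protocol):

    /* Hypothetical "compare-or-write": the device compares the payload
     * against up to four candidate blocks internally and only writes
     * if none of them match. */
    struct compare_or_write {
            const void *data;           /* one block of payload      */
            block_t     candidate[4];   /* blocks A, B, C and D      */
            block_t     fallback;       /* block E, used on no match */
    };

    /* Returns the matching candidate's number, or 'fallback' after
     * writing the data there. */
    extern block_t submit_compare_or_write(const struct compare_or_write *cmd);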

> On the other hand, in the case of a larger hash (SHA-1), when two hash
> values match, the blocks should be duplicates, assuming the chances of
> collision with large hashes are very remote. Probably more remote than
> hardware failure.
Well, IMO even a remote chance of unseen data damage is bad.
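
(Rough numbers, for scale: with a 160-bit hash over 2^32 blocks of 4 KiB,
i.e. 16 TiB of data, the birthday bound gives a collision probability of
about n^2 / 2^161 = 2^64 / 2^161 = 2^-97. That is indeed far below any
hardware error rate; but a collision would corrupt data *silently*, with
nothing to detect it.)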


