[Tux3] Patch : Data Deduplication in Userspace
Philipp Marek
philipp.marek at emerion.com
Wed Feb 25 23:22:57 PST 2009
On Mittwoch, 25. Februar 2009, Chinmay Kamat wrote:
> We had thought of using a smaller hash function. However, the following
> issue arises: the hash value of a block being written to disk is
> calculated and compared against the tree index. If a match is found, we
> are never sure whether the blocks are identical or it's a hash collision.
> So we need to do a byte-by-byte comparison of the two blocks: the current
> block being written and the block pointed to by the matching tree entry.
> This would mean doing a disk read to fetch the block pointed to by the
> tree entry. So each detection of a duplicate block will carry the
> overhead of a block read.
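Just to pin down what that means in code terms, the write path being
described is roughly this (the hash and tree helpers below are placeholders
I made up, not actual Tux3 interfaces):

#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

struct dedup_entry {
	uint32_t hash;		/* small hash, so collisions are possible */
	uint64_t blocknr;	/* block already on disk with this hash */
};

/* Placeholder helpers, assumed to exist somewhere in the fs code. */
uint32_t block_hash(const void *data);
struct dedup_entry *dedup_index_lookup(uint32_t hash);
int read_block(uint64_t blocknr, void *buf);
uint64_t write_new_block(const void *data);
void add_extra_ref(uint64_t blocknr);

uint64_t dedup_write(const void *data)
{
	uint32_t hash = block_hash(data);
	struct dedup_entry *hit = dedup_index_lookup(hash);

	if (hit) {
		char old[BLOCK_SIZE];

		/* This is the extra read: with a small hash we can never
		 * trust a match without comparing the actual bytes. */
		if (read_block(hit->blocknr, old) == 0 &&
		    memcmp(old, data, BLOCK_SIZE) == 0) {
			add_extra_ref(hit->blocknr);	/* real duplicate */
			return hit->blocknr;
		}
	}
	return write_new_block(data);	/* no match, or just a collision */
}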
I thought about that last night, and came to an idea similar to Michael's:
On Mittwoch, 25. Februar 2009, Michael Keulkeul wrote:
> If someone asked me, I would answer that verification is necessary and
> that using a weaker, smaller hash is fine.
> Storing the block with its hash, marked "non-deduplicated", is fine; just
> dedup it later with a background process when the filesystem has some idle
> iops to spend on it, and mark it "deduplicated" when done.
> I have the feeling that the tux3 design is a neat fit for such a thing
> (multiple trees, just add one to the forest).
How about not doing deduplication on *every* block, but only for specially
marked files?
I.e. the whole root filesystem is not marked; but if I create a big file that
is to be used as a loopback mount, and I want to keep several versions of it,
I could mark it for deduplication (reminds me of Copy-On-Write semantics),
which would cause it to be indexed in some way.
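As a user-visible sketch, that marking could be as simple as a per-file flag
set before the file is filled (the ioctl and flag below are pure invention,
nothing like them exists in Tux3 today):

#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <fcntl.h>
#include <unistd.h>

/* Invented for illustration only -- Tux3 defines no such ioctl or flag. */
#define TUX3_IOC_SETFLAGS	_IOW('t', 1, long)
#define TUX3_DEDUP_FL		0x1

int mark_for_dedup(const char *path)
{
	int fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	long flags = TUX3_DEDUP_FL;
	int ret = ioctl(fd, TUX3_IOC_SETFLAGS, &flags);

	close(fd);
	return ret;
}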
If I need a copy, I just mark the file to-be-written as "to be de-duplicated",
and the copying process finds the duplicate blocks.
(Or maybe there's some special "cowcp" for Tux3, since a real copy only needs
the extents refcounted?)
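Such a cowcp could be trivial from userspace; something along these lines,
where the clone ioctl is invented purely for the sake of the example and
Tux3 has nothing of the sort yet:

#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Made up for illustration: "give DST the same extents as SRC and bump
 * their refcounts" -- no data is actually copied. */
#define TUX3_IOC_CLONE	_IOW('t', 2, int)

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: cowcp SRC DST\n");
		return 1;
	}

	int src = open(argv[1], O_RDONLY);
	int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0666);
	if (src < 0 || dst < 0)
		return 1;

	int ret = ioctl(dst, TUX3_IOC_CLONE, src);	/* share extents */

	close(src);
	close(dst);
	return ret ? 1 : 0;
}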
Then there would be a lot fewer hash values computed, therefore fewer
collisions, and there would be *much* less read traffic.
Well, maybe the SSD people would create a new command: "compare this data
against blocks A, B, C and D; if there's a match, return the block number;
else write into block E".
As SSDs have no seek time, this should be (on average) faster than normal
writes.
[ A similar command would be good for RAID sets and online verification, BTW ]
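In host-side terms the semantics would be something like the following; the
command itself is pure speculation (it exists in no SSD spec I know of), and
read_block/write_block merely stand in for the device's internal media
access:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Placeholders for the device's internal flash access. */
int read_block(uint64_t blocknr, void *buf);
int write_block(uint64_t blocknr, const void *data);

/* "Compare this data against blocks A, B, C and D; if one matches, return
 * its number; else write the data into block E and return E." */
uint64_t compare_or_write(const void *data,
			  const uint64_t *candidates, size_t n,
			  uint64_t fallback)
{
	char buf[BLOCK_SIZE];

	for (size_t i = 0; i < n; i++) {
		if (read_block(candidates[i], buf) == 0 &&
		    memcmp(buf, data, BLOCK_SIZE) == 0)
			return candidates[i];	/* duplicate found */
	}

	write_block(fallback, data);	/* no match: store it in E */
	return fallback;
}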
> On the other hand, in the case of a larger hash (SHA1), when two hash
> values match, the blocks should be duplicates, assuming the chances of
> collision with large hashes are very remote. Probably more remote than
> hw failure.
Well, IMO even remote chances of unseen data damage are bad.
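(To put rough numbers on "remote": with a 160-bit hash and n distinct
blocks, the usual birthday estimate gives a collision probability of about
n^2 / 2^161. Even a petabyte of unique 4 KiB blocks is only n ~ 2^38, which
comes out around 2^-85. Tiny, yes, but the whole point of "unseen" damage is
that nothing ever tells you when it does hit.)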
Regards,
Phil