[Tux3] Patch : Data Deduplication in Userspace

Wed Feb 25 11:20:26 PST 2009

We had thought of using a smaller hash function. However the following
issue arises ---  hash value of a block being written to disk is
calculated and compared in the tree index. If a match is found, we are
never sure if the blocks are identical or its a hash collision. So we
need to do a byte by byte comparison of the 2 blocks- current block
being written and the block pointed to by matching tree entry. This
would  mean doing disk read for reading the block pointed to by tree
entry. So each detection of duplicate block will have an overhead of a
block read.

On the other hand, in case of a larger hash (SHA1) when 2 hash values match,
the blocks should be duplicates, assuming the chances of collision
with large hashes
are very remote. Probably more remote than hw failure.

The tree contains only the first 64 bits of the SHA1 hash,
so there is some optimization. ( we are working on handling collisions here).

About the SHA1 implementation, as Daniel mentioned in his mail, we
will be looking at
implementations other than openssl before the kernel port.

Regards,

Gaurav Tungatkar
Chinmay Kamat
Kushal Dalmia
Amey Magar

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3