[Tux3] Patch : Data Deduplication in Userspace

Christensen Stefan stefan.christensen at uit.no
Wed Feb 25 03:47:50 PST 2009

> -----Original Message-----
> From: Philipp Marek [mailto:philipp.marek at emerion.com] 
> Sent: Wednesday, February 25, 2009 12:31 PM
> That's the question ... if it's "cryptographically secure", 
> it means (AFAIU) that it's "hard" to get collisions ... but 
> it's not impossible.
> Really, it's *guaranteed* that on a large-enough filesystem 
> (some TB, anyone?) you'll get two blocks with the same hash value.
With a 512-bit hash value from SHA-2 (which is considered
collision-resistant), you'll probably get a collision only after
hashing roughly 2**256 blocks (birthday paradox). With 4 KiB blocks
this equates to an extremely large filesystem (2**218 PiB). By using
SHA-1 (which has some problems besides its limited size), you'll get
a collision after about 2**80 blocks. 2**80 blocks is still a very
large filesystem (2**42 PiB).
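
The figures above can be rechecked with a few lines of Python. This is
only a sketch of the arithmetic: the 4 KiB block size is an assumption
(it is what makes the PiB figures come out), and the function names are
made up for illustration.

```python
import math

BLOCK_SIZE_LOG2 = 12   # assumption: 4 KiB blocks (2**12 bytes)
PIB_LOG2 = 50          # 1 PiB = 2**50 bytes

def birthday_blocks_log2(hash_bits):
    """log2 of the block count where a collision becomes likely (~2**(n/2))."""
    return hash_bits // 2

def filesystem_size_log2_pib(hash_bits, block_size_log2=BLOCK_SIZE_LOG2):
    """log2 of the filesystem size, in PiB, that holds that many blocks."""
    return birthday_blocks_log2(hash_bits) + block_size_log2 - PIB_LOG2

def collision_probability(num_blocks_log2, hash_bits):
    """Approximate P(collision) = 1 - exp(-k**2 / 2**(n+1)) for k = 2**num_blocks_log2."""
    exponent = 2 * num_blocks_log2 - (hash_bits + 1)
    if exponent > 10:
        return 1.0  # exp(-2**exponent) is indistinguishable from 0 here
    return -math.expm1(-(2.0 ** exponent))

for name, bits in (("SHA-512", 512), ("SHA-1", 160)):
    print(f"{name}: ~2**{birthday_blocks_log2(bits)} blocks, "
          f"~2**{filesystem_size_log2_pib(bits)} PiB")
# SHA-512: ~2**256 blocks, ~2**218 PiB
# SHA-1: ~2**80 blocks, ~2**42 PiB
```

Note that at the birthday bound itself the approximation gives a
collision probability of about 0.39, so "probably" is to be read
loosely; for realistic block counts the probability is far smaller.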

> Therefore I asked whether the risk is acceptable ... there 
> has been some filesystem (I think that was more than 10 years 
> ago, didn't find a link) that tried deduplication by some 
> hash - but got shot down, because without
> *verification* that the data is identical you might 
> *silently* shoot yourself (and all others) in the foot.

With a large enough hash value there shouldn't be a problem.
But it could be a filesystem option.
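
To make the trade-off concrete, here is a toy sketch, assuming the
option in question is a byte-wise comparison on hash match. It is an
in-memory illustration only, not Tux3 code; the class DedupStore, its
verify flag and the hash_fn parameter are all hypothetical.

```python
import hashlib

def sha512_digest(data):
    return hashlib.sha512(data).digest()

class DedupStore:
    """Toy in-memory block store keyed by hash (hypothetical, not Tux3 code)."""

    def __init__(self, verify=True, hash_fn=sha512_digest):
        self.verify = verify    # the "filesystem option": compare bytes on a hash hit
        self.hash_fn = hash_fn
        self.blocks = {}        # digest -> list of distinct blocks with that digest
        self.saved = 0          # writes satisfied by an already stored block

    def write(self, data):
        """Store a block and return a (digest, index) reference to it."""
        digest = self.hash_fn(data)
        bucket = self.blocks.setdefault(digest, [])
        for i, existing in enumerate(bucket):
            # Without verification the hash is trusted blindly;
            # with it, a hit counts only if the bytes really match.
            if not self.verify or existing == data:
                self.saved += 1
                return digest, i
        bucket.append(data)
        return digest, len(bucket) - 1

    def read(self, ref):
        digest, i = ref
        return self.blocks[digest][i]
```

With a deliberately broken hash_fn that maps every block to the same
digest, the verify=False store hands back the wrong data for the second
block, which is the silent failure described above, while the
verify=True store keeps both blocks apart at the cost of one compare.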

> Ok.
> But if verification is needed anyway, then something *much* 
> simpler (and
> *much* faster) would be ok, too.

Any hash function you would use has a much shorter calculation
time than any access to rotating media. Even SSDs are slower than
the rate at which a mainstream CPU can compute secure hashes.
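
Anyone who wants to check this on their own hardware could measure it
with something like the sketch below. The function name, the 4 KiB
block size and the 16 MiB sample are arbitrary choices for
illustration; the numbers it prints depend entirely on the machine and
need to be compared against that machine's actual disk throughput.

```python
import hashlib
import time

def hash_throughput_mib_s(hash_name, block_size=4096, total_mib=16):
    """Hash total_mib of zero-filled blocks one at a time; return MiB/s."""
    block = bytes(block_size)
    count = total_mib * 1024 * 1024 // block_size
    start = time.perf_counter()
    for _ in range(count):
        hashlib.new(hash_name, block).digest()
    elapsed = time.perf_counter() - start
    return total_mib / elapsed

if __name__ == "__main__":
    for name in ("sha1", "sha256", "sha512"):
        print(f"{name}: {hash_throughput_mib_s(name):.0f} MiB/s")
```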

+ Stefan Christensen, Dept. Eng, | Ph:  +47 7764 6406 +
+ IT-department                  | Fax: +47 7764 4100 +

Tux3 mailing list
Tux3 at tux3.org