[Tux3] Patch : Data Deduplication in Userspace

Philipp Marek philipp.marek at emerion.com
Wed Feb 25 04:53:57 PST 2009


On Wednesday, 25 February 2009, Christensen Stefan wrote:
> > -----Original Message-----
> > From: Philipp Marek [mailto:philipp.marek at emerion.com]
> > Sent: Wednesday, February 25, 2009 12:31 PM
> > That's the question ... if it's "cryptographically secure",
> > it means (AFAIU) that it's "hard" to get collisions ... but
> > it's not impossible.
> > Really, it's *guaranteed* that on a large-enough filesystem
> > (some TB, anyone?) you'll get two blocks with the same hash value.
>
> With a 512-bit hash value from SHA-2 (which is considered
> collision-resistant), you'll probably get a collision roughly
> after hashing 2**256 blocks (birthday paradox). This equates to
> an extremely large filesystem (2**218 PiB). By using SHA-1 (which has
> some problems, besides its limited size), you'll get a collision
> after about 2**80 blocks. 2**80 blocks is still a very large
> filesystem (2**42 PiB).
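The figures quoted above follow from simple birthday-bound arithmetic. A rough sketch (assuming 4 KiB blocks and the usual approximation that a collision becomes likely after about 2**(bits/2) hashes; the function name is illustrative, not Tux3 code):

```python
def collision_fs_size_pib(hash_bits, block_bytes=4096):
    """Approximate filesystem size (in PiB) at which a collision
    among per-block hashes becomes likely (birthday bound)."""
    blocks = 2 ** (hash_bits // 2)        # ~sqrt(2**hash_bits) blocks
    total_bytes = blocks * block_bytes    # 4 KiB per block
    return total_bytes // 2 ** 50         # 1 PiB = 2**50 bytes

print(collision_fs_size_pib(512) == 2 ** 218)  # SHA-512 figure: True
print(collision_fs_size_pib(160) == 2 ** 42)   # SHA-1 figure:   True
```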
Yes, I know. Thank you for calculating ;-)

> > Therefore I asked whether the risk is acceptable ... there
> > has been some filesystem (I think that was more than 10 years
> > ago, didn't find a link) that tried deduplication by some
> > hash - but got shot down, because without
> > *verification* that the data is identical you might
> > *silently* shoot yourself (and all others) in the foot.
>
> By using a large enough hash value there shouldn't be a problem.
> But it might be made a filesystem option.
My point is - either there *is* verification (then the hash function itself 
doesn't matter that much), or there is *none*.
In the latter case you risk trashing your data.

As the amount of data stored will only grow, there's an increasing risk of 
collisions.

And, if you use a 512 bit hash for 4096*8 bits of data, you have 1/64th of 
your storage wasted for the data index alone.
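Spelling out that overhead figure (the 512-bit hash size and 4 KiB block size are the ones used above):

```python
hash_bytes = 512 // 8      # a 512-bit hash is 64 bytes
block_bytes = 4096         # 4096 * 8 bits of data per block
overhead = hash_bytes / block_bytes
print(overhead == 1 / 64)  # True: exactly 1/64, about 1.56 %
```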

> > Ok.
> > But if verification is needed anyway, then something *much*
> > simpler (and *much* faster) would be OK, too.
>
> Any hash function that you'll use has a much shorter calculation
> time than any access to rotating media. Even SSDs are slower than
> the rate at which a mainstream CPU can calculate secure hashes.
The calculation itself, yes.
But if you're getting 1 MB of data, and have to tell some hardware to do 256 
individual SHA-2 calculations of 4 kB each, you'll have some latency.

If it's a simple calculation done in the CPU, then you can already ask the SSD 
for the first (expected) data block after hashing the first 4 kB.

Maybe it's better via extra hardware - I don't know.
I just think that
- a *big* hash, for collision-resistance, takes too much space; and
- a smaller hash will probably produce collisions within our lifetime.
So take some ASIC or GPU, and use that for a *simple* hash calculation; but 
*verify* the block, to make sure that nothing bad happens.
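The hash-then-verify scheme argued for above can be sketched in a few lines. This is only an illustration (the class and method names are made up, and CRC32 stands in for whatever *simple* hash one would pick); the hash merely narrows down candidates, and a byte-for-byte comparison decides whether a block may be shared:

```python
import zlib

class BlockStore:
    """Toy deduplicating block store: cheap hash + full verification."""

    def __init__(self):
        self.blocks = {}    # block_id -> block contents
        self.index = {}     # cheap hash -> list of block_ids
        self.next_id = 0

    def write(self, data):
        h = zlib.crc32(data)                 # fast, non-cryptographic hash
        for bid in self.index.get(h, []):
            if self.blocks[bid] == data:     # *verify* before sharing
                return bid                   # dedup hit: reuse the block
        bid = self.next_id                   # miss (or hash collision):
        self.next_id += 1                    # store a new block
        self.blocks[bid] = data
        self.index.setdefault(h, []).append(bid)
        return bid

store = BlockStore()
a = store.write(b"x" * 4096)
b = store.write(b"x" * 4096)   # identical content -> same block id
c = store.write(b"y" * 4096)   # different content -> new block id
print(a == b, a == c)          # True False
```

Because every dedup hit is confirmed by comparing the actual bytes, a collision in the cheap hash costs only an extra comparison, never silent data loss.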


Regards,

Phil


_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
