[Tux3] Patch : Data Deduplication in Userspace

Christensen Stefan stefan.christensen at uit.no
Wed Feb 25 07:57:17 PST 2009


> -----Original Message-----
> From: Philipp Marek [mailto:philipp.marek at emerion.com] 
> Sent: Wednesday, February 25, 2009 1:54 PM
> On Wednesday, 25 February 2009, Christensen Stefan wrote:

> My point is - either there *is* verification (then the hash function itself
> doesn't matter that much), or there is *none*.
> In the latter case you risk trashing your data.
> 
> As the amount of data stored will only grow, there's an increasing risk of
> collisions.
> 
> And, if you use a 512 bit hash for 4096*8 bits of data, you have 1/64th of
> your storage wasted for the data index alone.

That is a bit too much waste.
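
To put numbers on that, here is a minimal sketch of the index overhead for
a few hash sizes, assuming 4096-byte blocks and one hash stored per block.
The function name and the list of hash sizes are only illustrative:

```python
from fractions import Fraction

BLOCK_BITS = 4096 * 8  # one 4 KiB block, in bits


def index_overhead(hash_bits, block_bits=BLOCK_BITS):
    """Fraction of storage spent on the hash index (one hash per block)."""
    return Fraction(hash_bits, block_bits)


if __name__ == "__main__":
    # Hash sizes chosen for illustration only.
    for bits in (64, 160, 256, 512):
        print(f"{bits:3d}-bit hash: {index_overhead(bits)} of the data size")
```

A 512-bit hash gives the 1/64 mentioned above; a 64-bit hash would cost
1/512.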

> But if you're getting 1MB of data, and have to tell some hardware to do 256
> individual SHA2 calculations of 4kB each, you'll have some latency.

I'm not sure how fast SHA-2 runs on a current CPU, but I don't think it
would be slower than the transfer speed of disks (~70 MiB/s).
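
One way to check this rather than guess is to time it. The following is
only a rough sketch: it uses SHA-256 from Python's hashlib as a stand-in
for "SHA-2", hashes 4 KiB blocks one at a time, and the result depends
entirely on the machine it runs on. The function name is made up for this
example:

```python
import hashlib
import time

BLOCK_SIZE = 4096  # hash one 4 KiB block at a time, as in the discussion


def sha256_throughput_mib_s(total_mib=16):
    """Hash total_mib MiB in 4 KiB blocks and return the rate in MiB/s."""
    block = bytes(BLOCK_SIZE)  # all-zero block; the content does not affect speed
    n_blocks = total_mib * 1024 * 1024 // BLOCK_SIZE
    start = time.perf_counter()
    for _ in range(n_blocks):
        hashlib.sha256(block).digest()
    elapsed = time.perf_counter() - start
    return total_mib / elapsed


if __name__ == "__main__":
    print(f"SHA-256 over 4 KiB blocks: {sha256_throughput_mib_s():.0f} MiB/s")
```

Compare the printed figure with the disk's transfer rate to see which one
is the bottleneck on a given system.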

> If that's a simple calculation in the CPU, then you can already ask the SSD
> for the first (expected) data block after hashing the first 4kB.
> 
> Maybe it's better via extra hardware - I don't know.
> I just think that
> - a *big* hash, for collision-resistance, takes too much space; and
> - a smaller hash has probably collisions in our lifetime.
> So take some ASIC or GPU, and use that for a *simple* hash calculation; but
> *verify* the block, to make sure that nothing bad happens.

After thinking about it a bit, I think you are right: use only a cheap hash
to reduce the number of times you have to read the actual on-disk contents
of a block to see if it is the same. But you would have to keep a list of
disk blocks that have the same hash value but are different. Maybe a good
size for the hash value would be 64 bits, then. By the birthday bound you
would expect the first collision after roughly 2**32 blocks, but it
wouldn't take up too much space. And the criterion for the hash would be
few instructions. That way you would not need to offload the hash to an
ASIC/GPU.
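
As a rough illustration of what I mean (not a proposal for the actual Tux3
code), here is a small in-memory sketch. The class and function names are
made up, the "disk" is just a Python list, and the 64-bit hash is an
arbitrary cheap combination of CRC32 and Adler-32 from zlib, chosen only
because both are readily available:

```python
import zlib


def cheap_hash64(block):
    """Hypothetical cheap 64-bit hash: CRC32 in the high half, Adler-32 in the low."""
    return (zlib.crc32(block) << 32) | zlib.adler32(block)


def expected_colliding_pairs(n_blocks, hash_bits=64):
    """Birthday estimate: expected number of colliding pairs among n_blocks."""
    return n_blocks * (n_blocks - 1) // 2 / 2 ** hash_bits


class DedupStore:
    """Toy block store: cheap hash as a filter, byte compare as the verdict."""

    def __init__(self, hash_fn=cheap_hash64):
        self.hash_fn = hash_fn
        self.blocks = []  # stands in for the disk: block number -> data
        self.index = {}   # hash value -> block numbers sharing that hash

    def write(self, data):
        """Store data and return its block number, reusing an identical block."""
        candidates = self.index.setdefault(self.hash_fn(data), [])
        for blockno in candidates:
            # Same hash is only a hint; verify against the stored block.
            if self.blocks[blockno] == data:
                return blockno
        # New content (possibly a hash collision): store it and add it
        # to the list of blocks with this hash value.
        self.blocks.append(data)
        blockno = len(self.blocks) - 1
        candidates.append(blockno)
        return blockno


if __name__ == "__main__":
    store = DedupStore()
    first = store.write(b"x" * 4096)
    second = store.write(b"x" * 4096)  # duplicate, should reuse the block
    print(first == second, len(store.blocks))
    print(expected_colliding_pairs(2 ** 32))
```

The per-hash list is the "same hash but different" list from above. With
a 64-bit hash it should almost always hold a single entry, so the verify
step costs one block read per candidate duplicate.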

--

Stefan
