[Tux3] Patch : Data Deduplication in Userspace
Christensen Stefan
stefan.christensen at uit.no
Wed Feb 25 07:57:17 PST 2009
> -----Original Message-----
> From: Philipp Marek [mailto:philipp.marek at emerion.com]
> Sent: Wednesday, February 25, 2009 1:54 PM
> On Wednesday, 25 February 2009, Christensen Stefan wrote:
> My point is - either there *is* verification (then the hash function
> itself doesn't matter that much), or there is *none*.
> In the latter case you risk trashing your data.
>
> As the amount of data stored will only grow, there's an increasing
> risk of collisions.
>
> And, if you use a 512 bit hash for 4096*8 bits of data, you have
> 1/64th of your storage wasted for the data index alone.
That is a bit too much waste.
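To put numbers on that, here is a small sketch of the index overhead for a few hash sizes, assuming one hash per 4 kB block as in the quoted text. The function name index_overhead is only illustrative.

```python
from fractions import Fraction

BLOCK_BITS = 4096 * 8  # one 4 kB block, as in the quoted mail


def index_overhead(hash_bits: int, block_bits: int = BLOCK_BITS) -> Fraction:
    """Fraction of storage spent on keeping one hash per data block."""
    return Fraction(hash_bits, block_bits)


if __name__ == "__main__":
    for bits in (512, 256, 64):
        # 512 bits gives 1/64, 256 bits gives 1/128, 64 bits gives 1/512
        print(f"{bits}-bit hash: {index_overhead(bits)} of the data size")
```

So a 64 bit hash would cost 1/512th of the storage instead of 1/64th.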
> But if you're getting 1MB of data, and have to tell some hardware to
> do 256 individual SHA2 calculations of 4kB each, you'll have some
> latency.
I'm not sure exactly how fast SHA-2 runs on a current CPU, but I
don't think it would be slower than the transfer speed of disks
(~70 MiB/s).
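One way to check this rather than guess is to time it. A rough sketch, assuming SHA-256 from Python's hashlib as the SHA-2 variant and hashing 4 kB blocks one at a time; the result depends entirely on the machine, so no number is claimed here.

```python
import hashlib
import time

BLOCK_SIZE = 4096  # 4 kB blocks, hashed individually


def sha256_throughput_mib_s(total_mib: int = 16) -> float:
    """Hash total_mib MiB of zeroed data in 4 kB blocks and return MiB/s."""
    block = bytes(BLOCK_SIZE)
    n_blocks = total_mib * 1024 * 1024 // BLOCK_SIZE
    start = time.perf_counter()
    for _ in range(n_blocks):
        hashlib.sha256(block).digest()
    elapsed = time.perf_counter() - start
    return total_mib / elapsed


if __name__ == "__main__":
    print(f"SHA-256 over 4 kB blocks: {sha256_throughput_mib_s():.0f} MiB/s")
```

If the printed figure is well above the disk's transfer rate, the hash itself is not the bottleneck.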
> If that's a simple calculation in the CPU, then you can already ask
> the SSD for the first (expected) data block after hashing the first
> 4kB.
>
> Maybe it's better via extra hardware - I don't know.
> I just think that
> - a *big* hash, for collision-resistance, takes too much space; and
> - a smaller hash has probably collisions in our lifetime.
> So take some ASIC or GPU, and use that for a *simple* hash
> calculation; but *verify* the block, to make sure that nothing bad
> happens.
After thinking about it a bit, I think you are right: only use a cheap
hash to reduce the number of times you have to read the actual on-disk
image of a block to check whether it is the same. But you would have to
keep a list of disk blocks that have the same hash value but are
different. Maybe a good size for the hash value would be 64 bits then.
By the birthday bound, a first collision becomes likely once there are
around 2**32 blocks, but the index wouldn't take up too much space. The
criterion for the hash would be that it takes few instructions, so you
would not need to offload it to an ASIC/GPU.
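A minimal sketch of what I mean, in Python rather than anything resembling actual Tux3 code. FNV-1a is just my stand-in for "a hash with few instructions", and DedupStore, write and the in-memory blocks list (standing in for the disk) are all made up for illustration.

```python
from typing import Callable, Dict, List

FNV_OFFSET = 0xCBF29CE484222325  # FNV-1a 64-bit offset basis
FNV_PRIME = 0x100000001B3        # FNV-1a 64-bit prime


def fnv1a_64(data: bytes) -> int:
    """Cheap 64-bit hash: one XOR and one multiply per byte."""
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


class DedupStore:
    """Toy block store: a cheap hash finds candidates, a compare decides."""

    def __init__(self, hash_fn: Callable[[bytes], int] = fnv1a_64) -> None:
        self.hash_fn = hash_fn
        self.blocks: List[bytes] = []          # stands in for the disk
        self.index: Dict[int, List[int]] = {}  # hash -> blocks sharing it

    def write(self, data: bytes) -> int:
        """Store data, or return the number of an identical existing block."""
        candidates = self.index.setdefault(self.hash_fn(data), [])
        for blockno in candidates:
            # Verify against the stored block; equal hashes prove nothing.
            if self.blocks[blockno] == data:
                return blockno
        # New content (possibly a hash collision): keep it in the same list.
        self.blocks.append(data)
        candidates.append(len(self.blocks) - 1)
        return len(self.blocks) - 1
```

The list per hash value is the "same hash but different" list from above: a collision only costs one extra compare, it never merges two different blocks.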
--
Stefan
_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3