[Tux3] Deduplication?

Sat Apr 18 10:00:04 PDT 2009

> On Thursday 16 April 2009, Ray Van Dolson wrote:
> > Potential end user here, hope you don't mind the intrusion. :-)
> >
> > I've read a bit here in the past on deduplication coming to Tux3.  I'm
> > thinking of this in terms of block-based deduplication, a la what we
> > have on WAFL with NetApp.  Will Tux3 have something along these lines?
> > Does it already?

Yes. As Daniel said, we have implemented block-level deduplication in the Tux3 
FUSE implementation. The code uses a btree-based index and what we call 
'buckets' for faster searching. This implementation, including collision 
handling, is at www.bitbucket.org/cdkamat/tux3_dedup . You can try it, but note 
that it has not been tested extensively :)
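
For anyone who wants a feel for the idea before reading the code, here is a 
rough Python sketch. It is not taken from tux3_dedup and only illustrates the 
general technique: hash each block, use the leading bits of the hash to pick a 
bucket, and compare block contents before sharing so that a hash collision 
cannot corrupt data. The SHA-1 hash, the 4096-byte block size, the bucket 
width, the plain dict standing in for the btree index, and all names here are 
assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096      # assumed block size
BUCKET_BITS = 16       # assumed: leading hash bits select a bucket


class DedupStore:
    """Toy block store; a dict stands in for the btree-based index."""

    def __init__(self):
        self.blocks = []     # physical block number -> data
        self.refcount = []   # physical block number -> reference count
        self.buckets = {}    # bucket id -> list of (digest, physical block)

    def write_block(self, data):
        """Store one block and return its physical block number."""
        digest = hashlib.sha1(data).digest()
        bucket_id = int.from_bytes(digest[:4], "big") >> (32 - BUCKET_BITS)
        bucket = self.buckets.setdefault(bucket_id, [])
        for known_digest, physical in bucket:
            # Collision handling: a matching digest is only a hint, so
            # compare the contents before sharing the block.
            if known_digest == digest and self.blocks[physical] == data:
                self.refcount[physical] += 1
                return physical
        # No identical block is known yet: allocate a new physical block.
        self.blocks.append(data)
        self.refcount.append(1)
        physical = len(self.blocks) - 1
        bucket.append((digest, physical))
        return physical


def write_file(store, payload):
    """Split payload into blocks; return the list of physical block numbers."""
    return [store.write_block(payload[i:i + BLOCK_SIZE])
            for i in range(0, len(payload), BLOCK_SIZE)]
```

Writing a payload whose first and third blocks are identical through 
write_file() should map both to the same physical block and raise that block's 
reference count, so only two blocks are actually stored.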

> >
> > We love dedupe for its application in virtualization environments (for
> > storing VMDK files specifically, which are great candidates for
> > deduplication).  However, I know of no filesystem other than WAFL
> > that has this functionality, and others like ZFS don't seem to be in a
> > hurry to add it for whatever reason.  Maybe they don't see the value as
> > anything more than a niche thing more for backup administrators?
> >
> > Thanks!
> > Ray
>
> Our Pune Institute volunteers put in the work to demonstrate the
> effectiveness of deduplication, and to get a respectable level of
> functionality working.  This pretty much decides the question for me.
> Block level deduplication will be supported by Tux3.
>
> There are competing approaches to deduplication that should also be
> investigated.  Deduplication can be implemented at three different
> layers in the storage stack:
>
>   1) Block level
>   2) Filesystem level
>   3) Stacked filesystem level
>
> It is not immediately clear to me which is best.  At least, by
> implementing at the filesystem level, there is no additional cost for
> storing metadata blocks.  I suspect that the third approach, a stacked
> filesystem, will eventually prove superior, as it removes complexity
> from the filesystem while not imposing much additional overhead.  But
> that is for the future.  Currently, the only effective way to stack a
> filesystem is by implementing it in user space under FUSE, and the FUSE
> codebase still looks a little young to me and not well suited to high
> volume applications.  That could change in time.  However, the current
> approach, implemented as part of the filesystem, has the advantage of
> existing, and it apparently functions pretty well.
>
> Ongoing development of deduplication will be needed.  I do not intend
> to put time into that myself, because other areas are more pressing.
> Volunteers to continue the work initiated by the Pune students would be
> welcome.  First, the code needs to be forward ported to the current
> development tree, which would be an easy way for a volunteer or
> volunteers to get started.  Then I would like to see the new "physical"
> btree used by the Pune code to store the block database changed to a
> logical form, where the new data is stored logically in a file.  As it
> is, the physical form of the database would impose additional
> requirements on the atomic commit logging/replay mechanism, adding
> complexity.  It would be preferable that deduplication not add code to
> the atomic commit path.
>
> Regards,
>
> Daniel
>

Our project may be finished, but we are still very much interested in 
Tux3 development. Daniel, we are a bit busy right now with our exams. Afterwards, 
at least some of us will be back to active development on deduplication and 
Tux3. We will discuss the next approach with you, and will be on IRC soon. 

Regards, 
Gaurav Tungatkar
Chinmay Kamat
Kushal Dalmia
Amey Magar

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3