[Tux3] Deduplication?

Chinmay Kamat chinmaykamat at gmail.com
Sat Apr 18 10:00:04 PDT 2009

> On Thursday 16 April 2009, Ray Van Dolson wrote:
> > Potential end user here, hope you don't mind the intrusion. :-)
> >
> > I've read a bit here in the past on deduplication coming to Tux3.  I'm
> > thinking of this in terms of block-based deduplication, a la what we
> > have on WAFL with NetApp.  Will Tux3 have something along these lines?
> > Does it already?

Yes. As Daniel said, we have implemented block-level deduplication in the 
Tux3 FUSE implementation. The code uses a btree-based index and what we call 
'buckets' for faster search. The code, including collision handling, is at 
www.bitbucket.org/cdkamat/tux3_dedup . You can try it, but note that it has 
not been tested extensively :)
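
For anyone who wants a rough picture of the idea before reading the code, 
here is a minimal sketch in Python. It is not the Tux3 code: the class and 
method names, the 4096-byte block size, the bucket count, the use of SHA-1, 
and the plain dict standing in for the btree index are all assumptions made 
for illustration only.

```python
import hashlib

BLOCK_SIZE = 4096    # assumed block size, for illustration only
NUM_BUCKETS = 256    # assumed bucket count, for illustration only


class DedupStore:
    """Toy block store: identical blocks are stored once and reference counted."""

    def __init__(self):
        self.blocks = []     # "physical" blocks, addressed by list index
        self.refcount = []   # one reference count per physical block
        # Each bucket maps a digest to the physical block numbers that have it.
        # A plain dict stands in for the btree index (an assumption).
        self.buckets = [dict() for _ in range(NUM_BUCKETS)]

    def write_block(self, data):
        """Store one block and return the physical block number holding it."""
        digest = hashlib.sha1(data).digest()
        # The first digest byte selects the bucket, narrowing the search.
        bucket = self.buckets[digest[0] % NUM_BUCKETS]
        candidates = bucket.setdefault(digest, [])
        for blocknr in candidates:
            # Collision handling: share a block only if the bytes really match.
            if self.blocks[blocknr] == data:
                self.refcount[blocknr] += 1
                return blocknr
        # No identical block found: allocate a new physical block.
        self.blocks.append(data)
        self.refcount.append(1)
        blocknr = len(self.blocks) - 1
        candidates.append(blocknr)
        return blocknr

    def write_file(self, data):
        """Split data into blocks and return their physical block numbers."""
        return [self.write_block(data[i:i + BLOCK_SIZE])
                for i in range(0, len(data), BLOCK_SIZE)]
```

In this sketch, writing three identical blocks followed by one different 
block through write_file() allocates only two physical blocks, with the 
shared block carrying a reference count of three.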

> >
> > We love dedupe for its application in virtualization environments (for
> > storing VMDK files specifically, which are great candidates for
> > deduplication).  However, I know of no filesystem other than WAFL
> > that has this functionality, and others like ZFS don't seem to be in a
> > hurry to add it for whatever reason.  Maybe they don't see the value as
> > anything more than a niche thing for backup administrators?
> >
> > Thanks!
> > Ray
>
> Our Pune Institute volunteers put in the work to demonstrate the
> effectiveness of deduplication, and to get a respectable level of
> functionality working.  This pretty much decides the question for me.
> Block level deduplication will be supported by Tux3.
>
> There are competing approaches to deduplication that should also be
> investigated.  Deduplication can be implemented at three different
> layers in the storage stack:
>   1) Block level
>   2) Filesystem level
>   3) Stacked filesystem level
> It is not immediately clear to me which is best.  At least, by
> implementing at the filesystem level, there is no additional cost for
> storing metadata blocks.  I suspect that the third approach, stacking
> filesystem, will eventually prove superior, as it removes complexity
> from the filesystem while not imposing much additional overhead.  But
> that is for the future.  Currently, the only effective way to stack a
> filesystem is by implementing it in user space under FUSE, and the FUSE
> codebase still looks a little young to me and not well suited to high
> volume applications.  That could change in time; however, the current
> approach, as part of the filesystem, has the advantage of existing and
> apparently functions pretty well.
>
> Ongoing development of deduplication will be needed.  I do not intend
> to put time into that myself, because other areas are more pressing.
> Volunteers to continue the work initiated by the Pune students would be
> welcome.  First, the code needs to be forward ported to the current
> development tree, which would be an easy way for a volunteer or
> volunteers to get started.  Then I would like to see the new "physical"
> btree used by the Pune code to store the block database changed to a
> logical form, where the new data is stored logically in a file.  As it
> is, the physical form of the database would impose additional
> requirements on the atomic commit logging/replay mechanism, adding
> complexity.  It would be preferable that deduplication not add code to
> the atomic commit path.
>
> Regards,
> Daniel

We might be done with our project, but we are still very much interested in 
Tux3 development. Daniel, we are a bit busy right now with our exams. After 
that, at least some of us will be back to active development on deduplication 
and Tux3. We will discuss the next approach with you, and will be on IRC soon.

Gaurav Tungatkar
Chinmay Kamat
Kushal Dalmia
Amey Magar

Tux3 mailing list
Tux3 at tux3.org
