[Tux3] Deduplication?
Chinmay Kamat
chinmaykamat at gmail.com
Sat Apr 18 10:00:04 PDT 2009
> On Thursday 16 April 2009, Ray Van Dolson wrote:
> > Potential end user here, hope you don't mind the intrusion. :-)
> >
> > I've read a bit here in the past on deduplication coming to Tux3. I'm
> > thinking of this in terms of block-based deduplication, a la what we
> > have on WAFL with NetApp. Will Tux3 have something along these lines?
> > Does it already?
Yes. As Daniel said, we have implemented block-level deduplication in the
Tux3 FUSE implementation. The code uses a btree-based index and what we call
'buckets' for faster search. The code, including collision handling, is at
www.bitbucket.org/cdkamat/tux3_dedup . You can try it, but it has not been
tested extensively :)
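
For readers who want a feel for the idea, here is a minimal sketch in
Python. It is not the code in the repository above. It assumes a fixed 4096
byte block size and SHA-1 fingerprints, uses a plain dict where the real
code uses a btree, and treats a "bucket" as the list of physical blocks
sharing one fingerprint, which may not match our actual bucket layout. All
names (DedupStore, write_block, write_file) are made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, not taken from the Tux3 code


class DedupStore:
    """Toy block store mapping fingerprint -> bucket of candidate blocks."""

    def __init__(self, fingerprint=None):
        # The fingerprint function is injectable so collisions can be forced.
        self.fingerprint = fingerprint or (lambda d: hashlib.sha1(d).digest())
        self.blocks = []    # physical block number -> block contents
        self.refcount = []  # physical block number -> number of references
        self.index = {}     # fingerprint -> bucket (list of block numbers)

    def write_block(self, data):
        """Store one block and return its physical block number."""
        bucket = self.index.setdefault(self.fingerprint(data), [])
        for pbn in bucket:
            # Collision handling: an equal fingerprint is only a hint, so
            # compare the actual bytes before sharing an existing block.
            if self.blocks[pbn] == data:
                self.refcount[pbn] += 1
                return pbn
        # No identical block found: allocate a new one in this bucket.
        self.blocks.append(data)
        self.refcount.append(1)
        pbn = len(self.blocks) - 1
        bucket.append(pbn)
        return pbn


def write_file(store, payload):
    """Split payload into fixed-size blocks; return their block numbers."""
    return [store.write_block(payload[i:i + BLOCK_SIZE])
            for i in range(0, len(payload), BLOCK_SIZE)]
```

Writing the same block twice returns the same physical block number and
bumps its reference count. Two different blocks that happen to share a
fingerprint land in the same bucket but keep separate physical blocks.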
> >
> > We love dedupe for its application in virtualization environments (for
> > storing VMDK files specifically, which are great candidates for
> > deduplication). However, I know of no filesystem other than WAFL
> > that has this functionality, and others like ZFS don't seem to be in a
> > hurry to add it for whatever reason. Maybe they don't see the value as
> > anything more than a niche thing more for backup administrators?
> >
> > Thanks!
> > Ray
>
> Our Pune Institute volunteers put in the work to demonstrate the
> effectiveness of deduplication, and to get a respectable level of
> functionality working. This pretty much decides the question for me.
> Block level deduplication will be supported by Tux3.
>
> There are competing approaches to deduplication that should also be
> investigated. Deduplication can be implemented at three different
> layers in the storage stack:
>
> 1) Block level
> 2) Filesystem level
> 3) Stacked filesystem level
>
> It is not immediately clear to me which is best. At least, by
> implementing at the filesystem level, there is no additional cost for
> storing metadata blocks. I suspect that the third approach, stacking
> filesystem, will eventually prove superior, as it removes complexity
> from the filesystem while not imposing much additional overhead. But
> that is for the future. Currently, the only effective way to stack a
> filesystem is by implementing it in user space under FUSE, and the FUSE
> codebase still looks a little young to me and not well suited to high
> volume applications. That could change in time. However, the current
> approach, as part of the filesystem, has the advantage of existing and
> apparently functions pretty well.
>
> Ongoing development of deduplication will be needed. I do not intend
> to put time into that myself, because other areas are more pressing.
> Volunteers to continue the work initiated by the Pune students would be
> welcome. First, the code needs to be forward ported to the current
> development tree, which would be an easy way for a volunteer or
> volunteers to get started. Then I would like to see the new "physical"
> btree used by the Pune code to store the block database changed to a
> logical form, where the new data is stored logically in a file. As it
> is, the physical form of the database would impose additional
> requirements on the atomic commit logging/replay mechanism, adding
> complexity. It would be preferable that deduplication not add code to
> the atomic commit path.
>
> Regards,
>
> Daniel
>
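One way to picture the "logical form" Daniel describes above is an index
whose records live at offsets inside an ordinary file, so that every update
is just a file write. The sketch below is only an illustration under
several assumptions: it swaps the btree for a fixed-size open-addressing
hash table to stay short, and the record layout, table size, and all names
(FileBackedIndex, RECORD, SLOTS) are hypothetical, not taken from Tux3.

```python
import hashlib
import struct
import tempfile

# Hypothetical record: 20-byte SHA-1 fingerprint + physical block number.
RECORD = struct.Struct("<20sQ")
SLOTS = 1024                 # assumed fixed table size for this sketch
EMPTY = bytes(RECORD.size)   # an all-zero record marks a free slot


class FileBackedIndex:
    """Fingerprint -> block number table stored in a regular file."""

    def __init__(self, f):
        self.f = f
        f.seek(0, 2)
        if f.tell() == 0:
            # New index file: fill it with empty slots.
            f.write(EMPTY * SLOTS)

    def _slot(self, fp, i):
        # Linear probing starting from the first 8 bytes of the fingerprint.
        return (int.from_bytes(fp[:8], "little") + i) % SLOTS

    def _read(self, slot):
        self.f.seek(slot * RECORD.size)
        return self.f.read(RECORD.size)

    def lookup(self, fp):
        for i in range(SLOTS):
            raw = self._read(self._slot(fp, i))
            if raw == EMPTY:
                return None
            key, pbn = RECORD.unpack(raw)
            if key == fp:
                return pbn
        return None

    def insert(self, fp, pbn):
        for i in range(SLOTS):
            slot = self._slot(fp, i)
            raw = self._read(slot)
            if raw == EMPTY or RECORD.unpack(raw)[0] == fp:
                # The update is an ordinary write at a logical file offset.
                self.f.seek(slot * RECORD.size)
                self.f.write(RECORD.pack(fp, pbn))
                return
        raise RuntimeError("index full")


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        index = FileBackedIndex(f)
        fp = hashlib.sha1(b"some block contents").digest()
        index.insert(fp, 42)
        print(index.lookup(fp))
```

Because the index only ever reads and writes file offsets, it leans on
whatever the filesystem already does to commit file data, which is the
property Daniel is asking for.
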
We might be done with our project, but we are still very much interested in
Tux3 development. Daniel, we are a bit busy right now with our exams. We will
be back to active development on deduplication and Tux3 (at least some of us
will). We will discuss the next approach with you, and will be on IRC soon.
Regards,
Gaurav Tungatkar
Chinmay Kamat
Kushal Dalmia
Amey Magar
_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3