Very good news!

It would be so cool to be able to use Tux3 as a backend for a SAN!
Journaling data + metadata is one solution, or sync writes (I don't
know how OCFS2 does it), but then I miss journaling, and the
performance, well, just sucks... Last thing: to my knowledge even ZFS
is not ready for that, and I would love to see Tux3 beat it on that
ground :)

Just a note: about the clustered filesystem, the "no data loss" part
is crucial if you host LUNs on the filesystem, because if a takeover
occurs and the LUN does not contain exactly what the initiator
expects, it is likely that bad things will happen (or worse), and
your takeover will have been pretty useless (or worse than simply
shutting down the target service).

And thanks again for your attention!

On 1/7/09, Daniel Phillips <phillips@phunq.net> wrote:
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">On Tuesday 06 January 2009 06:47, Michael Keulkeul wrote:<br>> Hi<br>> First I must say that tux3 is the coolest and cleanest (in many ways)<br>
> filesystem project I've seen sofar. I've seen a discussion thread about cool<br>> features, so I add mine.<br><br>Wow, geek praise doesn't come any higher than cool and clean, thanks<br>for that :-)<br>
<br>> Clustered filesystem :<br>> No grid or things like that, but the ability to maintain a coherent cache on<br>> a single other host that have access to the same disk backend in order to<br>> get a read only acces from this other host, and enable fast filesystem<br>
> takeover (Make the filesystem read/write on the "passive" host in 100ms or<br>> less) with no data loss. This would be nice to be able to use tux3 as<br>> backend for high availability luns, and might provide some "NVRAM" if you<br>
> assume that a each host of the cluster will not fail at the same time (each<br>> single host could provide some memory to the other).<br><br>Extending the Tux3 atomic commit to a cluster will be a fine project<br>
for later on. I don't think there is much point in doing a read only<br>partial cluster implementation. The techniques for keeping a coherent<br>cache on a cluster are well established by now.<br><br>More immediately, we will have replication, just as ddsnap/Zumastor<br>
has it now, except more efficient. This does not meet your 100 ms<br>takeover goal, but it will be useful for many common situations, like<br>serving home directories. It does not necessarily fail over with the<br>very latest data, which might not have been replicated yet.<br>
<br>The proper way to do what you want is with a cluster filesystem. It<br>would be lots of fun to turn Tux3 into a cluster filesystem. There<br>are plenty of interesting puzzles to solve. For now, OCFS2 is pretty<br>good.<br>
>
> > Filesystem freeze:
> > Get a utility that flushes the cache and returns something when
> > it's done, then freezes IO to disk and throttles/stacks it in a
> > memory buffer until the buffer is full. When it's full, return
> > something again and resume normal operation, or freeze IO until we
> > ask to resume. This is in order to take clean snapshots when the
> > backend supports versioning. Even if it's not necessary due to
> > tux3's design, it would be nice to be able to do it, in order to
> > ensure that some IO is committed to disk, then get some time to do
> > something to the disk backend, with no impact on the filesystem
> > side.
>
> I think all you want there is the ability to treat a snapshot as a
> barrier: user asks for a snapshot, Tux3 starts a new delta and sets a
> flag on it; when that snapshot has committed, the snapshot request
> is acknowledged. That way, the user gets a snapshot of what has been
> sent to the filesystem most recently, without needing to stall the
> filesystem throughput.
>
> Tux3 does not need a new memory buffer for this, the needed mechanism
> is just what has already been designed.
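
Just to check that I read the barrier idea right, here is roughly how
I picture the request/acknowledge flow, as a little C sketch. All the
names (struct delta, request_snapshot, delta_committed) are invented
for illustration, this is of course not real Tux3 code:

/* Toy model of "snapshot as a barrier": mark the current delta,
 * keep writing into the next one, acknowledge when the marked
 * delta has committed.  Invented names, not Tux3 code. */

#include <stdbool.h>
#include <stdio.h>

struct delta {
    unsigned seq;           /* delta sequence number */
    bool snapshot_flag;     /* user asked to snapshot this delta */
    bool committed;         /* set once the delta is on disk */
};

/* User asks for a snapshot: just flag the current delta and let
 * writes keep flowing into the next delta; no stall, no extra
 * memory buffer. */
static unsigned request_snapshot(struct delta *current)
{
    current->snapshot_flag = true;
    return current->seq;    /* which delta the ack will refer to */
}

/* Commit completion: once the flagged delta is durable, the
 * snapshot request can be acknowledged. */
static void delta_committed(struct delta *delta)
{
    delta->committed = true;
    if (delta->snapshot_flag)
        printf("snapshot of delta %u acknowledged\n", delta->seq);
}

int main(void)
{
    struct delta d = { .seq = 42 };
    request_snapshot(&d);   /* user asks for a snapshot */
    /* ... further writes go into the next delta meanwhile ... */
    delta_committed(&d);    /* ack arrives when delta 42 commits */
    return 0;
}

If I got that right, that covers my freeze use case without actually
freezing anything.
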
>
> > Choose bitmaps or extents at filesystem creation time:
> > Because you sometimes know that fragmentation will be your worst
> > foe (that can happen if you keep a lot of versions), and you don't
> > really care about metadata weight. If we could just choose, even
> > without any chance to change this after creation time, it would be
> > very very nice.
>
> I think we will be able to make that decision automatically, pretty
> reliably. There is a crossover point where extents become more compact
> than bitmaps. The plan is to convert automatically on crossing the
> threshold, being a little lazy to avoid too many conversions. If that
> proves too hard to implement, then a mount option might be a reasonable
> approach.
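
And on the lazy bitmap/extent conversion, here is a toy model of how
I imagine the crossover check, with a margin so a group sitting near
the threshold does not flip back and forth. The 8 bytes per extent
and the 25% margin are numbers I made up, not anything Tux3 actually
uses:

/* Toy crossover check: convert only when the other representation is
 * clearly smaller, so we stay lazy near the threshold.  Invented
 * sizes and margin, not Tux3's real accounting. */

#include <stddef.h>

enum alloc_map { ALLOC_BITMAP, ALLOC_EXTENTS };

size_t bitmap_bytes(size_t blocks_in_group)
{
    return (blocks_in_group + 7) / 8;   /* one bit per block */
}

size_t extent_bytes(size_t nr_extents)
{
    return nr_extents * 8;              /* assume 8 bytes per extent */
}

/* Switch only when the other form is at least 25% smaller. */
enum alloc_map choose_map(enum alloc_map current,
                          size_t blocks_in_group, size_t nr_extents)
{
    size_t bmap = bitmap_bytes(blocks_in_group);
    size_t ext = extent_bytes(nr_extents);

    if (current == ALLOC_BITMAP && ext * 4 < bmap * 3)
        return ALLOC_EXTENTS;   /* few extents: extents win clearly */
    if (current == ALLOC_EXTENTS && bmap * 4 < ext * 3)
        return ALLOC_BITMAP;    /* heavy fragmentation: back to bitmap */
    return current;             /* near the crossover: stay put */
}

With something like that, a group would only convert once it is well
past the crossover in either direction, which I guess is what "a
little lazy" means.
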
>
> > Thanks for your time reading this, and tell me if something does
> > not make sense, English is not my first language!
> > And thanks for all your efforts providing us a real modern Linux
> > filesystem that deserves that name!
> >
> > Michael
>
> I read it a couple of times and it made more sense the second time :-)
>
> You may have to wait a while for cluster Tux3, but the other two should
> arrive over the next few months.
>
> Regards,
>
> Daniel