<div dir="ltr"><MaZe> ugh, oh right, what was the homework?<br><flips> read the superblock? ;-)<br><RazvanM> flips: homework is: know how the root dir is loaded and initialized, and how that differs from how any other inode is opened<br>

<flips> it was about loading the root directory<br>2008-09-16 20:00 -!- pranith(<a href="mailto:7aa040b1@webchat.mibbit.com">7aa040b1@webchat.mibbit.com</a>) has joined #tux3<br><flips> and what did we find?<br>

<MaZe> that it gets loaded explicitly<br><flips> because...<br><flips> because dir lookup doesn't work<br><MaZe> well it's the mount point<br><flips> because there is no dir to look up in<br>

<MaZe> root of the tree and all that<br><RazvanM> ACTION is searching for s_root...<br><flips> so we have to open the root dir "manually", using functionality that normally gets called by something like ext2_lookup<br>

<flips> not quite that function<br><flips> anyway<br><flips> we're starting somewhere different today<br><flips> because maze wants to go faster ;)<br><flips> so let's go to sys_write<br>

<MaZe> I'm guiltless - I tell you...<br><RazvanM> <a href="http://lxr.linux.no/linux+v2.6.26.5/fs/dcache.c#L1062">http://lxr.linux.no/linux+v2.6.26.5/fs/dcache.c#L1062</a><br><RazvanM> ok ok ok ok<br>

<MaZe> I think we killed lxr<br><flips> seems<br><flips> next time I'll go there before I announce the destination ;)<br><RazvanM> <a href="http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L370">http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L370</a><br>

<MaZe> <a href="http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L370">http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L370</a><br><RazvanM> it works from here<br><pranith> works here too<br><MaZe> Razvan's always faster ;-)<br>

<flips> ok, who wants to walk down into it?<br><flips> instead of me this time?<br><flips> seems to me, razvanm does that pretty well<br><flips> you know the first few layers<br><flips> it's just the same idea as sys_open<br>

<RazvanM> ACTION doesn't know too much about fs yet :(<br><flips> you know how to poke down into a syscall though<br><MaZe> file_pos_read and file_pos_write are probably to fetch and store the current file offset<br>

<flips> just keep clicking until you see something that isn't obvious<br><flips> let's look at those<br><MaZe> fget_light and fput_light must be fd to struct file lookup with locking<br><MaZe> so all that's left is vfs_write<br>

<flips> pretty simple (file_pos_read/write)<br><MaZe> which was kind of obvious to begin with ;-)<br><flips> I don't know why they're even abstracted<br><flips> fget/put_light are demented<br>

<flips> two of the most subtle and demented functions in the entire kernel<br><flips> don't worry about them today ;)<br><flips> they were conceived by a vile and twisted mind, and get to live because they are fast<br>

<MaZe> what's demented about them?<br><flips> heh<br><flips> later<br><flips> really<br><flips> google if you must<br><RazvanM> <a href="http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L313">http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L313</a><br>

<MaZe> ok, that's vfs_write<br><flips> suffice to say that they keep our file from disappearing while we are writing to it<br><flips> it would be bad otherwise<br><MaZe> right - locking<br><flips> razvanm, good, and what do you see there?<br>

<MaZe> a bunch of permission checks<br><MaZe> and then a f_op->write call<br><RazvanM> f_op->write if exists<br><flips> typical, right?<br><MaZe> provided it's available<br><flips> what you don't see is any locks being taken<br>

<RazvanM> ot do_sync_write otherwise<br><RazvanM> ot = or<br><flips> there is _very little locking_ in this path<br><flips> helping make it fast<br><MaZe> and a cute inc_syscw<br>2008-09-16 20:09 -!- kbingham(~<a href="mailto:kbingham@92.8.217.48">kbingham@92.8.217.48</a>) has joined #tux3<br>

<flips> the consequence of that is, the filesystem can be hit in a very parallel way<br><RazvanM> what is rw_verify_area?<br><MaZe> probably locking<br><flips> sometimes in ways that don't make sense, or are from buggy, racy applications, and the filesystem has to do something reasonable<br>

<flips> i.e., not crash and not corrupt<br><flips> rw_verify_area... hmm<br><MaZe> as in byte-range locks<br><flips> newish thing<br><flips> no sorry<br><flips> it's implementing flock<br>

<flips> bad name<br><flips> very<br><RazvanM> <a href="http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L196">http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L196</a><br><flips> we don't care about it really<br>

<MaZe> I'd guess it checks no-one else has locked the area we're about to write to<br><flips> normally nobody uses flock<br><flips> crufty old baggage<br><flips> more interesting that selinux has a hook there<br>

<- typical selinux hook<br><pranith> flips: inc_syscw.. tsk->syscw++<br><flips> but this is not really interesting, let's pop back out and go deeper<br><MaZe> that's a generic security hook though right?<br>

<flips> yes<br><flips> I forget what we call the generic harness<br><- back here<br><flips> next we see that meme again<br><flips> our fs can either completely replace the write logic with its own, or the vfs will supply a basic framework and call lower level methods in the fs<br>

<flips>  327                if (file->f_op->write)<br><flips>  328                        ret = file->f_op->write(file, buf, count, pos);<br><flips> very few fs's will use this hook<br><pranith> i thought we were supposed to use the vfs framework...<br>

<RazvanM> <a href="http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L288">http://lxr.linux.no/linux+v2.6.26.5/fs/read_write.c#L288</a><br><flips> almost all continue on down into do_sync_write<br><flips> which is still the vfs<br>
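The dispatch quoted above (use the filesystem's own write method if it exists, otherwise fall back to the generic path) can be sketched in plain userspace C. This is a toy model, not kernel code: every toy_ name, the struct layout, and the simplified signatures are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-ins for struct file and struct file_operations. */
struct toy_file;

struct toy_file_ops {
    /* Optional: a filesystem may supply its own complete write path. */
    ssize_t (*write)(struct toy_file *f, const char *buf, size_t len);
    /* Lower-level hook that the generic path calls into. */
    ssize_t (*aio_write)(struct toy_file *f, const char *buf, size_t len);
};

struct toy_file {
    const struct toy_file_ops *f_op;
    int used_generic_path;   /* set when the fallback ran, for illustration */
};

/* Generic fallback, playing the role of do_sync_write. */
static ssize_t toy_do_sync_write(struct toy_file *f, const char *buf, size_t len)
{
    assert(f->f_op->aio_write != NULL);
    f->used_generic_path = 1;
    return f->f_op->aio_write(f, buf, len);
}

/* Dispatch: prefer the filesystem's own write, else fall back. */
static ssize_t toy_vfs_write(struct toy_file *f, const char *buf, size_t len)
{
    if (f->f_op->write)
        return f->f_op->write(f, buf, len);
    return toy_do_sync_write(f, buf, len);
}

/* A pretend filesystem method that claims everything was written. */
static ssize_t toy_generic_aio_write(struct toy_file *f, const char *buf, size_t len)
{
    (void)f;
    (void)buf;
    return (ssize_t)len;
}

/* A pretend filesystem that leaves .write empty and only fills in aio_write. */
static const struct toy_file_ops toy_simple_fs_ops = {
    .write = NULL,
    .aio_write = toy_generic_aio_write,
};
```

With a file whose f_op points at toy_simple_fs_ops, toy_vfs_write(&f, "abc", 3) goes through the fallback and returns 3.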

<flips> most filesystems don't want to have the responsibility of doing all the things the vfs is about to do now<br>2008-09-16 20:16 -!- amey(~<a href="mailto:amey@116.73.35.180">amey@116.73.35.180</a>) has joined #tux3<br>

<- do_sync_write<br><flips> so, internally the kernel is kind of aio oriented<br><flips> asynchronous IO<br><flips> and synchronous IO is just a shell around it of the form "start an IO op; wait on a wait queue until it's done"<br>

<flips> we see that here<br><flips> very simple... if you don't poke into the details<br><flips> we will, but later<br><MaZe> <a href="http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/file.c#L50">http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/file.c#L50</a><br>
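The "start an IO op, then wait until it's done" shape can be sketched as a single-threaded userspace simulation. All toy_ names are hypothetical, and the busy loop stands in for sleeping on a wait queue; real kernel code would block the task rather than poll.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request object: one outstanding IO operation. */
struct toy_req {
    int done;        /* set by the completion callback */
    long result;     /* bytes transferred, filled in on completion */
    size_t len;
};

/* Single slot standing in for "the device has a request in flight". */
static struct toy_req *toy_in_flight;

/* Asynchronous start: queue the request and return immediately. */
static void toy_start_io(struct toy_req *req, size_t len)
{
    assert(toy_in_flight == NULL);
    req->done = 0;
    req->result = 0;
    req->len = len;
    toy_in_flight = req;
}

/* Completion callback, the role an end_io handler plays. */
static void toy_end_io(struct toy_req *req)
{
    req->result = (long)req->len;
    req->done = 1;
}

/* Pretend the device made progress: complete whatever is in flight. */
static void toy_device_tick(void)
{
    if (toy_in_flight) {
        struct toy_req *req = toy_in_flight;
        toy_in_flight = NULL;
        toy_end_io(req);
    }
}

/* Synchronous shell: start the async op, then wait until it is done. */
static long toy_sync_write(size_t len)
{
    struct toy_req req;

    toy_start_io(&req, len);
    while (!req.done)
        toy_device_tick();   /* a kernel would sleep on a wait queue here */
    return req.result;
}
```

The caller of toy_sync_write never sees the asynchronous machinery; it just gets the byte count back once the completion callback has run.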

<MaZe> <a href="http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2487">http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2487</a><br><flips> so now... we lose the trail<br><flips> because the vfs calls the real write action through a variable<br>

<flips> any suggestions how we can pick up that trail again?<br><RazvanM> aio_write :P<br><flips> filp->f_op->aio_write<br><flips> right<br><flips> we can grep the entire kernel for it<br>

<flips> or we can go back to ext2/inode.c<br><flips> where I know it is ;)<br><flips> let's do that<br><MaZe> <a href="http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2364">http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2364</a><br>

<flips> you're getting ahead ;)<br><flips> let's see how we get there<br><flips> and I was wrong about the file<br><flips> <a href="http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/file.c#L50">http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/file.c#L50</a><br>

<RazvanM> interesting<br><MaZe> ?<br><flips> now we see that ext2 just fills that in with a generic function<br><flips> that maze already found<br><flips> so let's click on it and go to filemap<br>

<RazvanM> even this fs is not interested in implementing it :D<br><flips> that's right<br><flips> ext2 mostly lets the vfs do everything for it<br><flips> and it's still 7,500 lines long<br><flips> worth considering what's in those 7,500 lines<br>

<flips> keep in mind that the VFS was essentially created just by taking a functioning filesystem and chopping it in half<br><flips> the top half, which became the vfs<br><flips> and the bottom half, which is a bunch of specific methods for doing things like figuring out the position of a block on disk<br>

<MaZe> and the bottom half which became the fs drivers<br><RazvanM> ext2 should still have something to say about the write...<br><flips> which became ext2 and all its friends<br><MaZe> might not<br>

<MaZe> ext2 is not journaled<br><MaZe> might just have a get_disk_block(file, offset)<br><flips> ext2 is happy to let the vfs take over completely here, but of course, the vfs will come back to ext2 at some point<br>

<pranith> why not ext3?<br><MaZe> and allocate/free_disk_block<br><flips> we will get there in about 5-10 minutes<br><pranith> ok<br><flips> for comparison, you could look at ext3/file.c<br><flips> let's do that later<br>

<flips> <a href="http://lxr.linux.no/linux+v2.6.26.5/+code=generic_file_aio_write">http://lxr.linux.no/linux+v2.6.26.5/+code=generic_file_aio_write</a><br><flips> <a href="http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/file.c#L50">http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/file.c#L50</a><br>

<MaZe> ext2 is not journaled - so each file is just a read/write collection of blocks on disk<br><flips> even ext3 doesn't normally journal data<br><MaZe> so all you need is the ability to lookup a given files/offsets block location on disk and you can read/write just fine<br>

<MaZe> but it can...<br><RazvanM> next step: <a href="http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2364">http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2364</a><br><flips> yes, and so it must supply different methods for its different journalling options<br>

<MaZe> <a href="http://lxr.linux.no/linux+v2.6.26.5/fs/ext3/file.c#L113">http://lxr.linux.no/linux+v2.6.26.5/fs/ext3/file.c#L113</a><br><flips> not *must*, but that is what it does<br><MaZe>  113        .aio_read       = generic_file_aio_read,<br>

<MaZe> 114        .aio_write      = ext3_file_write, <br><MaZe> so ext3 has its own write, but uses the generic read<br><flips> thanks razvanm<br><flips> notice that generic_file_aio_write didn't really do much<br>

<RazvanM> generic read but custom write... interesting<br><flips> just took care of some options<br><flips> optional unix semantics<br><flips> razvanm, sure, no journal needed on read<br><flips> finally, __generic_file_aio_write_nolock is doing something<br>

<flips> not much... but more than the others<br><RazvanM> aaaa... ext3 :D<br><MaZe> since on read you can just let the generic file/offset block lookup code handle it, but on write - you might need to go through the journal if the right mount options (data=ordered I think) were used<br>

<MaZe> or data=journaled - never sure<br><flips> here we see readv being implemented<br><flips> um<br><flips> writev<br><pranith> where?<br><flips> generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ);<br>

<flips> nr_segs... writev segs<br><flips> not important<br><flips> easy enough to understand<br><MaZe> is that verifying we can read the ram the user passed us?<br><flips> probably<br><flips> let's find out<br>

<flips> 1149                /*<br><flips> 1150                 * If any segment has a negative length, or the cumulative<br><flips> 1151                 * length ever wraps negative then return -EINVAL.<br>

<flips> 1152                 */<br><flips> no, just checking for properly formed structs<br><MaZe> if (access_ok(access_flags, iv->iov_base, iv->iov_len)) <br><MaZe> I think it does full access checks<br>

<flips> security<br><MaZe> note the return -EFAULT<br><flips> so we will rely on the mmu<br><flips> to fault<br><flips> and sometimes check for faulting conditions by hand<br><- access_ok just within memory or not<br>

<MaZe> no I think it checks by hand, but only returns EFAULT if first part is bad, otherwise it marks how many are good, and ignore the rest<br><flips> vfs_check_frozen implements the filesystem "freeze" feature... which is used for snapshotting<br>

<flips> kind of misconceived<br><MaZe> so you'll get a partial write instead of an EFAULT if you have a bad mapping in the middle of a writev<br><flips> sounds reasonable<br><MaZe> can't rely on mmu since we probably will use dma<br>
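The segment-checking behaviour discussed above (reject negative or wrapping lengths, fail with EFAULT only if the first segment is bad, otherwise keep the good prefix and produce a partial write) can be sketched in userspace C. The toy_ names are hypothetical, and toy_access_ok is a made-up window check standing in for the kernel's access_ok; it is an assumption for illustration, not the real test.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical "valid user memory" window, standing in for access_ok(). */
static uintptr_t toy_valid_lo, toy_valid_hi;

static int toy_access_ok(const void *base, size_t len)
{
    uintptr_t start = (uintptr_t)base;

    return start >= toy_valid_lo && start <= toy_valid_hi &&
           len <= toy_valid_hi - start;
}

/*
 * Walk the segments of a writev-style request.
 * Returns 0 and stores the usable byte count in *count, trimming *nr_segs
 * at the first inaccessible segment; returns -EINVAL for a negative or
 * wrapping length, and -EFAULT if the very first segment is inaccessible.
 */
static int toy_segment_checks(const struct iovec *iov,
                              unsigned long *nr_segs, size_t *count)
{
    size_t total = 0;
    unsigned long seg;

    for (seg = 0; seg < *nr_segs; seg++) {
        const struct iovec *iv = &iov[seg];

        total += iv->iov_len;
        if ((ssize_t)(total | iv->iov_len) < 0)
            return -EINVAL;
        if (toy_access_ok(iv->iov_base, iv->iov_len))
            continue;
        if (seg == 0)
            return -EFAULT;
        *nr_segs = seg;          /* keep only the good prefix */
        total -= iv->iov_len;    /* this segment is no good */
        break;
    }
    *count = total;
    return 0;
}
```

With three 16-byte segments where only the third falls outside the window, the call returns 0, trims the segment count to 2, and reports 32 usable bytes, which is the partial-write case MaZe describes.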

<flips> then we have a bunch of code associated with direct IO<br><flips> which we are going to skip<br><MaZe> <a href="http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2319">http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2319</a><br>

<flips> maze, true<br><flips> so we're going to check access somewhere<br><flips> but not here<br><flips> notice, no real work got done<br><flips> we're still just deepening the call chain and allowing for various options and whatnot<br>

<MaZe> at this point, we're seriously not expecting any real work to get done ;-)<br><flips> then we get to generic_file_buffered_write<br><RazvanM> ACTION does! :D<br><flips> think that's going to do work?<br>

<MaZe> nope<br><flips> you'd be right<br><flips> short break<br><flips> while I fill the wine glass<br><pranith> wine? i thought u wanted beer<br><pranith> ;)<br><flips> nobody sent any<br>

<pranith> aww<br><flips> ok here we go again<br><RazvanM> ACTION thinks a_ops->write_begin must be the key...<br><flips> we have a ->write_begin option<br><flips> which is new for me<br>

<MaZe> the two functions are right next to each other<br><MaZe> and look similar<br><flips> and that 2copy thing, likewise<br><MaZe> probably something aio related<br><flips> looks like brain damage<br>

<RazvanM> the 2copy is also using some a_ops<br><MaZe> notice a_ops<br><MaZe> is struct address_space_operations<br><RazvanM> <a href="http://lxr.linux.no/linux+v2.6.26.5/include/linux/fs.h#L444">http://lxr.linux.no/linux+v2.6.26.5/include/linux/fs.h#L444</a><br>

<flips> lost the scent for a moment<br><RazvanM> ACTION knows readpage from romfs...<br><MaZe> sounds mmap-ish<br><pranith> ACTION has to go to work :(<br><pranith> ACTION says bbyee, do post the logs ...<br>

<MaZe> guessing a_ops are operations that can be performed on mmaped fs pages<br><MaZe> with ability for fs to override it to trigger journaling etc<br><MaZe> bye bye<br><flips> ok, this code has been "worked on"<br>

<flips> rearranged hopefully for a good reason<br><RazvanM> readpage is the only 'read' the romfs is doing<br><flips> <a href="http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2231">http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2231</a><br>

<RazvanM> so its called not only for mmap stuff<br><flips> generic_perform_write<br><MaZe> that may be an optimization though<br><flips> this is where the real action happens<br><MaZe> who knows...<br>

<flips> or one form of real action<br><MaZe> <a href="http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2231">http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2231</a><br><flips> we're going to talk about a_ops<br>

<flips> this is the key to most filesystem io in linux<br><flips> ok, so here is a typical write meme<br><RazvanM> write_begin, write_end<br><flips> right<br><flips> and in between we copy data from userspace<br>

<flips> onto a page<br><MaZe>  copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes); <br><flips> so what is in write_begin? probably get a page into the page cache of an inode<br><flips> and write_end will send that page down to the hardware<br>

<MaZe> looks like the kernel basically mmaps in the page and then mmaps it out<br><flips> copy_from_user gets the data, and generates EFAULT if necessary<br><flips> either because of illegal access, or page swapped out<br>
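The write_begin / copy / write_end loop described above can be sketched with a toy page cache in userspace C. Everything here is hypothetical: the toy_ names, the fixed eight-page cache, and the use of memcpy in place of the copy from userspace. In this model write_end only marks the page dirty, leaving the trip to disk to a later step; that is an assumption of the sketch, not a statement about what the kernel's write_end does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_NR_PAGES  8

/* Hypothetical page cache for a single file: a few fixed pages. */
struct toy_page {
    char data[TOY_PAGE_SIZE];
    int dirty;
};

static struct toy_page toy_cache[TOY_NR_PAGES];

/* write_begin role: find the cache page that covers file offset pos. */
static struct toy_page *toy_write_begin(size_t pos)
{
    size_t index = pos / TOY_PAGE_SIZE;

    assert(index < TOY_NR_PAGES);
    return &toy_cache[index];
}

/* write_end role: here it only marks the page dirty for later writeback. */
static void toy_write_end(struct toy_page *page)
{
    page->dirty = 1;
}

/* Copy len bytes from buf into the cache starting at file offset pos. */
static size_t toy_perform_write(const char *buf, size_t pos, size_t len)
{
    size_t written = 0;

    while (written < len) {
        size_t offset = pos % TOY_PAGE_SIZE;     /* offset within the page */
        size_t bytes = TOY_PAGE_SIZE - offset;   /* room left in this page */
        struct toy_page *page;

        if (bytes > len - written)
            bytes = len - written;

        page = toy_write_begin(pos);
        /* memcpy stands in for the copy from userspace */
        memcpy(page->data + offset, buf + written, bytes);
        toy_write_end(page);

        pos += bytes;
        written += bytes;
    }
    return written;
}
```

A 5000-byte write at offset 4000 is split into three chunks (96, 4096 and 808 bytes) and dirties pages 0, 1 and 2, one loop iteration per page touched.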

<MaZe>      pagefault_disable(); <br><MaZe> uhm?<br><flips> things get interested if the page was swapped out to a swapfile on the same filesystem<br><flips> interesting<br><RazvanM> swapfile on the same filesystem??<br>

<flips> right<br><RazvanM> swapfile is not a separate fs?<br><flips> trying to prevent recursive fault<br><MaZe> sounds like that just turned off page-in<br><flips> I don't have the details at hand just now<br>

<flips> razvanm, swap can be separate, or it can be on a filesystem<br><flips> there are some nasty possible recursions when it's on a filesystem<br><MaZe> very nasty<br><RazvanM> ACTION doesn't know how to create a swap on a fs :|<br>

<flips> 2 minutes until question time<br><flips> it's going to be another "cliffhanger" ending<br><RazvanM> :-)<br><MaZe> lol<br><flips> now this function is not very instructive<br>

<flips> because it doesn't directly use the page cache ops<br><flips> it provides hooks for them<br><MaZe> are you sure we went into the right function? not the 2copy one?<br><flips> let's see if we can pop out and find a variant that does use the page cache ops<br>

<flips> I'm sure we didn't<br><flips> somebody has been messing with names<br><flips> I hope it was for a good reason<br><flips> it isn't always<br><MaZe> <a href="http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2063">http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2063</a><br>

<flips> and as you can see, the call chain is kind of unreasonably deep<br><MaZe> this all seems extremely complex<br><MaZe> for now I can't say unnecessarily... but...<br><RazvanM> what does the 2copy mean?<br>

<flips> yes, this looks like what remains of good old generic_write<br><MaZe> it means brain-dead original 1st copy apparently<br><flips> maze, I am happy to have reached your "complex" threshold<br>

<flips> it gets more complex<br><flips> in _2copy, we will alloc pages, map them into a page cache, copy data onto them, and submit them to disk<br><flips> we will call the fs's ->write_page method to do the latter<br>

<flips> and that method will figure out _where_ on disk the page should go<br><flips> I don't know what 2copy means<br><MaZe> why do we have to copy_from_user<br><MaZe> can't we write directly from userspace data?<br>

<flips> feels like... wanking... but I will know for sure for thursday's session<br><flips> maze, because this is _buffered_ write<br><flips> we are placing the data in cache<br><MaZe> oh, right<br>

<flips> we can't just place references to pages in cache<br><flips> because the user data is not necessarily properly aligned<br><MaZe> couldn't we just rip the page out from under the user, and give him a r/o cow page?<br>

<flips> linus does want to attempt something like that<br><flips> but it's too hard, even for him<br><RazvanM> ACTION doesn't see the write_page....<br><flips> me neither<br><RazvanM> there is prepare_write<br>

<MaZe> <a href="http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2192">http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2192</a><br><flips> homework is: see the writepage<br><flips> ;-)<br><RazvanM> and commit_write<br>

<flips> on thursday we will pick up at the writepage<br><MaZe> I'm not sure why there would need to be a write page<br><flips> yep, it looks like _2copy really is the new incarnation of generic_write<br>

<flips> it used to just be generic_write<br><flips> but then it started getting more and more "wrapped"<br><flips> until we see this thing<br><flips> unreadable thing you could say<br><RazvanM> :-)<br>

<flips> maze, the purpose of the ->writepages in there is to get dirty, buffered pages onto disk<br>2008-09-16 20:57 -!- kbingham(~<a href="mailto:kbingham@92.20.210.138">kbingham@92.20.210.138</a>) has joined #tux3<br>

<MaZe> won't commit_write do that?<br><flips> ah, that's what you asked<br><flips> why two<br><flips> no good reason actually<br><flips> there's usually a "prepare_write" and a "commit_write"<br>

<RazvanM> <a href="http://lxr.linux.no/linux+v2.6.26.5/include/linux/fs.h#L458">http://lxr.linux.no/linux+v2.6.26.5/include/linux/fs.h#L458</a><br><flips> one or the other generally doesn't do much<br><MaZe> there's a writepage, writepages, prepare_write, commit_write, write_begin, write_end ...<br>

<MaZe> pick'n'choose<br><flips> yes<br><flips> big mess<br><flips> linux IO is trying to find its identity <br><MaZe> lol<br><RazvanM> it was simpler and nicer in the past?<br><flips> beginning of 2.6 was simpler, yes<br>

<flips> o_direct is a very good thing, but it added considerable complexity<br><MaZe> it looks like different file systems use different interfaces<br><flips> likewise aio<br><flips> maze, somewhat true<br>

<flips> almost everybody uses generic_write<br><MaZe> and thus we have a lot<br><flips> not much global structural analysis goes on<br><flips> so that the structure can be simplified<br><flips> because that doesn't add new features<br>

<flips> or fix bugs<br><MaZe> are the address_space_operations fs internal?<br><flips> introduces them more likely<br><MaZe> or are they more global mm?<br><flips> but it makes the code messy<br>

<flips> like many such things in linux, they are usually library methods<br><flips> kernel library<br><flips> which the fs can lightly wrap<br><flips> or use directly<br><flips> the ->writepages thing is a relatively new invention<br>

<flips> that allows the filesystem to map more than one page at a time for IO<br><flips> led to nice benchmark improvements<br><flips> and more mess in filemap.c<br><tim_dimm> and this is where variable page sizes will get interesting<br>

<flips> filemap.c is where most of the impact is, yes<br><flips> insightful<br><flips> 4 minutes over ;)<br><flips> how did we do for pacing today?<br><tim_dimm> i try<br><tim_dimm> nice pace<br>

<MaZe> pretty decent I think<br><tim_dimm> sorry I asked so many questions<br><flips> ok, we will be back into write on thursday<br><tim_dimm> ;-)<br><MaZe> tim_dimm: ask questions - it's the only way to learn anything<br>

<RazvanM> ACTION is not happy with the length though ;-)<br><flips> homework is: find the implementations of the ->writepage calls in ext2<br><tim_dimm> I was just trying to figure out what / where to read<br>

<tim_dimm> never been inside the kernel like that before<br><flips> it's bizarre, isn't it<br><noob<br><tim_dimm> yeah<br><MaZe> so here's a question: buffered, aio, o_direct - what are the permutations/combinations, what do they mean, and how do they interact with each other if the same spot is being accessed via different means<br>

<flips> maze, very good question, and the answer is: with considerable complexity<br><MaZe> lovely answer<br><flips> it is necessary to maintain cache consistency with all possible combinations<br><MaZe> that's like my friend at work, who sits next to me and regularly answers either/or questions with a 'yes' spoken in a deadpan voice<br>

<tim_dimm> are there hooks for cache consistency or is it handled another way?<br><flips> that is why that section handling o_direct that we skipped is so... um... interesting<br><flips> tim_dimm, the vfs handles it<br>

<flips> and there are rules that the fs has to follow<br><MaZe> O_DIRECT means unbuffered straight to disk, right?<br><flips> basically "do not skate over that cliff"<br><MaZe> and is pretty meaningless for read...<br>

<flips> maze, right<br><flips> o_direct write has to invalidate any buffer data at that point<br><MaZe> all synchronous io should be easily implementable via aio<br><flips> also flush out dirty buffered data in that range<br>

<tim_dimm> did you guys cover vfs on another tux3 night?<br><flips> maze, it is<br><flips> tim_dimm, partly<br><flips> this is part of the vfs we're doing now<br><MaZe> so you basically need to support {buffered | direct } asynchronous io<br>

<tim_dimm> would it be worthwhile to have an entire session on it?'<br><flips> we did an easy one first<br><flips> maze, yes<br><flips> in fact we already looked at the functions that support it<br>

<flips> tim_dimm, that was essentially the first session<br><MaZe> o_direct write has to invalidate any buffered data at that point - uh? <br><tim_dimm> k, I'll revisit in the logs<br><flips> maze, yes<br>

<MaZe> buffered data for what?<br><flips> somebody might have been reading/writing the device with buffered ops at the same time<br><flips> this is not uncommon<br><MaZe> oh, the buffered but not yet written stuff gets dropped?<br>

<flips> flushed to disk<br><MaZe> or overwritten with the - so flushed, not invalidated<br><MaZe> what gets invalidated?<br><flips> you're right, fully replaced pages get dropped<br><flips> partially replaced pages have to be flushed<br>

<MaZe> so it's not so much invalidated, as overwritten and thus dropped/replaced with the new data<br><flips> right<br><flips> haven't spent a lot of time in that code myself<br><flips> but that's correct<br>
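The rule the two arrive at (a cached page fully covered by a direct write gets dropped, a partially covered one has to be flushed first) can be written down as a small classifier. This only encodes the rule as stated in the chat, which flips himself hedges; it has not been checked against the kernel's o_direct code, and the toy_ names and the 4096-byte page size are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096

enum toy_action { TOY_KEEP, TOY_FLUSH, TOY_DROP };

/*
 * Decide what to do with the cached page at page_index when a direct
 * write covers the byte range [pos, pos + len).
 */
static enum toy_action toy_direct_write_action(size_t page_index,
                                               size_t pos, size_t len)
{
    size_t page_start = page_index * TOY_PAGE_SIZE;
    size_t page_end = page_start + TOY_PAGE_SIZE;   /* exclusive */
    size_t end = pos + len;                         /* exclusive */

    assert(len > 0);
    if (end <= page_start || pos >= page_end)
        return TOY_KEEP;    /* no overlap: the cached copy is still valid */
    if (pos <= page_start && end >= page_end)
        return TOY_DROP;    /* fully replaced: the cached copy is obsolete */
    return TOY_FLUSH;       /* partly replaced: push cached data out first */
}
```

For a 6000-byte direct write at offset 4096, page 1 is fully covered and dropped, page 2 is only partly covered and flushed, and pages 0 and 3 are untouched.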

<MaZe> does O_DIRECT mean anything on read?<br><flips> yes<br><flips> will not read from buffer afaik<br><MaZe> Try to minimize cache effects of the I/O to and from this  file<br><flips> but I could be wrong<br>

<MaZe> according to man open, basically skip buffer cache populating<br><flips> anything not buffered is read directly from disk and not added to the page cache<br><MaZe> unless already there<br><flips> so o_direct read avoids double buffering<br>

<RazvanM>        O_DIRECT (Since Linux 2.4.10)<br><RazvanM>               Try  to  minimize cache effects of the I/O to and from this file.  In general this will degrade performance, but it is useful in special<br>

<RazvanM>               situations, such as when applications do their own caching.  File I/O is done directly to/from user space  buffers.   The  I/O  is  syn-<br><RazvanM>               chronous, that is, at the completion of a read(2) or write(2), data is guaranteed to have been transferred.  See NOTES below for further<br>

<RazvanM>               discussion.<br><RazvanM>               A semantically similar (but deprecated) interface for block devices is described in raw(8).<br><flips> I'm not sure what it does with already-buffered data<br>

<flips> if dirty then it _must_ use the dirty version<br><MaZe> so, how expensive is a write to read only page fault?<br><RazvanM> from man 2 open, sorry for the long lines<br><flips> but I don't know if it does that by flushing it first, then reading it back, or doing buffered read just for that bit<br>

<MaZe> yeah, found it<br><MaZe> doesn't look like there's any requirement to flush<br><MaZe> seems like O_DIRECT read is meant for access once - not worth caching - data<br><flips> yes<br><flips> still leaves the question about what it does with pages already in cache, or dirty in cache<br>

<MaZe> it says minimize<br><flips> shall we leave that as your homework?<br><MaZe> not ignore cache<br><flips> can't rely on the man page<br><RazvanM> the pages should not be dirty for too long<br>

<flips> have to read the code<br><RazvanM> :D<br><MaZe> from NOTES<br><MaZe> Applications  should  avoid  mixing O_DIRECT and normal I/O to the same<br><MaZe>        file, and especially to overlapping byte  regions  in  the  same  file.<br>

<MaZe>        Even when the filesystem correctly handles the coherency issues in this<br><MaZe>        situation, overall I/O throughput is likely to  be  slower  than  using<br><MaZe>        either  mode alone.  Likewise, applications should avoid mixing mmap(2)<br>

<MaZe>        of files with direct I/O to the same files.<br><flips> one thing you see is that o_direct has to be constantly checking the page cache to be sure nothing is aliased there<br><MaZe> "The  thing  that has always disturbed me about O_DIRECT is that<br>

<MaZe>               the whole interface is just stupid, and was probably designed by<br><MaZe>               a  deranged monkey on some serious mind-controlling substances."<br><MaZe>               -- Linus<br>

<flips> maze, the advice is often ignored<br><flips> linux is not absolved from responsibility for keeping the cache consistent<br><MaZe> right<br><flips> linus doesn't run a database company<br>

<MaZe> lol<br><flips> which is why he thinks that<br><flips> the interface is quite simple<br><flips> open with o_direct, make sure your data is aligned<br><shapor> hi all<br><flips> maze, how'd you do with reading your superblock<br>

<MaZe> hey<br><flips> shapor, right on time ;)<br><MaZe> I slept well, thank you ;-)<br><flips> good thing we have logs<br><shapor> yeah<br><shapor> reading now<br><MaZe> I'm going to be working on it now<br>

<flips> maze, that little subproject will be highly instructive<br><MaZe> agreed<br><MaZe> it already has been<br><flips> especially if you write your own custom endio<br><flips> and figure out how to have your task (which is "mount") wait on a wait queue for the io to complete<br>

<MaZe> exactly<br><MaZe> well, it's the in-kernel portion of mount<br><flips> it's all not very much code, but each line takes about 15 minutes of study<br><flips> or maybe an hour the first time<br>

<MaZe> I expect I need something, sleep on something, wake something from endio<br><flips> precisely<br><MaZe> apparently something called a waitqueue<br><flips> the waiting bits are covered in a nice tutorial manner on lwn<br>

<MaZe> so probably something like a dynamic init of a waitqueue<br><RazvanM> ACTION is off to bed. Tomorrow he needs to be early at school.<br><MaZe> then submit io<br><flips> bio is... an acquired taste<br>

<MaZe> then sleep on wq<br><flips> acquired ore<br><flips> acquired lore<br><MaZe> in endio wake wq<br><MaZe> more like acquired love<br><flips> exactly<br><flips> probably using the "wake" function<br>

<MaZe> that sounds awesome<br><MaZe> and either wake or wakeall likely<br><MaZe> here wakeall being more appropriate<br><flips> usually wake<br><flips> no need for a thundering herd<br><flips> of course you know there is only one waiter<br>

<flips> there better not be more, or something else broke<br><MaZe> well, but in general, since the op is complete - I should wake all<br><MaZe> interesting question then is how to dealloc the wq<br><MaZe> must be some put_wq in the waiters<br>

<MaZe> which on last dec to zero does free<br><flips> next move for me is to drop over to whole foods to pick up some munchies<br><flips> I only have a few more days left as a bachelor<br><flips> before the girls get back ;)<br>

<shapor> flips: hah thats where i was instead of class<br><flips> at which time I'm afraid my checking rate will drop somewhat<br><MaZe> linux/wait.h<br><shapor> didn't think it'd be so early<br>

<flips> checkin<br><flips> shapor, 8 pm tue and thur<br><flips> hmm<br><flips> looks like it's too late for whole food<br><flips> unless I really run<br><flips> don't feel like really running<br>

<flips> maybe it's 3rd street for dinner tonight<br><MaZe> so i need to make a dynamic wq, init with init_waitqueue_head()<br><flips> yes, and there are various convenience wrappers<br><flips> best is to write it on the metal the first time<br>

<MaZe> #define wake_up_all(x)                  __wake_up(x, TASK_NORMAL, 0, NULL) <br><flips> well if I don't go shopping there will be no coffee for breakfast<br><MaZe> seems to be the way to wake<br>

<flips> so I'm gone...<br>

</div>