[Tux3] Tux3-University - little generic_write, mostly bio transfer

Jonas Fietz info at jonasfietz.de
Sun Sep 21 12:30:55 PDT 2008






05:03 < flips> we started 3 minutes ago
05:03 < flips> no maze
05:04 < flips> so we will take a slight change in session plan
05:04 < flips> instead of doing bio transfers we will continue drilling
down into generic_write
05:05 < flips> ok, somebody summarize where we got to, please... mention
_2copy
05:06  * flips looks at RazvanM
05:06 < RazvanM> http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2063
05:06 < flips> and the summary?
05:07 < RazvanM> and we got there from here:
http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2319
05:07 < RazvanM> the 2copy is used when there is no support for write_begin
05:08 < flips> what is happening in this function?
05:08 < RazvanM> and we use prepare_write and commit_write
05:09 < RazvanM> the data is moved to some kernel pages and then to some
user memory? :P
05:09 <@shapor> hi all
05:09 < flips> hi
05:09  * shapor takes a seat at the back of the room
05:09 < flips> the data is moved from user memory onto buffer pages
05:09 < flips> then the buffer pages are committed to disk
05:10 < RazvanM> sorry... I got the order wrong :P
05:10 < flips> 2copy is the lamest name anybody could have possibly
chosen :p
05:10 < flips> appears to be the real thing though
05:10 < flips> just where we should be reading
05:11 < flips> __grab_cache_page is the heart of it
05:11 < flips> other things are decoration
05:11 < flips> such as fault_in_readable
05:12 < RazvanM> just a quick q: why do some functions start with uppercase?
05:12 < flips> attempts to deal with the many dangerous recursions
05:12 < flips> with varying degrees of success in terms of robustness
and readability
05:12 < flips> razvanm, random hackers
05:12 <@shapor> what is write_begin?
05:12 < flips> sometimes have studly caps days
05:12 < MaZe> hey
05:13 < flips> write_begin is a hook for some specialized user I don't
know about
05:13 < flips> "completely general interface used in exactly one place"
like as not
05:13 < flips> or "homework for shapor"
05:13 < flips> hey maze
05:13 <@shapor> :)
05:13 <@shapor> ok
05:13 < flips> ok, we can return to the original session plan
05:14 < flips> maze, the plan is for you to report your findings on
basic bio transfers
05:14 < MaZe> lol
05:14 < flips> point to code (you might want to pastie it)
05:14 < MaZe> uhm, lol
05:14 < MaZe> how about I put a tar.gz up?
05:14 < flips> don't copy in the channel unless it's 1/2 lines
05:14 < flips> that too
05:14 < flips> pastie is good, use your taste
05:15 < flips> if you had it checked in you could point at urls
05:15 < flips> so... remember to check in next time ;)
05:15 < MaZe> uploading
05:15 < flips> since your code is so short I'd suggest just pasting the
whole thing
05:16 < MaZe> http://m.a.z.e.pl/junkfs.tar.gz
05:16 <@shapor> lol nice domain!
05:16 < flips> really
05:16 < flips> leet
05:16 < MaZe> yeah, I own z.e.pl
05:17 < MaZe> so I also have m.a at z.e.pl
05:17 <@shapor> heh
05:17 < flips> "opened with ark"
05:17 < MaZe> or m at z.e.pl - whichever you prefer
05:17 < flips> ok, who has got the code open, and who not?
05:17 < MaZe> me not
05:17 < MaZe> ok, got it open
05:18 < flips> ark works pretty fscking well
05:18 < flips> I'm impressed
05:18 < MaZe> mind you - this is very rough, and mostly was debugging
plus getting it working
05:18 < MaZe> I'm still not quite sure of everything, and although I
fixed the last hang bug I found
05:18 < MaZe> I haven't since tested
05:18 < MaZe> so I'm not sure ;-)
05:18 < flips> don't worry, shapor will hurt you if you get anything wrong
05:18 < MaZe> lol
05:19  * shapor wields axe
05:19 < flips> so... where does the bio read setup start?
05:19 < MaZe> do you want me answering?
05:20 < flips> yes
05:20 < flips> you should have been asking ;)
05:20 < MaZe> hmm.
05:20 < MaZe> right
05:20 < MaZe> so pretty much everything except super.c is either
makefile or debug
05:20 < flips> noticed
05:21 < MaZe> and the bottom of super.c is pretty standard module init stuff
05:21 < flips> nicely lindented
05:21 < flips> for the moment we only care about the bio transfer
05:21 < MaZe> and above that is the standard fs registering and fs_ops stuff
05:21 < MaZe> and from there we get to junkfs_get_sb which calls into
get_sb_bdev
05:22 < MaZe> which calls junkfs_fill_super as a callback
05:22 < MaZe> and that's where all the action is
05:22 < flips> action :)
05:22 < MaZe> get_sb_bdev also exclusively opens the block device for
us, so that's nice
05:22 < flips> finally, after 4 days of tux3 U
05:22 < MaZe> at the point we enter into junkfs_fill_super, we have an
exclusively opened block device
05:22 < MaZe> which is passed in the superblock
05:23 < MaZe> sb->s_bdev
05:23 < MaZe> in junkfs_fill_super we then proceed to allocate memory
for 3 basic objects
05:23 < MaZe> 1) memory to read in the 512 byte (SB_SIZE) superblock
05:23 < flips> 1 sector sb, leet
05:23 < MaZe> 2) an object to store state (in the bio->b_private field)
05:24 < MaZe> 3) a bio
05:24 < MaZe> 1 and 2 are just normal kmalloc's
05:24 < MaZe> 3 is via bio_alloc
05:24 < MaZe> thus 1 and 2 will need to be kfree'd
05:24 -!- Bushman [~marcin at c-76-23-106-132.hsd1.sc.comcast.net] has
joined #tux3
05:24 < MaZe> and 3 will need to be bio_put'ed at some point before the
end of junkfs_fill_super
05:24 < MaZe> or we'll leak
05:24 < MaZe> anyway, standard handling of error returns on all the allocs
05:25 < MaZe> and we get to:
05:25 < MaZe> bio->bi_bdev = sb->s_bdev;
05:25 < MaZe> bio->bi_sector = 0; // first sector
05:25 < MaZe> s = bio_add_page(bio, virt_to_page(buf), SB_SIZE,
offset_in_page(buf));
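[Pieced together, the read setup being walked through here looks roughly like the sketch below. This is kernel-module code against a 2.6.26-era API and is not runnable standalone; junkfs_read_sector0 is a made-up name, end_io_read and SB_SIZE come from the junkfs tarball, and error handling plus the state struct are trimmed.]

```c
static int junkfs_read_sector0(struct super_block *sb, void *buf, void *state)
{
	struct bio *bio = bio_alloc(GFP_KERNEL, 1);	/* room for one bvec */

	if (!bio)
		return -ENOMEM;
	bio->bi_bdev = sb->s_bdev;	/* device get_sb_bdev opened for us */
	bio->bi_sector = 0;		/* sector 0; sectors are 512 bytes */
	if (bio_add_page(bio, virt_to_page(buf), SB_SIZE,
			 offset_in_page(buf)) != SB_SIZE) {
		bio_put(bio);		/* drop our reference, or we leak */
		return -EIO;
	}
	bio->bi_end_io = end_io_read;	/* fires in interrupt context */
	bio->bi_private = state;	/* recovered in the endio handler */
	submit_bio(READ, bio);		/* asynchronous from here on */
	return 0;			/* caller waits on its waitqueue */
}
```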
05:25 < MaZe> which is most of the bio preparation stage
05:25 <@shapor> Bushman: hi Marcin
05:25 < flips> the real meat
05:25 < MaZe> we set the bio to refer to the correct block device
05:25 < flips> marcin, hi
05:25 < MaZe> and (for now - this is all junkfs ;-) ) we just read the
first sector
05:25 < MaZe> sectors in new linux are always exactly 512 bytes
05:25 < flips> that's leet nuff for us
05:26 < MaZe> so we're saying here offset 0 * 512 into the block dev
05:26 < MaZe> then we need to tell the bio where to store the data
05:26 < MaZe> (or read from, since a write would be identical)
05:26 < flips> right, struct bio is sector-addressed for no good reason
05:26 < MaZe> s = bio_add_page(bio, virt_to_page(buf), SB_SIZE,
offset_in_page(buf))
05:26 < Bushman> hello Daniel
05:27 < MaZe> this actually gives our carefully allocated memory to the
bio as memory
05:27 < flips> bushman, enjoy ;)
05:27 < MaZe> note that bio_add_page takes (bio, struct page*, len, ofs)
05:27 < Bushman> i dunno if enjoy is the right word for kernel code just
before bedtime ;)
05:27 < MaZe> so we pass in the bio, then convert the buf's address to a
page via virt_to_page
05:27 < flips> and you could write it out in full in about as much code
as the function call takes
05:27 < MaZe> pass the length of the block
05:28 < MaZe> and calc the offset from the page struct for the ofs via
offset_in_page
05:28 < flips> bushman, then just enjoy the geek banter
05:28 <@shapor> virt_to_page?
05:28 < MaZe> I'm assuming at this point that a kmalloc can't give us
memory split across pages
05:28 < MaZe> - not sure if this is correct
05:28 < flips> shapor, great question
05:28 < flips> maze, correct
05:28 < MaZe> so buf was kmalloc'ed, so it's a virtual kernel memory address
05:29 < flips> maze, unless the kmalloc is bigger than a page
05:29 < MaZe> virt_to_page gives us the struct page * for the kaddr we
pass to it
05:29 < MaZe> [flips: of course]
05:29 < flips> maze, and why do we need the struct page?
05:29 < MaZe> because that's what bios want
05:29 < MaZe> if you look at what a bio is
05:29 < MaZe> it's 3 things
05:29 < MaZe> the struct bio
05:30 < MaZe> which has a lot of management fields
05:30 < MaZe> the bvec which
05:30 < MaZe> is an array of a tiny struct with 3 fields
05:30 < MaZe> { struct page * p; int len; int ofs; }
05:30 < MaZe> so basically a list of where to put the next len bytes,
specifying memory via page/ofs pairs
05:31 < MaZe> this is for two reasons:
05:31 < MaZe> [at least as far as i can tell]
05:31 < MaZe> a) most hw (ie. stuff the blockdevice drivers care about)
05:31 < MaZe> cares about physicall addresses and not virtual kernel
addresses
05:31 < flips> right
05:31 < MaZe> ie. for dma and all that good for performance goodness
05:31 < MaZe> b) this can also be used for data xfr into userspace
05:32 < MaZe> and there is no guarantee userspace memory has a mapping
into kernel space
05:32 < MaZe> [high mem]
05:32 < flips> the big reason: scatter gather
05:32 < flips> this is a dma interface in disguise
05:32 < flips> very effective one
05:32 < MaZe> this also makes it easier to coalesce physically
neighboring memory together into the bvecs
05:32 < MaZe> precisely
05:32 < flips> right, another way of saying scatter gather
05:33 < MaZe> notice that in bio_alloc
05:33 < MaZe> we passed in a 1
05:33 < MaZe> that 1 is the number of bvecs in the bvec area allocated
to the bio
05:33 < MaZe> so that limits how many non-contig pieces of memory we can
have in the bio
05:33 <@shapor> ah
05:33 < MaZe> here - all we need is 1
05:33 < flips> and because you did that, you could have initialized your
one bvec with a simple structure assignment
05:33 < flips> instead of the function call
05:33 < MaZe> right.
05:33 < flips> which does a bunch of stuff you don't need
05:34 < MaZe> oh well.
05:34 < RazvanM> does a bio_vec describe exactly one page?
05:34 < flips> maze, exactly
05:34 < MaZe> no
05:34 < RazvanM> bv_len
05:34 < MaZe> it describes a start page with offset and a length
05:34 < MaZe> the length may exceed that page and cross into however
many next ones
05:34 < MaZe> the precise rules for merging are overridable
05:34 < flips> it describes a data region that resides within one page
05:34 <@shapor> so the bio interface will be quite good for extents
05:35 < MaZe> many device drivers have limits on how many sectors they
can transfer in one go (ie. 200 or so)
05:35 < flips> maze, you can't cross a page with a bvec
05:35 < MaZe> flips, you sure?
05:35 < flips> sadly, or  perhaps sanely
05:35 < MaZe> I certainly ain't ;-)
05:35 < flips> pretty sure
05:35 < MaZe> but then I don't know what I'm talking about here
05:36 < flips> never seen it done ;)
05:36 < MaZe> these are still all guesses
05:36 < Bushman> pollacks ain't sane, just ask Shap
05:36 < MaZe> I thought they merged by themselves
05:36 < MaZe> hmm, well, first homework I'd guess
05:36 < RazvanM> one more q: bv_len is counting bytes or sectors? :P
05:36 < flips> merging happens in the physical driver
05:37 < flips> good question
05:37 < MaZe> anyway bio_add_page returns how much it successfully added
(or what the current total is, not sure) in bytes
05:37 < flips> bytes I think
05:37 < MaZe> so if everything is good it should be 512 at this point
05:37 < MaZe> hence the check
05:37 < flips> it's pretty badly braindamaged in that respect, counting
in different units for no good reason
05:37 < MaZe> if it doesn't match, we've got a problem - which mind you
- - AFAICT - can't happen
05:37 < MaZe> and we bio_put to free the structure and basically error out
05:38 < MaZe> [of course here we always error out, because this is
junkfs (tm)]
05:38 < MaZe> anyway if s==512 then we're good
05:38 < flips> oh bv_len is definitely bytes
05:38 < MaZe> we set up two more fields in the bio
05:38 < MaZe> bi_end_io is the call back for when the bio is processed
(or errors out)
05:39 < flips> when the disk completion interrupt fires
05:39 < flips> key point
05:39 < MaZe> bi_private is a pointer to our data (the mz struct) so
that we can figure out what we're talking about in the endio handler
05:39 < MaZe> and then we submit the bio for READ
05:39 < MaZe> now this (ie. bios) are inherently asynchronous
05:39 < MaZe> so at this point it might have already completed - it
could have been cached and come back immediately
05:39 < flips> right... it's the _only_ way to recover a memory context
for a completed bio
05:40 < MaZe> [I think]
05:40 < MaZe> or we might need to wait some indeterminate amount of time
05:40 < flips> it's much more direct than that
05:40 < MaZe> here's where we make use of the waitqueue which we
helpfully placed in the mz struct
05:40 < flips> disk raises interrupt -> endio gets called
05:40 < flips> in interrupt context
05:40 < flips> this is as on the metal as you will get without going
hypervisor
05:41 < MaZe> oh, so basically end_io should do as little as feasibly
possible
05:41 < MaZe> preferably as simple as it is here
05:41 < flips> yes
05:41 < flips> again yes
05:41 <@shapor> is it the right place to call bio_put ?
05:41 < flips> though I often get excessive there ;)
05:41 < MaZe> anyway, earlier on, we'd already initialized the
waitqueue, so now we can just wait on it
05:41 <@shapor> in the endio handler?
05:41 < MaZe> except wait needs not only a waitqueue (wq) but also a
condition
05:42 < MaZe> [which it checks _first_]
05:42 < flips> maze, _interruptible?
05:42 < MaZe> hence mz struct also contains a boolean
05:42 < MaZe> flips: yeah, no idea what the right choice is there,
meaning to ask about this
05:42 < flips> shapor, yes
05:42 < flips> very important question
05:42 < Bushman> flips, so how would it behave in a hypervisor?  any
changes?  does it lose determinism?
05:42 <@shapor> why does it matter?
05:42 < flips> if interruptible, you better be prepared to field
anything that can be thrown at you
05:43 < flips> if uninterruptible, you'd better be able to prove it
always completes
05:43 < Bushman> is that the basis for atomicity then?
05:43 < MaZe> so what could get thrown at us, and will the bio always
complete?
05:43 <@shapor> flips: what happens if there is an error
05:43 < flips> bushman, we don't touch hypervisors
05:43 <@shapor> disk io error or something
05:43 < flips> if we did, it would be to implement hard realtime or
something
05:43 < MaZe> hypervisors should be transparent to the os
05:43 <@shapor> does the endio handler get called?
05:44 < MaZe> yes endio has err parameter
05:44 < flips> bushman, there is some sense of atomicity here in the
interruptible/noninterrupble distinction
05:44 < flips> loose sense
05:44 < MaZe> just to finish off this (junkfs_fill_super) function, we
then dump the superblock via printk and free everything and return an
error (junkfs, remember?)
05:44 < flips> maze, in kernel interrupts don't just happen, you have to
ask for them
05:44 < MaZe> even with preemption
05:44 < MaZe> ?
05:45 < flips> or they get fielded on syscall exit
05:45 < Bushman> SHOULD be transparent, but since most of them mangle
time into something nonlinear, doesn't it screw up our predictions of
when an interrupt is gonna finish?
05:45 < flips> task switch is not interrupt
05:45 < flips> it's caused by an interrupt
05:45 <@shapor> oh i see you just aren't checking the err parameter in
end_io_read
05:45 < flips> you can get a task switch even with wait_uninterruptible
05:45 <@shapor> probably should ;)
05:45 < MaZe> so while in kernel space, my thread of execution is
guaranteed not to get interrupted by anything?
05:45 < MaZe> right I should ;-)
05:45 < flips> all that means is, an interrupt won't cause the wait to
bail early
05:46 < flips> you have to wrap your interruptible wait in a loop
05:46 < flips> or write uninterruptible
05:46 < MaZe> so interruptible here refers to what? can be interrupted
by killing the mount process?
05:46 < flips> which is probably what you want here
05:46 < flips> just means the wait may bail before the wak
05:46 < flips> wake
05:47 < flips> so has to be in a loop, and you can't assume that what
you were waiting for actually happened
05:47 < Bushman> so i guess the big question here is how do we guarantee
that the write is gonna complete?
05:47 < MaZe> so I'd want uninterruptible? or interruptible and then on
some interrupts somehow cancel and free the bio
05:47 < flips> just write uninterruptible until you know kernel
scheduling better ;)
05:47 < MaZe> (read here)
05:47 <@shapor> uninterruptable will cause it to be D too iirc
05:47 < flips> bushman, it always completes
05:47 <@shapor> D state
05:47 < flips> with or without an error
05:47 <@shapor> Bushman: it may complete with an error
05:48 <@shapor> which gets passed to the endio handler
05:48 < flips> yes, this is d state, the real thing
05:48 < MaZe> which as written ignores all errors, and just marks the io
as completed, frees the bio, and wakes the wq
05:48 <@shapor> interruptable is not quite so severe i guess
05:48 < flips> you are in d state any time you're waiting in kernel
05:48 <@shapor> even interruptable?
05:48 < flips> yes
05:48 < MaZe> unless you're doing wait_interruptible?
05:49 < flips> hmm
05:49 <@shapor> flips: didn't we find that not to be the case
05:49 <@shapor> with ddsnap
05:49 < flips> even then I think
05:49 < MaZe> hmm, so how could I get this to be abortable, in case for
example the block device hangs on network?
05:49 <@shapor> remember our threads were all D state
05:49 < flips> you get a qualifier on your ps output
05:49 <@shapor> until we changed it to interruptable
05:49 < flips> maze, that's not your job, it's the job of the device
insert/remove
05:50 < flips> which of course means it's badly mismanaged ;)
05:50 < flips> but...
05:50 < flips> not your problem for now
05:50 < MaZe> well what if we're running this off of a nbd or something
like that, and the network gets pulled
05:50 < MaZe> would the bio then just (eventually) return with an error
to endio?
05:50 < flips> that's nbd's problem
05:50 < flips> again not yours
05:51 < flips> you can try to do timeouts and things, but you're risking
redundancy
05:51 < flips> and confusion
05:51 < MaZe> right
05:51 <@shapor> risking redundancy ?
05:51 < flips> duplicating functionality that is better performed at
some other layer
05:52 < flips> constant risk with the blind leading the blind ;)
05:52 <@shapor> yeah
05:52 <@shapor> good point
05:52 <@shapor> but the blind leading the deaf is ok
05:52 < flips> maze, that was a great walkthrough, and the code is great too
05:52 <@shapor> yes!
05:52 < flips> not perfect, but you don't need that to be great in linux ;)
05:52 < MaZe> I stil don't quite understand a bunch of it
05:52 <@shapor> MaZe: thanks, i was following closely with little time
to type
05:52 < flips> a few warts make it more real, like a european movie
05:53 <@shapor> hah
05:53  * Bushman rolls eyeballs
05:53 < MaZe> lol
05:53 < flips> maze, I am going to cut and paste your code into
fs/tux3/super.c
05:53 < flips> and tux3 is going to read a leet sector sized sb too
05:53 <@shapor> heh
05:54 <@shapor> s/junkfs/tux3/
05:54 < MaZe> hehe
05:54 < flips> exactly
05:54 < flips> or s/tux3/junkfs/
05:54 < flips> depending on leetness or lack of it
05:54 <@shapor> so it seems silly for every fs to have to do this
05:54 <@shapor> is the vfs totally useless?
05:54 < flips> yes
05:54 < flips> pretty much
05:54 < MaZe> what I still haven't found is how to specify the io
priority of the bio you submit
05:54 < flips> pretty close
05:54 < flips> not completely
05:55 < flips> lame but not useless
05:55 < flips> better than NT
05:55 < MaZe> I'm assuming it inherits from the ionice'ness of the
process in whose context you're running
05:55 < flips> maze, completely separate
05:55 < flips> it's part of the elevator abstraction
05:55 <@shapor> oh?
05:56 < MaZe> huh?
05:56 <@shapor> i was wondering that too
05:56 < flips> inheriting anything is completely a property of the
elevator plugin
05:56 < MaZe> shouldn't submitting a read/write request to a blockdevice
be exactly when this matters?
05:56 < flips> see "request queue"
05:56 < MaZe> oh, the mysterious q parameter
05:56 < flips> one of the harder code reading projects in kernel
05:56 < flips> it's a mess
05:56 < MaZe> I saw all over the place
05:56 < MaZe> that is apparently a field in the bio struct
05:57 < flips> q is a carpet under which all kinds of doggie poo is swept
05:57 < flips> it's really a bag tied onto the side of the bio
05:57 < flips> we'll get rid of it before next christmas
05:57 < flips> I hope
05:57 < MaZe> I just want a nice aio read/write with priority interface
for my coding
05:57 < flips> you got it
05:57 < flips> already
05:58 < flips> well s/nice/nicer than what we had before/
05:58 <@shapor> that would be a good project.. a new aio interface
05:58 < MaZe> right, I have the aio rw
05:58 <@shapor> sounds like it should map easily enough....
05:58 < flips> bio transfer is aio at its purest
05:58 <@shapor> yeah
05:58 < MaZe> right, but you want prioritization in there
05:58 <@shapor> should be easier than non aio realy
05:58 < MaZe> and that's what I'm failing to see
05:58 < flips> maze, in the elevator
05:58 < Bushman> 'scuze my newbness, but wouldn't priority be at odds
with the queuing that the controllers try to do?
05:58 < MaZe> so does the bio go through the elevator?
05:59 < flips> bushman, interactions, yes
05:59 < flips> not all good
05:59 < MaZe> well, you want something htb like for io
05:59 < flips> best to try and harmonize with them
05:59 < MaZe> wait a minute, what's the layering here?
05:59 < MaZe> is the physical hw under the elevator under the bio
05:59 < flips> vfs <-> bio <-> driver
06:00 < MaZe> and where's the elevator?
06:00 <@shapor> between bio and driver
06:00 < flips> vfs <-> bio <-> elevator <-> driver
06:00 <@shapor> right?
06:00 < MaZe> vfs <-> bio <-> elevator <-> driver
06:00 < MaZe> ?
06:00 < flips> heh
06:00 <@shapor> heh
06:00 < flips> exactly
06:00 < MaZe> so by choosing the request queue in the bio, I choose
priority of the request with regards to other requests?
06:00 < flips> and the presence/lack of the elevator is up to the driver
or virtual driver even
06:01 < flips> so the elevator can appear at multiple or no places in
the stack
06:01 <@shapor> so the elevator messes with fields in the bios?
06:01 < MaZe> is this screwy? or is this just me...?
06:01 < flips> and vice versa in an idiotic way... sometimes useful way
06:01 < flips> maze, it's screwy
06:01 < flips> not just you
06:01 < flips> but better than we had in 2.4
06:02 < flips> it's damn fast actually, compared to a disk
06:02 < flips> we didn't have that a few years ago
06:02 < flips> now it's looking slow again
06:02 < flips> and people are asking me to fix it
06:02 < flips> it shall be done
06:02 < MaZe> wait a minute - what is slow?
06:03 < MaZe> the interfaces / kernel code?
06:03 < flips> this who kooky chain
06:03 < flips> whole
06:03 < flips> vfs <-> bio <-> elevator <-> driver
06:03 < flips> layering is right
06:03 < flips> implementation is faulty
06:03 < MaZe> agreed
06:04 < flips> anyway
06:04 < flips> we're using the existing one for now
06:04 < flips> it will work for tux3 as well as it works for anybody
06:04 < flips> better, because we will use it more directly
06:04 < flips> and have fewer strange waits and so on
06:04 < MaZe> right
06:04 < flips> and when we do see a strange wait, we will be able to
pounce on it
06:04 < MaZe> that's why I wanted to go all the way down to the bio on
the sb read
06:04 < MaZe> a) for practice
06:05 < MaZe> b) because it's the way it should be done
06:05 < flips> unlike if you use the... odd... vfs block io helpers
06:05 < flips> well I think we are going to stay all the way down here
for tux3
06:05 < flips> tux3 has no use asking other subsystems to submit bios on
its behalf, unless that subsystem is an lvm
06:06 < flips> and even then, we just submit a bio to the lvm without
caring its not a real device
06:06 < MaZe> still have to figure out how to do mmap like stuff (ie.
trigger read in, on page fault, or write out, both for kernel and
userspace, and cow, etc)
06:06 < flips> maze, handled for you
06:06 < flips> like magic
06:06 < MaZe> cool - assuming it does the right thing (tm)
06:06 < flips> see filemap.c -> nopage
06:06 < flips> kinda right
06:06 < flips> some messed locking
06:06 < MaZe> which I'm not sure it does for cache coherency netfs
06:07 < flips> bottlenecks on i_mutex during fault in
06:07 < flips> bad
06:07 < MaZe> so it probably needs to be gone through with a fine comb then
06:07 < flips> even nfs is cache coherent/consistent with respect to mmap
06:07 < MaZe> as I was expecting
06:07 < flips> yes
06:07 < flips> right in to the danger zone
06:08 < flips> speaking of which
06:08 <@shapor> what bottlenecks on i_mutex?
06:08 < flips> time to turn on the ghetto blaster
06:08 < flips> and get back to coding
06:08 < MaZe> I'm assuming the code in filemap.c which deals with
page-in/outs of mmapped pages
06:08 < MaZe> oh, right it's already 10 past 9
06:08 < MaZe> so is that it for this time?
06:08  * flips puts on Holst's the planets, performed by korean rock band
06:08  * shapor scrolls back to remember his homework
06:09 < flips> that's it, nice one maze
06:09 < Bushman> is anybody sticking around to ask lame(er) questions?
06:09 < flips> next time it will be razvanm's turn
06:09 < RazvanM> :P
06:09 < MaZe> oh, awesome, what's he doing?
06:09 < flips> to explain some more of _2copy
06:09 < MaZe> ah

-- begin of open questions

06:09 < flips> lame question period is officially open
06:10 < flips> intelligent questions banned
06:10 < Bushman> what's an elevator?
06:10  * RazvanM doesn't have anything to ask this time
06:10 < flips> a kernel elevator
06:10 < MaZe> when you read/write data to a hard disk
06:10 < flips> otherwise you're going to get some dumb jokes
06:10 < MaZe> which is a spinning platter with a seeking head
06:10 <@shapor> elevator = io scheduler
06:10 < MaZe> then depending on the order you send out requests
06:10 < tim_dimm_> just caught up
06:10 < MaZe> you may need to do a small or large number of seeks
06:10 < tim_dimm_> like tivo for geeks
06:10 < flips> yup, and its algorithms are the same as a busy elevator
in a skyscraper
06:10 < MaZe> seeks are very expensive
06:11 < MaZe> so you try to minimize seeks
06:11 < MaZe> for good performance (b/w), but higher latency
06:11 < tim_dimm_> so are tlb misses
06:11 < flips> and page cache misses
06:11 < MaZe> you basically scan the disk from top to bottom, doing reads
and writes at increasing lba addresses
06:11 < MaZe> regardless of the order they were submitted in
06:11 < MaZe> then do the same thing going downwards
06:12 < flips> somewhat downwards
06:12 < Bushman> ok great, but from this level, can we be aware of what
media we're writing to so we don't make it overinvolved in cases where
it doesn't matter, like solid state disks?
06:12 < MaZe> right
06:12 < flips> the disk doesn't like going backwards as much as forwards
06:12 < MaZe> the consecutive read/write sectors are still upwards
06:12 <@shapor> Bushman: you can pick an io scheduler on a
per-block-device basis
06:12 < MaZe> and sometimes you skip the backwards step entirely
06:12 < MaZe> depends
06:12 < flips> bushman, mostly we don't care, where we do care we care a lot
06:12 < MaZe> lots of fine tuning required to get optimal performance
06:12 < MaZe> and it heavily depends on usecases
06:13 <@shapor>  /sys/block/sda/queue/scheduler
06:13 < Bushman> as long as it's adjustable from userspace i'm good ;)
06:13 < MaZe> plus you can throw in individual io priorities into the
mix (ie. reading this sector is more important)
06:13 < flips> we try to design for whole classes of usecases, rather
than one at a time
06:13 < MaZe> and b/w per job, and hard read/write deadlines, etc
06:13 < MaZe> and it all gets complex
06:13 <@shapor>
http://friedcpu.wordpress.com/2007/07/17/why-arent-you-using-ionice-yet/
06:13 < Bushman> shapor, nice, i havent gotten used to the new linux,
i've been bsd'ing since '03
06:13 <@shapor> i only recently discovered ionice
06:13 < MaZe> and the elevator is the piece of code which gets requests
thrown at it
06:13 <@shapor> i think mentioned on here
06:14 < MaZe> does some algo mumbo jumbo to put them in the 'best' order
06:14 < flips> shapor, because it doesn't work that well?
06:14 < MaZe> and throws them at the disk
06:14 <@shapor> flips: yes but the interface is there
06:14 <@shapor> if people use it they can report bugs
06:14 < flips> sure
06:14 <@shapor> if people dont report bugs or say it sucks on lkml it
wont get fixed
06:14 <@shapor> same problem with posix_fadvise
06:14 < MaZe> note that for a network nic
06:14 < flips> we will take it for a spin at some point
06:14 < MaZe> you have a certain amount of b/w
06:14 < flips> maze will ;)
06:15 < MaZe> and it's all pretty easy - conceptually
06:15 < flips> and shapor will make some nice charts of the event logs
06:15 < flips> vfs + bio events
06:15 <@shapor> oh i almost forgot about that
06:15 < MaZe> sending each packet involves a fixed amount of headroom
(header fields), the packet itself, and a fixed footer
06:15 <@shapor> still no clue how to glue those together
06:15 < MaZe> so when you send a packet you know exactly how much of the
nic (ie. for how long) you're using it up
06:15 < MaZe> thus you can make very nice guarantees
06:16 < MaZe> and this is what htb + sfq does for networking
06:16 <@shapor> htb? sfq?
06:16 < MaZe> you can partition your network card pretty much
arbitrarily between different apps
06:16 < MaZe> giving different apps different priorities, then different
priorities different amounts of bw
06:16 < MaZe> and the priorities don't need to be strictly linear either
06:16 < flips> htb? sfq?
06:16 < MaZe> htb
06:16 < Bushman> oh could i get in on the testing?  i've done a lot of
work visualizing sequences of events in temporal OSPF loops, this should
be something i could do ;)
06:16 < MaZe> htb is basically a tree structure
06:17 < MaZe> the nodes are where requests come in
06:17 < flips> what's the tla mean?
06:17 < MaZe> the root is where requests come out
06:17 < MaZe> so each application (or tcp stream, or whatever you're
using) gets assigned to a leaf node in this tree
06:17 < flips> (Stochastic Fairness Queueing)
06:17 < MaZe> and the network driver then (when it wants to send) always
pulls from the root
06:18 < flips> gah
06:18 < MaZe> each node in this tree has a certain speed of accumulating
tokens
06:18 < MaZe> (htb = hierarchical token buckets)
06:18 < MaZe> that it accumulates in the bucket in that node
06:18 < Bushman> wouldn't the stochastic approach mean that every client
is equally unhappy? ;)
06:19 < MaZe> Bushman: sfq is used in the leafs to randomly select
between clients / tcp streams you consider equivalent
06:19 < MaZe> you hang an sfq off of each leaf node in htb, so you
actually throw the packets at the correct sfq, and the htb leaf pulls it
from the attached sfq
06:19 < flips> network peeps are always reinventing the world ;)
06:19 < Bushman> ah, so you use the hiarchical token buckets to assign
different classes of service to different apps/streams?
06:19 < MaZe> anyway, you divide up each node's bandwidth among its children
06:20 < MaZe> and then define how and when they can borrow/lend tokens
to each other
06:20 < MaZe> I'm not doing a very good job of defining it here
06:20 < MaZe> but it's wicked!
06:20 < tim_dimm_> no- you're doing a great job
06:20 < flips> maze, I'm getting the idea
06:20 < tim_dimm_> sounds wicked
06:20 < Bushman> yea i just did a project with filtering/limiting at
work, so i'm getting it
06:21 < Bushman> it sounds a lot smarter than it is ;)
06:21 < flips> well, disk layer doesn't have any such pretentions to
sophistication
06:21 < flips> yet
06:21 < tim_dimm_> heh
06:21 < Bushman> damn academics justifying their existence
06:21 < MaZe> anyway, basically htb + sfq is the best I've seen for
networking, and would probably be awesome for other stuff as well like
scheduling cpus
06:21 < flips> I can imagine the mess if it did
06:21 <@shapor> Bushman: gee filtering and limiting, i wouldn't have
guessed :P
06:21 < MaZe> except it's probably too compute intensive for that and
can't take cache-heat or memory nearness into account
06:21 < Bushman> shapor:  stfu ;)
06:21 < flips> :)
06:22 < MaZe> anyway, with disk it gets tougher
06:22 < tim_dimm_> if it did, could be interesting as a cache coherency
protocal
06:22 < MaZe> because you can't just up and calculate how long a
particular operation will take
06:22 < flips> network peeps always trying to find the must obscrue TLA
06:22 <@shapor> Bushman: don't you guys use bullets for limiting ? :P
06:22 < flips> mot <- most obscure tla
06:22 <@shapor> haha
06:22 < MaZe> (with the nic, you know its line rate, you know how many
bytes you're sending, the size of the pre and post-amble, the wait between
packets; you thus know the _entire_ cost of sending any given packet)
06:22 < Bushman> dont make me whip out stories about invalidating keys
with thermite granades
06:22 < tim_dimm_> motley cru
06:23 < MaZe> tla?
06:23 < MaZe> mot?
06:23 < flips> maze, and you don't know how much carrier sense backoff is
going to cost ;)
06:23 <@shapor> most obscure three letter acronym
06:23 < MaZe> ah, so you use the hiarchical token buckets to assign
different classes of service to different apps/streams? - precisely
06:23 < flips> and that's where your pretentions to realtime control
come crashing down
06:23 <@shapor> which is a fla
06:24 <@shapor> which is a tla
06:24 <@shapor> which is a tla
06:24 < flips> third time lucky
06:24 < MaZe> for example I would give each user in my network their own
sfq for local traffic to another nic (just switching) to another network
via wireless and to the internet (via the same wireless)
06:24 < Bushman> to make delivery time guaranteed, wouldn't you have to
have a full preempt kernel?  (oh i miss 80ties Amigas)
06:24  * flips thinks of some keys he'd like invalidated
06:24 < MaZe> and then use htb to make sure everything was fair on the
slow internet link, and on the others at the same time - worked awesome
06:25 < MaZe> be right back in 10.


