04:55 < flips> who's here and awake?
04:56 < MaZe> I am ;-)
04:56 < flips> that's a quorum
04:56 < MaZe> mmm, sake and sushi...
04:57 < flips> was good
04:57 < flips> well
04:57 < flips> I better heat up another one
04:58 * RazvanM is also awake
04:59 < flips> got a browser ready?
05:00 < MaZe> always...
05:00 < RazvanM> lxr? :P
05:00 < flips> http://lxr.linux.no/linux <- ok, open this
05:00 < flips> of course
05:00 < MaZe> as expected
05:00 -!- RalucaME [~ral@pool-151-196-118-156.balt.east.verizon.net] has joined #tux3
05:00 -!- nataliep [~nataliep@72.14.224.1] has joined #tux3
05:00 < flips> hi natalie
05:00 < nataliep> hi dan
05:00 < flips> maze, have you met natalie?
05:01 < flips> maze?
05:01 < MaZe> flips: http://lxr.linux.no/linux <- ok, open this
05:01 < MaZe> hmm?
05:01 < flips> http://lxr.linux.no/linux+v2.6.26.5/fs/open.c#L1106 <- let's start with sys_open
05:02 < flips> everybody see it?
05:02 < MaZe> yes - we should perhaps first ask though how many people are listening
05:02 * nataliep nods
05:02 * RazvanM nods too
05:02 < flips> 3 is fine with me
05:02 * MaZe nods sagely
05:02 < RalucaME> nods :)
05:02 < flips> it's logged anyway
05:03 < MaZe> true
05:03 < flips> ok, every syscall in linux starts with sys_
05:03 < flips> and continues with the name you get from man
05:03 < flips> so man 2 open
05:03 < flips> all it does is a little linkage
05:04 < flips> then the real action starts in do_sys_open
05:04 < flips> so let's go there by clicking on it
05:04 < flips> and click on the Function link
05:04 < flips> http://lxr.linux.no/linux+v2.6.26.5/fs/open.c#L1084
05:04 < MaZe> why isn't sys_open just a call to sys_openat(AT_FDCWD, ...)?
05:04 < flips> we're still in the same file
05:04 < flips> good question
05:05 < flips> ask al viro and add some epithet on the end ;-)
05:05 < MaZe> can sys_* never call sys_* ?
05:05 < MaZe> or is this something that could be cleaned up?
05:05 < flips> syscalls use a weird linkage
05:05 < flips> gcc and do it, but its odd
05:05 < flips> can do it
05:05 < MaZe> sys_creat calls sys_open
05:05 < MaZe> so it could probably be replaced
05:06 < flips> I often call sys_ functions from deep in the kernel
05:06 < flips> let me recant
05:06 < flips> syscalls _sometimes_ use a weird linkage
05:06 < flips> so yes you could nest them
05:06 < flips> al doesn't, for no reason I know
05:07 < flips> it's like "yuck, it's a nasty top level entry point"
05:07 < MaZe> "gcc and do it, but its odd" - what did you mean?
05:07 < flips> I was rambling
05:07 < MaZe> ;-)
05:07 < flips> the weird stuff happens before we even get there
05:07 < flips> in the syscall table
05:08 < MaZe> so by the time we hit sys_* we're in pure C land?
05:08 < flips> ok, so we're in a different address space than we were a nanosecond ago
05:08 < flips> yes
05:08 < flips> usually
05:08 < flips> some syscalls have strange register linkage
05:08 < flips> anyway the vfs doesn't much care about that
05:08 < flips> it gets away from syscall land as soon as it can
05:09 < flips> what we are going to see is a lot of messing around with user addresses
05:09 < flips> because a nanosecond ago or so, we were in processor ring 3
05:09 < flips> userspace
05:09 < flips> now we're in ring 0
05:09 < flips> different address space
05:09 < flips> kind of
05:10 < MaZe> different privilege level
05:10 < flips> that too
05:10 < flips> everything is a little different
05:10 < flips> kind of like the twilight zone
05:10 < flips> we're on the inside of the glass looking out now, like in that harry potter movie
05:10 < flips> ok
05:10 < flips> so we have to get the name for the open
05:10 < flips> it's in a different address space
05:11 < flips> so we do copy_from_user to get it
05:11 < flips> just looking for the def of getname
05:11 < flips> it's not very interesting actually
05:11 < flips> it stores the name on a full page of kernel memory
05:12 < flips> or it used to
05:12 < flips> now I see we use a kmem cache for it
05:12 < RazvanM> http://lxr.linux.no/linux+v2.6.26.5/fs/namei.c#L141
05:12 < flips> http://lxr.linux.no/linux+v2.6.26.5/include/linux/fs.h#L1615
05:12 < flips> thanks
05:12 < MaZe> and an audit hook
05:13 < flips> things change around in here fairly frequently
05:13 < flips> it's usually worth starting from the top in lxr every time
05:13 < flips> just so you can check for details that changed
05:13 < MaZe> by the top you mean all the way from sys_open or some other top?
05:13 < flips> that audit thingy is new
05:13 < flips> right
05:14 < flips> like I said, getname is boring
05:14 < flips> let's go back to do_sys_open
05:14 < MaZe> perhaps, another question: how much of a fs is driven by userspace triggered syscalls?
05:14 < flips> nearly all of it
05:14 < flips> particularly for traditional fs's
05:14 < flips> new ones tend to have some daemons helping
05:15 < flips> generally, the more daemons, the less reliable
05:15 < MaZe> which are effectively kernel threads doing syscalls?
05:15 < flips> not doing syscalls
05:15 < flips> using internal interfaces
05:15 < flips> using syscalls internally sucks, because of being in the wrong address space
05:15 < flips> the syscall expects to get its data from userspace
05:15 < MaZe> oh, right the copy_from_user stuff
05:15 < flips> right
05:16 < flips> anyway if you're using syscalls internally, something in linux is broken
05:16 < flips> or you're stupid
05:16 < MaZe> heh
05:16 < flips> about 50/50
05:16 < flips> the next interesting place is do_filp_open
05:17 < flips> lxr is a little funky indexing some of these
05:17 < MaZe> would do_filp_open be the kernel-internal interface to open?
05:18 < flips> factoring is a little arbitrary
05:18 < MaZe> [ie. would this be what you would call from above mentioned kernel threads/daemons?]
05:18 < flips> it's another helper that happens to do almost all the work
05:18 < flips> yes you can
05:18 < flips> if it's not static, then something is using it
05:18 < flips> often something bogus
05:18 < flips> something external that is
05:18 < flips> not static = part of kernel api
05:19 < flips> often unwisely ;-)
05:19 < flips> http://lxr.linux.no/linux+v2.6.26.5/fs/namei.c#L1761
05:19 < flips> I wish lxr was smarter about finding defs of extern functions
05:19 < flips> I went to usage, then to the first reference on the list
05:19 < flips> now things are happening
05:20 < MaZe> [actually filp_open seems to be the kernel-interface, not that it much matters]
05:20 < flips> that is true
05:20 < flips> see "arbitrary factoring" above
05:20 < flips> it's kind of a pile in some ways ;-)
05:20 < flips> in other ways it's beautiful
05:20 < flips> only about 3 of those ;-)
05:21 < flips> we'll get to some scary code now
05:21 < RazvanM> do_filp_open is pretty big...
05:21 < flips> path_lookup_open
05:22 < RazvanM> http://lxr.linux.no/linux+v2.6.26.5/fs/namei.c#L1238
05:22 < flips> it's big because it's implementing all of unix semantics + all of linux semantics + historical cruft + arcane voodooism nobody is quite sure about
05:23 < MaZe> so the vfs layer does permissions checking... not the fs itself?
05:23 < flips> we're going to stay away from path lookup to avoid brain damage
05:23 < flips> that is correct
05:23 < flips> the vfs checks permissions and does a lot of locking too
05:23 < flips> also implements the namespace caching
05:23 < flips> it does a huge amount of work
05:24 < MaZe> what's namespace caching?
05:24 < flips> dentry cache
05:24 < flips> every time you open a file, linux creates a dentry for the name
05:24 < flips> that lives in cache
05:24 < flips> dentry points at inode
05:24 < MaZe> so what does the dentry cache map between?
05:24 < flips> dentries are pretty big, inodes are pretty big memory structures too
05:25 < MaZe> filename and inode (possibly lack of inode)?
05:25 < flips> the dentry maps filename -> inode
05:25 < flips> in cache
05:25 < MaZe> does filename include full path?
05:25 < flips> only when there is a miss in the dentry cache does the vfs go to the filesystem
05:25 < flips> no, not the full path
05:25 < RazvanM> Important sizes:
05:25 < RazvanM> block 1024
05:25 < RazvanM> inode 300
05:25 < RazvanM> dentry 128
05:25 < RazvanM> bh 56
05:25 < RazvanM> kmem_cache 12
05:25 < flips> the parent inode and the filename
05:25 < MaZe> relative to the fs root?
05:25 < MaZe> ah
05:25 < flips> razvanm, nice
05:26 < RazvanM> flips might want to correct me ;-)
05:26 < flips> so every time you open a file, you get a dentry+inode+file
05:26 < flips> already a lot of cache memory
05:26 < flips> for a tiny thing maybe
05:26 < flips> with 6 bytes in it: echo hello >foo
05:26 < nataliep> oh that's why it gets pretty big
05:27 < flips> it's only the beginning
05:27 < nataliep> among other slabs
05:27 < flips> you also get an "address_space" for the inode
05:27 < flips> misnamed
05:27 < flips> that is the radix tree
05:27 < MaZe> so if the file is opened, the dentry + inode are locked in cache?
05:27 < flips> yes
05:27 < flips> and the whole chain of parents
05:27 < flips> up to the superblock of the fs
05:27 < flips> not locked
05:28 < flips> they can be evicted
05:28 < MaZe> usage count increased?
05:28 < flips> only until the inode goes away
05:28 < flips> sorry
05:28 < MaZe> you've lost me then.
05:28 < flips> the inode's use count is elevated
05:28 < flips> until the dentry goes away
05:28 < flips> it's about the nastiest part of the whole vfs
05:28 < flips> and we're here already
05:29 < MaZe> so dentries can come and go as they please?
05:29 < flips> what happens is, dentries spend a lot of their life sitting around in cache with zero use count
05:29 < flips> that's what happens if you open a file, do something, and close it
05:29 < RazvanM> $ cat /proc/slabinfo | egrep 'dentry|#'
05:29 < RazvanM> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
05:29 < RazvanM> dentry 253015 253576 132 29 1 : tunables 120 60 8 : slabdata 8744 8744 0
05:29 < flips> only when the vm comes along and tries to shrink the caches to recover memory do the dentries and inodes go away
05:29 < MaZe> note 132
05:30 < flips> yes, something has been pushing stuff out
05:30 < flips> it changes from time to time
05:30 < flips> these days, linux pushes too much cache out at the wrong times
05:30 < flips> you will notice that if you run on a slow machine
05:30 < flips> see, this is the real vfs course ;-)
05:31 < flips> ok, lifetime of objects in cache is one of the biggest touchy spots in linux
05:31 < flips> it's often very hard to know what owns what
05:31 < MaZe> there's no way to tell the vfs you're doing a bg filesystem scan and to not cache for eternity?
05:31 < flips> and yet, you have to when you work on fs code
05:32 < flips> there are various ways to tell it that
05:32 < flips> good ways is another question
05:32 < flips> we have the concept of hot and cold ends of the lru list
05:32 < flips> when something gets accessed, it gets moved to the hot end
05:32 < flips> and stuff is evicted from the cold end
05:32 < flips> in theory
05:32 < flips> in practice... well
05:33 < flips> linux has been benchmarked as worse than random replacement policy
05:33 < flips> somebody needs to go in and fix that
05:33 < MaZe> so a queue basically
05:33 < shapor> the only way i'm aware of to inform the kernel about your intentions is posix_fadvise, and that doesn't let you do much
05:33 < flips> a lru list, yes
05:33 < flips> acts like a queue
05:33 < shapor> certainly nothing related to dentries
05:33 < flips> old stuff is supposed to move down to the cold end and get evicted
05:34 < flips> shapor, thought you were on a plane
05:34 < flips> just a sec
05:34 < shapor> not yet
05:34 < flips> back
05:34 < MaZe> is there one global dentry cache? per cpu? per socket? per fs? per inode?
05:34 < flips> ok, we're not doing vm
05:34 < flips> this is vfs ;-)
05:34 < RazvanM> is global, right? :P
05:34 < flips> there is one global dentry cache
05:35 < flips> it is indexed by fs*dir*name
05:35 < flips> so it acts like one per fs
05:35 < MaZe> so it maps superblock:inode:filename -> inode?
05:35 < shapor> the only way i know of purging it is umount(), right flips?
05:35 < flips> yes
05:35 < flips> in general, yes
05:35 < flips> there are internal interfaces for purging
05:36 < flips> a fs has access to that
05:36 < flips> but almost nobody understands how to use that or cares to find out
05:36 < flips> if you get it wrong, al will bark at you
05:36 < MaZe> aren't we wasting a lot of memory by continuously keeping the 'fs' in there? most systems don't have that many mounted filesystems
05:36 < flips> we waste huge buckets of memory
05:37 < flips> yes, linux is a little special in this regard
05:37 < flips> dentry cache is a linux only thing
05:37 < flips> it gives a performance advantage in general
05:37 < flips> but it uses massive gobs of memory
05:37 < flips> it's tricky
05:38 < flips> you can always print out the path a file was opened by
05:38 < RazvanM> other OSs don't cache the dentries?
05:38 < flips> by following parent links in the dentry cache
05:38 < MaZe> I don't think other os's have dentries?
05:38 < flips> I'm not _that_ familiar with bsd etc
05:38 < flips> but I think not
05:38 < MaZe> earlier you'd said the dentries could be evicted?
05:38 < flips> the above gets interesting when the namespace topology is changing while you follow links
05:39 < flips> they can
05:39 < flips> let's go find the dentry cache
05:39 < MaZe> so how does that match up with being able to follow the parent links in the dentry cache?
05:40 < flips> hmm, no dentry.c
05:40 < RazvanM> http://lxr.linux.no/linux+v2.6.26.5/include/linux/dcache.h#L81
05:40 < MaZe> there's a dcache.h
05:41 < flips> ok, namei.c is the home of the dentry cache
05:41 < flips> inconsistent naming
05:42 < MaZe> struct dentry is defined in dcache.h though
05:42 < flips> in general in linux, you want to be looking for "get" and "put" operations
05:42 < flips> get means inc object count, put means dec
05:42 < flips> strange terminology, made in linux I think
05:43 < flips> dcache.c
05:43 < flips> http://lxr.linux.no/linux+v2.6.26.5/fs/dcache.c
05:43 < flips> so, dput
05:43 < flips> http://lxr.linux.no/linux+v2.6.26.5/fs/dcache.c#L185
05:44 < flips> big mess
05:44 < flips> but you have to get familiar with it
05:44 < flips> we also have iput, drop usage count of an inode
05:45 < flips> is this too much down in the nitty gritty?
05:45 < MaZe> nope
05:45 < MaZe> speaking for myself only of course - but the nitty gritty is always what I failed to grasp
05:45 < flips> figuring out how an inode gets released is challenging
05:46 < flips> look at iput, then iput_final
05:46 < MaZe> http://lxr.linux.no/linux+v2.6.26.5/fs/inode.c#L1149
05:46 < flips> generally, if the fs does not want to take care of something, the vfs will do it for it
05:46 < flips> this is the case in iput_final
05:47 < flips> normally, inodes are dropped by generic_drop_inode
05:47 < flips> there we see some classic unix
05:48 < flips> the decision whether to delete an unlinked inode or not
05:48 < flips> by this time, the dentry is long gone
05:48 < flips> so is the directory entry, if i_nlink is zero
05:48 < flips> we're nearly done for today
05:49 < MaZe> I'm a little surprised by the fact that an op can be null...
05:49 < flips> hour is coming up
05:49 < RazvanM> we still have 10 more minutes! :D
05:49 < flips> the op can be null because the vfs does it in that case
05:49 < flips> I'm going to answer questions for the next 10 minutes
05:49 * RazvanM was about to ask about deleting a directory
05:50 < MaZe> so you then have a null op in the superblock operations and instead have it handled through such if statements all over the place?
05:50 < flips> ok, let's go look at file_operations
05:50 < MaZe> that seems very non-OO
05:50 < flips> maze, correct
05:50 < flips> it's oo linux style
05:50 < flips> very few linux hackers know any oo language
05:50 < MaZe> is there a reason for that? that also seems worse performance wise...
05:51 < MaZe> since we then have the if instead of just calling the method
05:51 < flips> it doesn't cost much cpu
05:51 < flips> it's sloppy
05:51 < flips> and looks ugly
05:51 < flips> and is inconsistent
05:51 < MaZe> it's another branch that can be mispredicted though
05:51 < flips> every operation has its own custom way of doing things, usually
05:51 < flips> if the branch matters, we tell the compiler not to mispredict
05:52 < MaZe> ugh...
05:52 < flips> see "likely/unlikely"
05:52 < MaZe> yeah, I know
05:52 < flips> the inefficiencies here are somewhat covered up by the fact that there are slow disks underneath
05:52 < flips> and then, it's not really inefficient
05:53 < flips> the stuff that _can_ cost lots of cpu has been profiled and fixed long ago
05:53 < MaZe> ok, so it's just disgusting and extra code complexity ;-)
05:53 < flips> these days, it costs a lot more to contend a spinlock than mispredict a branch
05:53 < flips> yes, it's fairly disgusting
05:53 < flips> one never learns to love it ;-)
05:53 < flips> respect it, yes
05:54 < flips> it does a lot, has a huge amount of flexibility
05:54 < flips> ok, there was a question
05:54 < flips> let's go look at how ext2 deletes a directory
05:54 < MaZe> is stuff like this not fixable?
05:54 < MaZe> right, delete a directory.
05:54 < flips> the right person could fix it
05:54 < flips> you have to have memorized stevens
05:55 < flips> and you have to like fighting in pig shit
05:55 < flips> do you like fighting in pig shit?
05:55 < flips> because you have some of the other qualifications ;-)
05:56 < MaZe> I have a tendency to fight uphill battles, yes.
05:56 < flips> http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/namei.c#L275 <- ext2_rmdir
05:56 < flips> pretty easy to read
05:56 < flips> and write for that matter
05:56 < flips> I didn't say uphill ;-)
05:56 < flips> it's not a hill
05:57 < RazvanM> flips: ack
05:57 < flips> it's a ditch at the bottom of the farm
05:57 < MaZe> ah, but sh*t flows downhill, and if you're at the bottom
05:57 < MaZe> stevens - which book is that referring to?
05:57 < RazvanM> when I asked the question I did not remember that the OS will refuse to delete a non-empty dir :P
05:57 < flips> ext2_rmdir is plugged into this thing: http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/namei.c#L376
05:57 < flips> an *_operations structure
05:58 < flips> passes for an instance of a class in linux
05:58 < flips> Advanced Programming in the UNIX Environment, Addison-Wesley, 1992.
05:59 < flips> ok, now that we have found what ext2_rmdir is plugged into, we can follow it back up into the vfs
05:59 < flips> click on ext2_dir_inode_operations
06:00 < flips> sorry
06:00 < flips> click on inode_operations
06:00 < RazvanM> http://lxr.linux.no/linux+v2.6.26.5/include/linux/fs.h#L1250
06:00 < flips> then usage
06:01 < flips> lxr is spinning
06:01 < RazvanM> this is the slowest operation...
06:01 < flips> yes, and 3 doing it at the same time is enough to bring it to its knees
06:01 < flips> apparently
06:02 < RazvanM> actually, I would say it is as slow as usual
06:02 < flips> as you can see, this is a popular struct
06:02 < RazvanM> true
06:02 < flips> you are looking for the instances that are _not_ in a specific filesystem
06:03 < flips> fs/namei.c, line 2971
06:03 < flips> for example
06:03 < flips> whoops, not interesting
06:03 < MaZe> bad_inode.c inode.c libfs.c
06:03 < flips> razvanm probably had the right idea
06:04 < flips> yes, inode.c is good
06:04 < RazvanM> http://lxr.linux.no/linux+v2.6.26.5/fs/inode.c#L114
06:05 < RazvanM> uhh... we're 5 minutes over time :P
06:05 * RazvanM says big thanks!
06:05 < MaZe> ;-)
06:05 < flips> yep
06:05 < flips> so homework:
06:05 < flips> find out where inode_operations->rmdir is called
06:06 < flips> it isn't spelled that way
06:06 < flips> this is what makes linux fun ;-)
06:06 < flips> very little is spelled the way you would expect
06:06 < RazvanM> :D
06:06 < flips> ok, did we have fun today?
06:06 < nataliep> that was awesome, too short :) thanks to all... i love the format of this class :)
06:06 * RazvanM did have fun :-)
06:07 < flips> thanks natalie :-)
06:07 < RalucaME> thanx flips, was cool
06:07 < flips> the most important item is how to navigate lxr
06:07 < flips> welcome, ralucame
06:07 * RazvanM is RalucaME's twin ;-)
06:07 < RalucaME> :)
06:07 < MaZe> where it's called from outside of fs'es? or within?
06:08 < MaZe> way too short - agreed.
06:08 < MaZe> at this rate we'll need more than a few of these ;-)
06:08 < MaZe> Thanks!
06:08 < flips> aha
06:08 < flips> from the vfs
06:08 < flips> that is, outside the fs
06:08 < flips> things eventually start to fit a pattern
06:09 < flips> and you don't need me to suggest how to follow the twisty paths any more
06:09 < flips> at first it looks like random gibberish
06:09 < flips> then later, you learn it is actually random gibberish
06:09 < RazvanM> :-)
06:09 < flips> but it is fast and flexible gibberish
06:09 < MaZe> flips: i_op->rmdir ?
06:10 < flips> sounds good
06:10 < MaZe> ups ;-)
06:10 < nataliep> aww
06:10 < flips> I personally always spell my ops "ops"
06:10 < flips> makes it much easier to navigate
06:10 < flips> so you look for ops->rmdir and you always find it
06:11 < MaZe> the code was "inode->i_op = &empty_iops;" in alloc_inode
06:11 < flips> boring
06:11 < flips> haven't found the real one yet
06:12 < RazvanM> http://lxr.linux.no/linux+v2.6.26.5/fs/namei.c#L2256
06:13 < flips> that's it
06:13 < flips> gold star
06:14 < RazvanM> :P
06:16 < MaZe> oh, lol
06:16 < nataliep> more homework?
06:16 < MaZe> and I just sent it via pm
06:16 < flips> figure out how a struct inode gets deleted ;-)
06:17 * RazvanM will do pm next time
06:17 < MaZe> wild guess: iput gets called?
06:17 < flips> thursday again ok?
06:17 < MaZe> at 8pm?
06:17 < flips> maze, that's a good first order approximation
06:17 < flips> yes
06:17 < nataliep> sounds good
06:18 < MaZe> see you then!
06:19 < flips> cu