05:01 < flips> let me introduce you to one of the foremost kernel hackers in the known universe
05:01 < flips> eric biederman
05:01 < flips> say hi :-)
05:01 < ebiederm> hello all.
05:02 < flips> eric is responsible for much of what makes linux great in the supercomputing cluster space
05:02 * RazvanM also says Hello!
05:02 < flips> konrad, don't be shy ;-)
05:02 < RalucaME> hi Eric
05:02 < konrad> hello :)
05:02 < flips> well eric is not really a vfs guy, just a general genius
05:03 < flips> knows everything about everything nearly
05:03 < RazvanM> :-)
05:03 < ebiederm> lol
05:03 * RazvanM double checks that the logging is enabled
05:03 < bh> hey
05:03 < flips> also, nataliep_ up there is the linux kernel bug manager
05:03 < bh> more folks have joined, nice
05:03 < flips> ok, let's start
05:04 < flips> first let me ask some questions: what does VFS stand for?
05:04 < RazvanM> virtual file system
05:04 < flips> close but no
05:04 < RazvanM> subsystem? :D
05:04 * flips listens to the sound of googling
05:04 < MaZe> hey
05:04 < flips> maze!
05:05 < MaZe> yeah, so 8pm is a little tight ;-)
05:05 < flips> maze is about the smartest smart person I met a google
05:05 < MaZe> hehe, thanks!
05:05 < flips> no exaggeration
05:05 < flips> ok, let's try again: what does VFS stand for?
05:05 < flips> googlling is ok
05:05 * RazvanM is diluting the quality of the channel :P
05:06 < flips> I doubt that, razvanm
05:06 < konrad> virtual file system
05:06 < konrad> er wait
05:06 < RazvanM> switch?
05:06 < konrad> that's been said hasn't it
05:06 < flips> right!
05:06 < flips> see?
05:06 < flips> razvanm wins
05:06 < flips> it stands for virtual filesystem switch
05:06 < RalucaME> versioning file system :P
05:06 < konrad> firefox had 'AVFS' at the top of my url bar for vfs :(
05:06 < flips> how it got that name, I don't know
05:06 < RazvanM> it was the first hit for 'vfs lnux' :P
05:06 < flips> eric probably does
05:07 < ebiederm> lol
05:07 < MaZe> it switches between the different filesystems like a network switch switches between computers
05:07 < flips> somebody better find out, because it's sure to come up at a geek challenge context at linuxtag eventually
05:07 < flips> yes
05:07 < flips> it is a colletion of methods that together implement a filesystem
05:07 < MaZe> find out what?
05:08 < flips> how it came to be called that
05:08 < ebiederm> I know where it came from but not why they picked the name. When the implemented the second filesystem on BSD they needed an abstraction layer.
05:08 < RazvanM> the vfs.txt from Documentation says: Overview of the Linux Virtual File System
05:08 < flips> who came up with it
05:08 < flips> etc
05:08 < MaZe> ah
05:08 < MaZe> trivia ;-)
05:08 < flips> I knew eric would win that somehow ;-)
05:08 < flips> well let me tell you
05:08 < flips> the foremost filesystem dev on bsd does not know what vfs means
05:08 < RazvanM> :D
05:08 < flips> or who called it taht, or why
05:08 < flips> yet he is definitely the foremost fs dev
05:09 < flips> everybody know his name?
05:09 < flips> quick...
05:09 < flips> hint:
05:09 < MaZe> I suck at trivia... I'm lucky to know my own name...
05:09 < RazvanM> McKusick?
05:09 < flips> he engaged in a discussion re tux3 design recently
05:09 < flips> mckusick is close but no
05:10 < flips> hint: firefly
05:10 < konrad> Dillon?
05:10 < MaZe> the dragonfly hammer guy?
05:10 < flips> yes!
05:10 < konrad> Matt Dillon IIRC
05:10 < RazvanM> hammer?
05:10 < flips> also responsible for linux having a reverse mapped vm
05:10 < flips> used to be the bsd vm guy
05:10 < flips> is now the vm fs guy
05:10 < flips> and runs his own distro
05:10 < flips> intensely clueful person
05:10 < flips> ok
05:10 < flips> let's do some vfs
05:11 * RazvanM is ready
05:11 < flips> and let's start from the opposite end that we started from yesterday
05:11 < flips> everybody got their browsers ready?
05:11 < RazvanM> yesterday?
05:11 < flips> eh
05:11 < flips> day before yesterday
05:11 < flips> last time ;-)
05:11 < RazvanM> :D
05:11 < MaZe> loaded ;-)
05:12 < konrad> lxr.linux.no should be my homepage or something
05:12 < flips> lets go here: http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c
05:12 < flips> super.c is the "main" for a linux filesystem
05:12 < flips> we might call it tux3.c for tux3, or we might go with tradition and call it super.c
05:12 < MaZe> it's got module_{init,exit}
05:13 < flips> it has two basic tasks: 1) parse the mount options 2) load the fs superblock
05:13 < flips> right
05:13 < flips> it takes care of a few other details besides
05:13 < flips> so let's take a look at some really crappy parsing code
05:14 < flips> parse_options
05:14 < flips> http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L428
05:14 < konrad> line 429
05:14 < konrad> oops :)
05:14 < flips> depends on the version of course
05:15 < MaZe> 429 on mine as well
05:15 < flips> nothing really interesting here
05:15 < flips> just good to know where it is
05:15 < flips> so, there isn't actually such a thing as a linux "mount" program
05:15 < MaZe> so it gets a string and a pointer to the superblock?
05:15 < flips> all we do is call the fs's mount entry point
05:15 < flips> sbi
05:16 < flips> not quite the same
05:16 < flips> sbi is the filesystem-specific bit of a superblock
05:16 < MaZe> so that's the in-mem representation of an ext2 superblock
05:16 < flips> superblocks and inodes in linux are both generic structures
05:16 < flips> almost
05:16 < flips> re in-mem rep
05:17 < flips> there is also an exact image of the disk superblock that ext2 keeps around
05:17 < flips> I don't know if tux3 will bother
05:17 < flips> we shall see, that is a fiddly detain
05:17 < flips> the sbi corresponds to what is called struct sb in the tux3 userspace
05:18 < flips> and tux3 doesn't really have a generic superblock implemented at the moment
05:18 < flips> linux kernel does
05:18 < flips> superblock fields are separated into two classes: 1) ones that core vfs knows what to do with 2) ones that only mean something to the fielsystem
05:18 < flips> inodes are separated the same way
05:19 < flips> by a completely different mechanism, for not good reason
05:19 < MaZe> any idea what the 0pt_ in the tokens means?
05:19 < flips> the superblock specialization is via a fs-specific pointer
05:19 < MaZe> oh, its opt not 0pt ;-)
05:19 < RazvanM> http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L395
05:19 < flips> not really
05:20 < flips> 5 minutes of poking will answer that
05:20 < flips> or 1 minute
05:20 < flips> there is some fairly trivial macro magic going on here and there
05:20 < MaZe> [I mis-parsed as zero-pt font size...]
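
For readers following along, the token-table style of mount option parsing that parse_options uses can be sketched in userspace Python. This is only an illustration of the idea (a table of Opt_* tokens paired with patterns such as "resuid=%u"), not the kernel code; the option subset and the helper names compile_pattern and MATCHERS are assumptions made for this sketch.

```python
import re

# Illustrative token table, loosely modelled on the Opt_* tokens in ext2's
# parse_options(). The subset of options listed here is an assumption.
TOKENS = [
    ("Opt_debug", "debug"),
    ("Opt_nocheck", "nocheck"),
    ("Opt_resuid", "resuid=%u"),
    ("Opt_resgid", "resgid=%u"),
    ("Opt_sb", "sb=%u"),
]


def compile_pattern(pattern):
    """Turn a pattern such as 'resuid=%u' into an anchored regex."""
    literal_parts = [re.escape(part) for part in pattern.split("%u")]
    return re.compile("^" + r"(\d+)".join(literal_parts) + "$")


MATCHERS = [(token, compile_pattern(pattern)) for token, pattern in TOKENS]


def parse_options(options):
    """Split a comma-separated option string into (token, argument) pairs."""
    parsed = []
    for item in options.split(","):
        if not item:
            continue  # tolerate empty items such as "a,,b"
        for token, matcher in MATCHERS:
            match = matcher.match(item)
            if match:
                # Patterns without %u have no capture group, hence no argument.
                arg = int(match.group(1)) if match.groups() else None
                parsed.append((token, arg))
                break
        else:
            raise ValueError("unrecognized mount option: %r" % item)
    return parsed
```

Calling parse_options("debug,resuid=1000") yields one (token, argument) pair per option, which is roughly the information the real parser switches on.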
05:20 < flips> anyway
05:21 < flips> like I said, awful parsing code
05:21 < flips> used to be a lot worse
05:21 < flips> gets the job done in way too many lines
05:21 < flips> well lets look at a more interesting bit
05:21 < flips> loading the superblock
05:21 < flips> quite tricky
05:21 < flips> because the filesystem isn't working yet
05:21 < flips> we don't even know the blocksize
05:22 < flips> we have ext2_get_sb
05:22 < flips> which is stored in the ext2_fs_type structure
05:23 < MaZe> http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L366
05:23 < flips> of type "file_system_type"
05:23 < flips> this is the starting point for any filesystem
05:23 < flips> the tip of the iceberg
05:23 < flips> root of the tree
05:23 < flips> heart of the dragon etc
05:24 < flips> file_system_type defines a few methods, by far the most important of which is get_sb
05:24 < flips> this structure is passed to register_filesystem
05:24 < flips> when the module is initialized
05:24 < flips> which happens these days whether or not is actually a module by the way
05:25 < flips> and that makes the filesystem appear in /proc/filesystems
05:25 < flips> so everybody should do cat /proc/filesystems now
05:25 < flips> and tell what they see there that is really interesting
05:26 < RazvanM> lots of nodev's
05:26 < MaZe> lots of internal no-blockdev fs'es and 4 dev-fs'es
05:26 < flips> suggesting that nodiv is a stupid idea...
05:26 < flips> which is true
05:26 < flips> and?
05:26 < RazvanM> my oly non-nodev are ext3 and vfat :P
05:26 < MaZe> well, there's ext3, hfsplus, iso9660, fuseblk
05:26 < flips> right
05:26 < flips> and there is no tux3
05:26 < MaZe> and a ton of internal ones (usb, ramfs, etc...)
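
The "cat /proc/filesystems" exercise above can also be done programmatically. The following is a small sketch that separates the nodev entries from the block-device-backed ones; the function name split_filesystems and the SAMPLE text are made up for illustration, and the sample is not a capture from any participant's machine.

```python
def split_filesystems(text):
    """Return (nodev, blockdev) name lists from /proc/filesystems-style text."""
    nodev, blockdev = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 2 and fields[0] == "nodev":
            nodev.append(fields[1])
        else:
            # Lines without the "nodev" flag name a block-device filesystem.
            blockdev.append(fields[-1])
    return nodev, blockdev


# Hypothetical sample in the same tab-separated layout the kernel emits.
SAMPLE = "nodev\tsysfs\nnodev\tramfs\n\text3\n\tvfat\nnodev\tfuse\n\tfuseblk\n"
```

On a Linux machine, passing open("/proc/filesystems").read() instead of SAMPLE should list whatever has been registered via register_filesystem on that system.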
05:26 < RazvanM> :D
05:26 < flips> that is the most important thing to notice
05:26 < flips> and that is why there is a tux3 university
05:27 < flips> notice also that there is a ramfs
05:27 < flips> ramfs is the second most useful filesystem for learning about the vfs
05:27 < flips> the most useful being ext2
05:27 < RazvanM> also sockfs
05:27 < MaZe> is sockfs for unix domain sockets?
05:27 < flips> suckfs
05:27 < flips> right
05:28 < RazvanM> :-)
05:28 < MaZe> I'd prefer a shoefs
05:28 < flips> don't take anything from the net side of linux as an example of anything besides "fast"
05:28 < konrad> sk8fs
05:28 < flips> yup
05:28 < flips> I see "fuse"
05:28 < flips> interesting
05:28 < flips> in fact, 3 or them
05:28 < MaZe> fuse ,fuseblk, and fusectl
05:28 < flips> 3 of them
05:29 < flips> that's a little over the top
05:29 < MaZe> oh, this one is always a laugh: hugetlbfs
05:29 < flips> a naive person would think one would be enough
05:29 < flips> or would already be one too many
05:29 < flips> hugetlbfs is indded the worst fs ever conceived
05:29 < MaZe> what's the difference between rootfs/ramfs/tmpfs ?
05:29 < flips> sometimes even the great penguin has bad days
05:29 < flips> rootfs exists just to get linux booted
05:30 < flips> probably a bad idea
05:30 < flips> but that's how it works
05:30 < flips> ramfs is really interesting
05:30 < flips> it is basically just the vfs cache layer of a fs with all backing store stripped away
05:30 < flips> it is worth reading every line
05:30 < MaZe> is the split merely to be able to shave off more code in embedded?
05:31 < flips> it is split for tutorial reasons
05:31 < MaZe> ;-)
05:31 < flips> ramfs is to serve as an example of a minimal fs with no backing store
05:31 < flips> somehow it bloated up to 589 lines though
05:31 < flips> when it really only needs 150 maybe
05:32 < flips> so I guess somebody didn't get the memo ;-)
05:32 < flips> tmpfs is the real workhorse
05:32 < flips> that is basically ramfs backed by the swap device
05:32 < RazvanM> $ wc -l file-mmu.c
05:32 < RazvanM> 53 file-mmu.c
05:32 < flips> common mounted on /tmp these days
05:32 < flips> commonly
05:32 < flips> ok, I'll take a short break
05:32 < flips> to refill my cabernet
05:33 < MaZe> so tmpfs can be swapped out, while ramfs and rootfs can't
05:33 < konrad> linus pronounces 'vfs' as 'virtual filesystems' in ramfs/inode.c
05:33 < flips> and why don't you compare notes?
05:33 < flips> linus doesn't always get it right ;-)
05:33 < flips> tytso would normally clobber him in a geek trivial contest
05:34 < RazvanM> http://farm1.static.flickr.com/164/413387043_ab2c7569a4.jpg :P
05:34 < flips> :-)
05:35 < flips> the reflection isn't quite as nice here
05:35 < flips> but it does reflect, in this idea desk
05:35 < flips> ikea
05:35 * RazvanM also sits at an ikea desk ;-)
05:36 < flips> ok, let's go up to ext2_fill_super
05:36 < flips> we pass that as a method to a vfs library call
05:36 < RazvanM> http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L737
05:36 < flips> if you think that is an odd way to init a fs you'd be right ;-)
05:36 < flips> so what is an sbi?
05:37 * flips waits
05:37 < RazvanM> sb info
05:37 < konrad> ext2_sb_info ptr
05:37 < flips> right, and what points at it?
05:37 < konrad> sb->s_fs_info
05:38 < flips> right
05:38 < flips> so that is how the linux fs specializes a superblock
05:38 < flips> by haing s_fs_info point at something allocated and initialized by the fs
05:38 < flips> that only the fs will ever use
05:38 < MaZe> how does it know how big to make it?
05:39 < flips> http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L768 <- here we read the superblock
05:39 < konrad> MaZe: sizeof(*sbi)
05:39 < flips> maze, the fs declares it, and it makes it sizeof(that)
05:39 < MaZe> won't that be fs dependant though?
05:39 < flips> it is
05:39 < flips> that is why it is a fs-specific pointer field
05:39 < konrad> pointer is always the same size :)
05:39 < flips> core vfs will never look there
05:40 < flips> right
05:40 < MaZe> oh, right it's allocated within ext2 code
05:40 < flips> thank goodness for that small mercy
05:40 < flips> right
05:40 < MaZe> here http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L755
05:40 < RazvanM> http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L144
05:40 < flips> one can easily imagine a universe in which pointers on the same machine are not all the same size
05:41 < MaZe> keep'em beasties far away from me...
05:41 < flips> so there is a some braindamage about trying to use the "blocksize as the device" to load the superblock
05:41 < flips> bad idea
05:41 < flips> should just assume that it is always the same size
05:41 < flips> there is no legitimate concept of blocksize on a device, actually
05:42 < flips> never mind that I have coded one in my vfs emulation ;-)
05:42 < flips> that is a wart I will get rid of probably one day when it irritates me enough
05:42 < flips> only the fs sbi should know the blocksize of the filesystem
05:43 < flips> so, that nonsense about device blocksize is so that ext2 can use "sb_bread" to read the superblock
05:43 < flips> again, there is no reason for this
05:43 < flips> the tux3 userspace code directlly case "diskIo" there
05:43 < flips> bypassing the buffer emulation
05:43 < flips> and ext2 really should do the same, not have that fragile blocksize code there
05:44 < MaZe> not get_sb_bdev ?
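
The s_fs_info discussion above (a generic superblock that carries one opaque pointer to filesystem-private data, set up by a fill_super callback) can be modelled in a few lines of userspace Python. This is a toy model, not kernel code: the class layout, the mount_with helper and the numeric values are assumptions chosen for illustration, and only the names s_fs_info, ext2_fill_super and EXT2_SB echo the kernel source.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SuperBlock:
    """Generic part: fields that the (modelled) core VFS understands."""
    s_blocksize: int = 0
    s_magic: int = 0
    s_fs_info: Optional[Any] = None  # opaque to the core, owned by the fs


@dataclass
class Ext2SbInfo:
    """Filesystem-private part; the core never looks inside."""
    inodes_per_group: int
    blocks_per_group: int


def ext2_fill_super(sb):
    """The filesystem allocates its own info and hangs it off the superblock."""
    sb.s_fs_info = Ext2SbInfo(inodes_per_group=2048, blocks_per_group=8192)
    sb.s_blocksize = 1024   # illustrative values, not read from a disk
    sb.s_magic = 0xEF53
    return sb


def EXT2_SB(sb):
    """Accessor in the spirit of the kernel helper of the same name."""
    return sb.s_fs_info


def mount_with(fill_super):
    """Hypothetical stand-in for the VFS library call that takes fill_super."""
    return fill_super(SuperBlock())
```

The point of the model is that mount_with never needs to know the size or layout of Ext2SbInfo, which mirrors the answer to the sizeof question above: the filesystem declares and allocates the structure itself.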
05:44 < flips> right
05:44 < flips> equivalent of tux3 diskio
05:44 < flips> well
05:44 < flips> these fns have a lot of cruft attached
05:44 < flips> been through many iterations of doing things the wrong way
05:45 < flips> so you want to go to the lowest level thing that will actually read if you want to be clear and robust here
05:45 < flips> I'd be tempted to submit a bio
05:45 < flips> but anyway
05:45 < flips> we'll get there soon enough, and have to implement our own version of that
05:45 < flips> let's do it a little more cleanly, but we don't have to save the world
05:45 < flips> just now
05:46 < flips> 873 /* If the blocksize doesn't match, re-read the thing.. */ <- excellent example of yunk
05:46 < flips> huck
05:46 < flips> yuck
05:46 < flips> :-)
05:46 < flips> "yunk" is short for "yucky junk"
05:46 < flips> and "huck" is what we will do with that in tux3
05:47 < flips> so by here ext2 has managed to read its superblock: http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L898
05:47 < flips> should actually have only been 3 lines, though we did do some options processing as well
05:47 < flips> most of that is historical cruft
05:48 < flips> keep in mind that ext2 is one of the cleanest filesystems ;-)
05:48 < RazvanM> :D
05:48 < flips> http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L915 <- ext2 dutifully reads the frag size, even though this bsd ufs concept was never implemented and never will be
05:49 < flips> http://lxr.linux.no/linux+v2.6.26.5/fs/ext2/super.c#L941 <- it checks the super magic
05:49 < flips> tux3 gets to this point about 20 lines in or so
05:50 < flips> a few more than that actually
05:50 < flips> tux3.c
05:50 < flips> but in the kernel implementation, it will be about a dozen lines from the fill_super entry
05:50 < flips> as it should be
05:50 < flips> next big job is to read the root directory!
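
To make the "read the superblock, learn the block size, check the magic" sequence above concrete, here is a userspace sketch that decodes those two fields from raw bytes at a fixed byte offset, so no device block size is needed to find them. The field offsets follow the ext2 on-disk layout as I understand it (superblock at byte 1024, s_log_block_size at offset 24, s_magic at offset 56); the function names and the fake image builder are hypothetical.

```python
import struct

SUPERBLOCK_OFFSET = 1024    # the ext2 superblock starts 1024 bytes in
EXT2_SUPER_MAGIC = 0xEF53


def read_ext2_super(image):
    """Decode block size and magic from the raw bytes of an ext2 image."""
    raw = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    if len(raw) < 58:
        raise ValueError("image too short to hold a superblock")
    (log_block_size,) = struct.unpack_from("<I", raw, 24)
    (magic,) = struct.unpack_from("<H", raw, 56)
    if magic != EXT2_SUPER_MAGIC:
        raise ValueError("bad magic 0x%04x" % magic)
    # The on-disk field stores log2(block size) relative to 1 KiB.
    return {"block_size": 1024 << log_block_size, "magic": magic}


def make_fake_image(log_block_size=2):
    """Build a zero-filled test image with only the two fields populated."""
    image = bytearray(4096)
    struct.pack_into("<I", image, SUPERBLOCK_OFFSET + 24, log_block_size)
    struct.pack_into("<H", image, SUPERBLOCK_OFFSET + 56, EXT2_SUPER_MAGIC)
    return bytes(image)
```

Reading by byte offset is the userspace analogue of the point made in the discussion: only the filesystem's own superblock needs to say what the block size is.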
05:51 < flips> this is exciting because the filesystem isn't working yet
05:51 < MaZe> wouldn't it be enough to just get the rootdir's inode number?
05:51 < flips> we need to get the root dir up and running as an inode
05:51 < flips> so that that open (2) and readdir work on it
05:51 < flips> so yes
05:52 < flips> we need to know the rootdirs inode number
05:52 < flips> that has evolved over time with ext2
05:52 < flips> it used to just be a fixed number
05:52 < flips> now there is a fancier method
05:52 < flips> for no good reason
05:53 < flips> Tux3 uses inode number 0xd (for "directory" or "daniel") for the root dir
05:53 < MaZe> http://lxr.linux.no/linux+v2.6.26.5/include/linux/ext2_fs.h#L61
05:53 < flips> right
05:53 < flips> somewhere there is "good ol'" something
05:54 < MaZe> first non-reserved is 11
05:54 < flips> I might have conflated that with something else
05:54 < MaZe> #define EXT2_GOOD_OLD_FIRST_INO 11
05:54 < flips> well it doesn't matter except for geek quizes
05:54 < flips> yah
05:54 < flips> that's it
05:54 < flips> ok, we have 6 minutes for questions
05:54 < flips> going to stop right here, just before doing anything interesting ;-)
05:55 < RalucaME> exactly! :O
05:55 < konrad> ouch
05:55 < MaZe> when's the next meeting? next tuesday at 8pm?
05:55 < RazvanM> this lesson was definitely shorter...
05:55 < flips> well it was fun looking at all that busy looking code that doesn't actually do much, no?
05:55 < RalucaME> it seemed too little time
05:55 < RazvanM> how about tomorrow? :D
05:55 < flips> next tuesday, yes
05:55 < MaZe> yeah, tomorrow works
05:55 < MaZe> tuesday, then?
05:56 < flips> homework is: know how the root dir is loaded and initialized, and now that differs from how any other inode is opened
05:56 < flips> and how
05:56 < flips> I meant
05:56 < flips> tomorrow is friday ;-)
05:56 < MaZe> so what's the 'desired' way to read data off disk in a fs? submit bio-s?
would that also be the best way to read the superblock (you seem to have suggested that)
05:57 < RazvanM> friday is my most productive day :P
05:57 < flips> not only do I have to relax then, I have to get atom refcounting working
05:57 < flips> maze, I like submit_bio, yes
05:57 < flips> then you have to wait on some lock
05:57 < MaZe> is that the lowest level interface to the block device layer?
05:57 < flips> two or three lines
05:57 < flips> it is
05:57 < flips> the lowest one you can use without getting shouted at
05:57 < MaZe> does it support priorities?
05:57 < flips> depends on the elevator
05:58 < flips> mostly linux elevators are pretty crappy
05:58 < flips> no good rt elevator for example
05:58 < flips> if that's what you're asking
05:58 < MaZe> yeah, something like that
05:58 < flips> feel free to write a noncrappy one
05:58 < flips> you're the man to do it
05:59 < MaZe> if I'm operating on behalf of a user, and he's running at some prio, or asking for some priority on his read/write/file op, than I'd like to be able to pass that down to the blockdev layer
05:59 < flips> yes, and save us from that broken pos that is the current io scheduler
05:59 < MaZe> I mean I obviously shouldn't be dealing with that in the fs, except for making sure I submit requests with the right priorities
05:59 < flips> no, not in the fs
05:59 < flips> though one can imagine the fs making suggestions
06:00 < flips> and a realtime fs most certainly has to interact with the io scheduler
06:00 < RazvanM> (submit_bio is used only by xfs, ocfs2, jfs, gfs2 and ext4)
06:00 < flips> the fs also has to answer the question "can I submit this request at all, and meet the constraints"
06:00 < MaZe> what about networking? how would you go about sending/receiving udp? tcp? raw frames? other protocol?
06:00 < MaZe> (what do the others use?)
06:00 < flips> only the fs can know certainly crucial information about those constrainnts
06:01 < flips> networking?
06:01 < flips> sorry, missed the connection
06:01 < flips> you mean realtime?
06:01 < MaZe> [have to be careful - low prio process fetches a directory, higher priority process than needs to fetch it again - needs to result in increasing the bio priority or resubmitting it or something]
06:01 < flips> razvanm, notice that submit_bio is used in all _modern_ fs's
06:01 < MaZe> networking connection - I'm imagining a disk and network based multi-node fs
06:02 < MaZe> I'm imaginative ;-)
06:02 < flips> gfs2 only loosely meeting that definition
06:02 < RazvanM> flips: right :D
06:02 < flips> maze,you're already IO fixing priority inversion?
06:03 < flips> ah
06:03 < flips> right
06:03 < MaZe> no, just pointing out you have to be careful
06:03 < flips> that kind of networking
06:03 < flips> you do
06:03 < flips> and as a rule we are not
06:03 < flips> far from it
06:03 < flips> tcp/ip is not realtime
06:03 < flips> however
06:03 < MaZe> so there's a lot of things I'd like to work on if I had the time ;-)
06:03 < flips> you can kinda sorta pretend it is, sometimes
06:03 < flips> right
06:03 < MaZe> networking is real-time if you have caching done correctly ;-)
06:04 < flips> really?
06:04 < flips> you will have to convince me of that
06:04 < flips> I think that random backout already makes it not realtime
06:04 < flips> CSMACD
06:04 < flips> or something like that
06:04 < MaZe> oh, ok, I don't mean RT as in rtlinux rt
06:04 < flips> carrier sense multiple access collision detect
06:04 < MaZe> I meant usable on a desktop
06:04 < flips> ah
06:04 < flips> I always mean actual rt when somebody says rt
06:04 < RazvanM> flips: if the data is represented identified by some hashes over them then it could be ;-)
06:05 < MaZe> I meant usable and not get killed by background tasks
06:05 < flips> there linus and I differ
06:05 < RazvanM> (for reading)
06:05 < MaZe> uhm, I never mentioned rt ;-)
06:05 < flips> razvanm, what could be?
06:05 < flips> maze, ok
06:05 < flips> sorry
06:05 < flips> try again?
06:05 < RazvanM> too many threads :D
06:05 < flips> yup
06:05 < MaZe> while rt is nice of course, and you should design with making it possible in the future of course
06:05 < flips> the phillips switch is overloading
06:06 < MaZe> I just wanted fg tasks to be able to run at higher priority than bg tasks (a garbage collector or bg file scan or ...)
06:06 < MaZe> ;-)
06:06 < flips> ok, well a single node filesystem has no business knowing anything about networking
06:06 < MaZe> right
06:06 < flips> yes, you have control over that
06:06 < flips> complete control
06:06 < flips> you are root
06:06 < flips> beyond root
06:06 < MaZe> we've already determined that a fs has to provide some interfaces to the vfs layer, and it interfaces with the blockdev layer via bio's
06:07 < flips> there's only one limitation to what a filesystem in linux can do: use symbols that are not exported to modules, when it is compiled as a module
06:07 < MaZe> add in some atomics/locks/primitives already provided by the kernel and mem management, and you have all pieces ;-)
06:07 < flips> yes
06:07 < flips> watch out for layer violations
06:07 < flips> but in general, go crazy
06:07 < MaZe> so basically, now the question was: how to implement nfs - what would the interface not to blockdev, but to network, be?
06:08 < flips> there is not much to do
06:08 < flips> nfs basically runs on top of a filesystem that doesn't even have to know its there
06:08 < flips> there are a few small, weird hooks
06:08 < MaZe> uhm?
06:08 < flips> the details of which I forget
06:08 < flips> nfs stacks on top of a host fs
06:08 < flips> the host fs doesn't have to know it's being stacked on
06:09 < flips> it just have to behave itself
06:09 < flips> like a unix fs
06:09 < MaZe> what do you mean by host fs? oh you mean for the nfs server?
06:09 < flips> that's actually pretty hard ;p)
06:09 < MaZe> I was thinking about the nfs client
06:09 < flips> right
06:09 < flips> ah, nfs client
06:09 < flips> strange exception to pretty much everything
06:09 < flips> it stacks on top of a remote host fs
06:10 < flips> with all the oddities that implies
06:10 < MaZe> including mid-flight reboots
06:10 < flips> indeed
06:10 < flips> there are papers written about how much this sucks
06:10 < flips> let me see
06:10 < flips> http://www.cc.gatech.edu/classes/AY2007/cs4210_fall/papers/nfsOLS.pdf
06:10 < MaZe> the reboot? yeah, that's terrible, but it can be done in a way that it would work
06:10 < flips> marginally
06:10 < MaZe> you'd detect remote server reboot and have to dump caches, etc...
06:11 < flips> I've been living/breathing that for the last 3 years
06:11 < MaZe> I know ;-)
06:11 < flips> yes, but we don't
06:11 < flips> it's pathetic
06:11 < flips> nobody pays attention to statd
06:11 < flips> except lockd
06:11 < flips> no excuse
06:11 < MaZe> oh, I'm not thinking about NFS, I hate NFS, I'm thinking about a networkfs
06:11 < flips> sun braindamage
06:11 < flips> and linux too, because we should have fixed it by now
06:12 < flips> oh a real networkfs
06:12 < MaZe> just trying to figure out what the layering is there vfs / networkfs (missing this interface layer) networking
06:12 < flips> well, lustre is getting close
06:12 < flips> oscfs2 also
06:12 < flips> I'm sure you will crack that one
06:12 < flips> will be fun to watch your progress
06:12 < flips> in the meantime, goals with tux3 are modest
06:12 < MaZe> I need more than 24 hours in a day
06:13 < flips> that is: support nfs no worse than any other filesystem
06:13 < flips> hopefully much better
06:13 < MaZe> hehe
06:13 < flips> ebiederm, thanks for visiting
06:14 < flips> I hope we did not disappoint ;-)
06:14 < RazvanM> an OT question: why hg and not git?
06:14 < flips> ok, it is back to the question of atom refcounting
06:14 < flips> you been following the thread, maze?
06:15 < MaZe> sorry, which thread?
06:15 < flips> razvanm, hg is a lot more usable than git
06:15 < MaZe> about mercurial?
06:15 < flips> instand on
06:15 < flips> maze, no, about xattr atoms
06:15 < MaZe> ah, no.
06:15 < flips> on the tux3 list
06:15 < MaZe> should I?
06:15 < flips> please
06:15 < flips> you subscribed?
06:16 < MaZe> glancing ;-)
06:16 < flips> I think I subscribed you
06:16 < MaZe> more xattr design details?
06:16 < flips> right, and associated posts
06:16 < flips> the parent of that is the root of that tree
06:17 < MaZe> uhm, gmail doesn't do trees ;-)
06:17 < MaZe> they should fix that
06:17 < flips> :p
06:17 < flips> it's only beta
06:17 < MaZe> right, it's also slow...
06:17 < flips> let me see
06:17 < flips> I know, I run exim4 here and it's beyond fast
06:17 < flips> it's scary
06:18 < MaZe> so I'm a big fan of atoms, because the space saving can be extreme
06:18 < flips> [Tux3] The long and short of extended attributes
06:18 < flips> ah, I like the sound of that
06:18 < MaZe> you probably want to support even more atoms for selinux... but then the code gets complex
06:18 < flips> I've been doing a lot of introspecting about it
06:18 < MaZe> so you have the easy solution - use no atoms
06:18 < flips> always on the verge of mass deleting that code
06:19 < flips> I know, but I also feel its lame
06:19 < MaZe> and just store rep { string=string }
06:19 < flips> no null's thanks ;-)
06:19 < flips> ext3 is 8 bit clean
06:19 < flips> but otherwise yes
06:19 < MaZe> (mind you I'd actually store that in reversed order, at the front of the file, going backwards towards negative offsets)
06:20 < flips> reccount, namecount, ,
06:20 < MaZe> have it stored the same way as the rest of the file data
06:20 < flips> ?
06:20 < MaZe> xattr1=value1 xattr2=value2 filecontent="hello" ==>
06:21 < flips> sorry, I meant tux3 is 8 bit clean
06:21 < flips> where are the negative offsets?
06:21 < flips> oh I see
06:21 < MaZe> 2eulav=2rttax 1eulav=1rttax hello
06:21 < MaZe> | offset 0 at [H] in hello
06:21 < flips> demented ;-)
06:21 < flips> interesting idea
06:21 < MaZe> it means you don't have to implement it though ;-)
06:22 < flips> well the page cache doesn't have negative offsets
06:22 < flips> you'd have to store at the top of the index range
06:22 < flips> that's a good idea
06:22 < flips> it should work out fine
06:22 < flips> means you can't quite have a 16 TB file on 32 bit linux though
06:23 < flips> 16 TB less the maximum size of attributes
06:23 < MaZe> no, you shave it down by however many xattrs you have
06:23 < MaZe> so maybe a few kilobytes - in the future maybe more... who knows
06:23 < flips> ok, that's twisted enough for me
06:23 < MaZe> in what sense is it twisted?
06:23 < flips> works perfectly on 64 bit linux... probably find a couple of radix tree bugs
06:24 < flips> eeking out a small simplification by using the other end of the address range
06:24 < flips> twisted
06:24 < flips> I like it
06:24 < MaZe> right you have to be signedness clean, or you can offset everything by a zero offset constant
06:24 < flips> right
06:24 < flips> like I way
06:24 < flips> probably turn up a couple core linux bugs there
06:24 < flips> but worth doing just for that reason
06:24 < MaZe> or you can even just store it like this 0:hello empty space for expansion reverse xattrs :-1
06:25 < MaZe> since you have to support holes anyway...
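
The layout MaZe describes (file data growing up from offset 0, xattrs stored reversed and growing down from the top of the address range, with a hole in between) can be modelled with a toy sparse file. Everything here is a sketch under assumptions: the 2**44 byte address space (2**32 page indexes times 4 KiB pages), the length-prefixed record format and all class and function names are invented for illustration and are not the tux3 format.

```python
import struct

ADDRESS_SPACE = 1 << 44  # assumed: 2**32 page indexes * 4 KiB pages = 16 TiB


class SparseFile:
    """Toy sparse byte store; offsets never written read back as zero."""

    def __init__(self):
        self.cells = {}  # offset -> byte value

    def write(self, offset, data):
        for i, byte in enumerate(data):
            self.cells[offset + i] = byte

    def read(self, offset, length):
        return bytes(self.cells.get(offset + i, 0) for i in range(length))


def store_xattrs(f, xattrs):
    """Write length-prefixed records, reversed, ending at the top of the range."""
    blob = b"".join(
        struct.pack("<HH", len(name), len(value)) + name + value
        for name, value in xattrs.items()
    )
    blob += struct.pack("<HH", 0, 0)  # explicit terminator record
    f.write(ADDRESS_SPACE - len(blob), blob[::-1])


def read_down(f, pos, length):
    """Return `length` bytes of the logical blob starting at position `pos`."""
    return f.read(ADDRESS_SPACE - pos - length, length)[::-1]


def load_xattrs(f):
    """Walk downward from the top until the terminator (or a hole) is hit."""
    xattrs, pos = {}, 0
    while True:
        name_len, value_len = struct.unpack("<HH", read_down(f, pos, 4))
        if name_len == 0:
            return xattrs
        pos += 4
        name = read_down(f, pos, name_len)
        pos += name_len
        xattrs[name] = read_down(f, pos, value_len)
        pos += value_len
```

Because holes read as zeros, an untouched file simply reports no xattrs, and length prefixes rather than NUL separators keep values 8-bit clean, as flips asks for above.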
06:25 < flips> sure
06:25 < flips> it allows us to treat xattrs more like file data in kernel
06:25 < flips> that's a tux3 meme
06:25 < MaZe> exactly
06:25 < flips> so I like it
06:25 < MaZe> it means xattr support in the fs on-disk image is basically free
06:26 < flips> for now we have the "xcache"
06:26 < flips> which is even faster to access than a page cache mapping page
06:26 < flips> well
06:26 < flips> hmm
06:26 < flips> is it?
06:26 < flips> somewhat
06:26 < MaZe> I think it's mostly free
06:26 < flips> gets close
06:26 < flips> I was going to have separate btree for big xattrs
06:27 < flips> and small ones go inthe inode, just like immediate file data
06:27 < MaZe> (still imagining a world with just one btree)
06:27 < flips> but mapping intermediate sized attributes into the top of the file address space is a possibility
06:27 < MaZe> theoretically you can put almost all file metadata at the -1 point
06:27 < MaZe> not only xattrs
06:27 < flips> thejust one btree idea has already been done, it's called hammer
06:27 < MaZe> not sure how that would work for performance
06:28 < MaZe> but you'd get versioning for free
06:28 < flips> I think that two level btree is significantly more cache efficient
06:28 < flips> I've played with mapping file metadata into the file address space before
06:28 < MaZe> perhaps.
06:28 < flips> without joy
06:28 < flips> spent a lot of mental energy on it, found no real wins
06:28 < MaZe> where are the problems?
06:29 < flips> finding a reason to do it
06:29 < flips> an example that runs faster
06:29 < MaZe> yeah, it's probably worth optimizing the hell out of inode stat time
06:29 < flips> stat time?
06:29 < flips> ah
06:29 < flips> yes
06:30 < MaZe> how fast you can stat a bunch of inodes
06:30 < flips> tux3 is going to work very well there
06:30 < flips> basically just run down the inode table
06:30 < MaZe> wait a minute, it's a table? not a btree?
06:30 < flips> and the inode table will be intitionally laid out in a clumpy way
06:30 < flips> it's a btree
06:30 < MaZe> oh, ok.
06:30 < flips> call it a table for historical reasons
06:31 < flips> variable size inodes
06:31 < flips> a tux3 exclusive, maybe
06:31 < flips> really defines the design and implementation
06:31 < MaZe> 2) Refcount all atoms and delete any that fall to zero <- my vote
06:31 < flips> mine too
06:31 < flips> just challenging to do as fast as the crude approach
06:31 < MaZe> possibly delaying cleanup till unmount, not sure if that would ease up anything though
06:32 < flips> tux3 has the concept of log rollup
06:32 < flips> I'll be posting about that in much more detail over the next week or so
06:32 < flips> it's continuous cleanup
06:32 < flips> doesn't have to be a flurry of cleanup wither at umount or mount
06:32 < flips> or remount after crash even
06:33 < MaZe> you can actually put it in the btree ;-)
06:33 < flips> why?
06:33 < MaZe> you want search through it to be efficient - both ways
06:33 < flips> oh right
06:33 < MaZe> both atom -> string conversion and string -> atom conversion
06:33 < flips> interesting idea
06:33 < flips> oh
06:33 < flips> I thought you meant the log
06:33 < MaZe> have some reserved btree prefix
06:34 < flips> of course the atom table will be a btree
06:34 < flips> it will be an HTree in facrt
06:34 < flips> fact
06:34 < MaZe> the log? yeah though about how the log could be in the btree
06:34 < MaZe> even had some half-baked concept, but didn't think about it long enough to really know if that's worth even thinking about
06:34 < flips> turns out that the deficiencies of HTree that make it tough to implement readdir accurately don't apply at all to the xattr atom use case
06:34 < flips> and htree is just about optimal for that
06:35 < MaZe> atom->string is just an array, since there's no holes
06:35 < flips> as far as reverse conversion goes...
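
Option 2 above (refcount every atom and delete it when the count falls to zero), together with the two lookup directions MaZe mentions, can be sketched as an in-memory table. This is an illustration only: a Python dict stands in for the HTree, a list stands in for the atom -> string array, and the class name, method names and the slot-recycling policy are assumptions rather than the tux3 design.

```python
class AtomTable:
    """Maps xattr name strings to small integers (atoms), with refcounts."""

    def __init__(self):
        self.names = []   # atom -> name; None marks a freed slot
        self.refs = []    # atom -> reference count
        self.atoms = {}   # name -> atom (stand-in for the on-disk index)
        self.free = []    # freed atom numbers available for reuse

    def get(self, name):
        """Take a reference on the atom for `name`, creating it if needed."""
        atom = self.atoms.get(name)
        if atom is None:
            if self.free:
                atom = self.free.pop()   # reuse keeps the atom space dense
                self.names[atom] = name
            else:
                atom = len(self.names)
                self.names.append(name)
                self.refs.append(0)
            self.atoms[name] = atom
        self.refs[atom] += 1
        return atom

    def put(self, atom):
        """Drop a reference; delete the atom when the count reaches zero."""
        if self.names[atom] is None:
            raise ValueError("atom %d is not in use" % atom)
        self.refs[atom] -= 1
        if self.refs[atom] == 0:
            del self.atoms[self.names[atom]]
            self.names[atom] = None
            self.free.append(atom)

    def name(self, atom):
        """Atom -> string direction; returns None for a freed slot."""
        return self.names[atom]
```

Recycling freed numbers is one way to keep atoms dense, which matters if the atom number field is small; whether tux3 recycles this way is not stated in the discussion.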
06:35 < flips> there are two ideas I'm considering 06:35 < flips> one is to use the address of the dirent as the atom number 06:36 < flips> this decreases the density of the atom space somewhat 06:36 < flips> by a factor of 4 to be precise 06:36 < MaZe> huh, how does that work? 06:36 < MaZe> oh, right, I think I see 06:36 < flips> just look up the dirent and return the offset from the beginning of the file as the atom number 06:36 < MaZe> have the atoms themselves be pointers 06:36 < MaZe> cute 06:36 * flips has to put a different keyboard on this machine with a better space bar 06:36 < flips> right 06:37 < flips> the other option is to have a reverse lookup table, that points back at the dirents 06:37 < MaZe> potentially div 4 or something to make em more likely to fit in a byte 06:37 < flips> I favor the second 06:37 < flips> because I like the atoms to be as dense as possible 06:37 < flips> for compression reasons 06:37 < flips> I already took the div4 into account ;-) 06:37 < MaZe> I'm still not convinced compression of this part of the fs really matters... 06:38 < flips> sure it does 06:38 < flips> atom number field is currently 16 bits 06:38 < flips> 64K atoms 06:38 < flips> before having to go to a 32 bit atom number 06:38 < flips> that's comfortable 06:38 -!- stargazr5 [~gauravstt@59.95.38.255] has joined #tux3 06:38 < flips> 14 bits not so much 06:38 < flips> still 06:39 < flips> could go either way on that 06:39 < MaZe> Terrible hack: 06:39 < MaZe> $ getfattr -n user.hash -e text -h --absolute-names -L xhash 06:39 < MaZe> # file: xhash 06:39 < MaZe> user.hash="1114234:1191:1219215805:e233bf8dd0415ec9b7fea0193803357c:6325f0060bd5f23cf6ba106fd6500efa76d9bc5e" 06:39 < MaZe> Storing mtime/md5sum/sha1sum in a xattr for fs recovery ;-) 06:39 < flips> got to decide by midnight ;-) 06:39 < flips> ? 
06:40 < MaZe> so I store the mtime:md5sum:sha1sum of each file on my drive in a xattr for that file 06:40 < MaZe> I get constant time md5sum calculation on files 06:40 < MaZe> ease of verifying file integrity 06:40 < flips> cool 06:40 < MaZe> and I can verify integrity of files in case of fs crash (ie. like when I upgraded to 2.6.27-rc3) 06:41 < flips> I think I like it more than zfs "checksum everything" mentality 06:41 < flips> makes sense to only checksum logically 06:41 < MaZe> and yes it does need to be regenerated on file modifications, so the newest files lack it 06:41 < flips> sha1 is ok only if you want cryptographic verifiability, otherwise it's slower than necessary 06:42 < MaZe> compare that with my laptop 20mb/s read speed... 06:42 < MaZe> and it doesn't matter 06:42 < flips> it matters if you're running a server 06:42 < flips> a lot 06:42 < MaZe> true 06:42 < flips> option? 06:42 < MaZe> right 06:42 < MaZe> probably include something like crc64 or whatever cheap 64-bit hash you can find 06:43 < MaZe> (no idea what a fast good 64-bit hash is nowadays) 06:43 < flips> crc is bad 06:43 < flips> funnels to hell 06:43 < flips> dx_hack_hash is getting closer 06:43 < flips> uses a hacked lfsr idea 06:43 < flips> needs analysis 06:44 < flips> maze, you'd be good at that 06:44 < flips> I think 06:44 < MaZe> we appear to have scared off everybody else... no one is asking any other questions ;-) 06:44 < flips> yeah 06:44 < MaZe> analysis of speed? or of hash spread? 06:44 < flips> and they're the ones who actually check in code ;-) 06:44 < flips> got to be careful about that 06:45 < flips> hash spread 06:45 < flips> etc 06:45 < MaZe> I'm in the middle of a cluster turn up... 
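[Editor's sketch] MaZe's per-file hash xattr could look roughly like the following. This is only an illustration, not MaZe's actual script: the three-field `mtime:md5:sha1` layout follows his description (the sample value above carries two extra leading fields that the log does not explain), and the function names `hash_record`, `record_is_fresh`, `store_record` and `verify` are made up here. `os.setxattr` and `os.getxattr` are Linux-only and need a filesystem with user xattrs enabled.

```python
import hashlib
import os


def hash_record(path):
    """Return 'mtime:md5:sha1' for the file at path (assumed field layout)."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files are not loaded whole.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
            sha1.update(chunk)
    mtime = int(os.stat(path).st_mtime)
    return f"{mtime}:{md5.hexdigest()}:{sha1.hexdigest()}"


def record_is_fresh(path, record):
    """True if the mtime stored in the record still matches the file."""
    stored_mtime = int(record.split(":", 1)[0])
    return stored_mtime == int(os.stat(path).st_mtime)


def store_record(path, name="user.hash"):
    """Write the record into an xattr (Linux only, needs user xattr support)."""
    os.setxattr(path, name, hash_record(path).encode())


def verify(path, name="user.hash"):
    """Return 'ok', 'stale' or 'corrupt' by comparing the xattr to the file."""
    stored = os.getxattr(path, name).decode()
    if not record_is_fresh(path, stored):
        return "stale"  # file was modified after the record was written
    return "ok" if stored == hash_record(path) else "corrupt"
```

Reading the stored md5 back is a single xattr lookup, which is the "constant time md5sum" MaZe mentions; the full rehash only happens in `store_record` and `verify`.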
06:45 < flips> speed is about optimal 06:45 < flips> I made sure of that 06:45 < flips> well 06:45 < flips> truth be told I could make it much faster 06:45 < MaZe> hopefully I can at least provide 'inspiration' or something 06:45 < flips> it's meant for hashing short strings with good spread 06:46 < flips> short, very nonrandom strings 06:46 < flips> does a good job of that 06:46 < MaZe> re: atom refcounting 06:46 < MaZe> you don't have to sync it to disk really if you are hacky/smart about it 06:46 < flips> I'm going to post the results of my design thinking from the skate earlier 06:47 < flips> really? 06:47 < MaZe> since you can put it in the log 06:47 < flips> sounds like magic 06:47 < flips> of course 06:47 < flips> planned 06:47 < flips> or I wouldn't have gone this route at all 06:47 < MaZe> and if the order is right, then it can never get out of sync 06:47 < flips> again, of course 06:47 < MaZe> and the entire thing should be small enough you can periodically just write out a new copy 06:47 < flips> I've been computing the exact percentages of log bandwidth that will be required ;-) 06:47 < MaZe> of the entire thing 06:47 < flips> again, of course 06:48 < flips> but we don't 06:48 < flips> we even do that incrementally 06:48 < MaZe> and: you can afford to lose decrements, since at most the ref counts will be too high 06:48 < flips> and arrange the structures that have to be updated to be close together 06:48 < flips> and compact 06:48 < flips> ah 06:48 < MaZe> which is kind of dirty... 06:48 < flips> really? 
06:48 < flips> way dirty 06:48 < flips> but there is likely something there 06:48 < flips> you can't lose track long term 06:49 < flips> that would be bad 06:49 -!- tim_dimm [~timothyhu@adsl-67-114-40-138.dsl.scrm01.pacbell.net] has joined #tux3 06:49 < flips> but you can do a false-positive-to-be-tested-later kind of thing 06:49 < MaZe> I'd guess most fs'es will have 2 dozen or less atoms 06:49 < flips> hey tim_dimm 06:49 < flips> welcome back, daddy! 
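[Editor's sketch] The refcounted atom table discussed above (option 2: refcount all atoms and delete any that fall to zero) can be pictured with a small in-memory model. This is an assumption-heavy illustration, not tux3 code: the class `AtomTable` and its methods are invented here, a Python dict stands in for the on-disk HTree that flips plans for string -> atom lookup, and a free list is one possible way to keep atom numbers dense once deletions punch holes in the array.

```python
class AtomTable:
    """Toy model: xattr names interned as small integers with refcounts."""

    def __init__(self):
        self.names = []   # atom -> string: a dense array
        self.atoms = {}   # string -> atom: stand-in for the HTree lookup
        self.refs = []    # reference count per atom
        self.free = []    # atom numbers whose count fell to zero

    def get(self, name):
        """Take one reference on the atom for name, allocating if needed."""
        atom = self.atoms.get(name)
        if atom is None:
            if self.free:
                # Reuse a freed number so the atom space stays dense.
                atom = self.free.pop()
                self.names[atom] = name
                self.refs[atom] = 0
            else:
                atom = len(self.names)
                self.names.append(name)
                self.refs.append(0)
            self.atoms[name] = atom
        self.refs[atom] += 1
        return atom

    def put(self, atom):
        """Drop one reference and delete the atom when the count hits zero."""
        self.refs[atom] -= 1
        if self.refs[atom] == 0:
            del self.atoms[self.names[atom]]
            self.names[atom] = None
            self.free.append(atom)

    def name(self, atom):
        """atom -> string conversion is a plain array index."""
        return self.names[atom]
```

In this model, MaZe's point about lost decrements shows up directly: if a `put` never happens, the count stays too high and the atom is merely kept alive longer than needed, whereas a lost increment could free a name that is still in use.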
06:49 < MaZe> hey 06:49 < tim_dimm> wassap? 06:49 < flips> well 06:50 < tim_dimm> they're doing good 06:50 < flips> we just did episode 2 of tux3 university 06:50 < flips> how'd I do, maze? 06:50 < tim_dimm> ah, missed it! 06:50 < flips> keeping your interest, hit the right level? 06:50 < MaZe> ok, though I think the first was more action packed 06:50 < flips> enough swear words? too many? 06:50 < MaZe> hehe 06:50 < flips> well I can easily pick up the pace 06:50 < flips> it's just that, where we were is where the tux3 kernel port will actually start 06:51 < MaZe> I wish there was a: these are your primitives, this is how they function, know this and C and data structures and you don't need to know anything else linux specific