Re: [Charon-dev] Re: VFS not completely factored, and more

From: Michael W Zappe (zapman@interlan.net)
Date: Fri Jul 07 2000 - 10:08:12 EST

Next message: Hiu Fung Eric Tse: "Re: Printing out kernel message"
Previous message: Gary Lawrence Murphy: "Re: linux-kernel-documentation, proposal"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sat, Jul 01, 2000 at 02:14:45PM -0400, Alexander Viro wrote:
>
>
> [reformatted]
> On Mon, 26 Jun 2000, Michael W Zappe wrote:
>
> [snip]
> > filesystem, CXFS. (originally named Charon, but we discovered two
> > companies warring over the trademark, and didn't want to touch that with
> > a 40 foot pole... ;-)
>
> Heh. XFS folks mentioned clustered variant of their puppy. Yup, also
> CXFS...
>

Damnit! I had somehow missed the announcements when searching for it. Argh, time for yet another name change. ;-)

> > But in terms of the VFS problems, I have to say that your idea is a good
> > one, since it keeps the data local to the inode! The only issue i'd
> >have with it is the way you do the define. I think a much cleaner
> >solution would be to have each fs define a structure
> >
> > struct ext2_inode
> > {
> > struct inode ei_inode;
> > struct ext2_inode_info ei_ext2_info;
> > };
> >
> > And just clean up all of the other filesystems to use a more generic
> >mechanism to cast it, such as:
> >
> > #define EXT2_INODE_INFO(inode) (((struct ext2_inode *)(inode))->ei_ext2_info)
>
> Umm... I don't like how it sounds. Basically, it means that iget() goes
> to hell - that way allocation has to be done in fs. May be a good thing,
> but we are not there yet.
>

What do you mean by "we are not there yet?" I thought the point of this exercise was to get us there! ;-)

> > This would also mean that if anything changed, we have an inexpensive
> > (non-existent at runtime) level of abstraction which would allow for a
> > simple change. I also use a similar mechanism, and it works great for
> > portability. I'd definately be willing to help clean up the FS code to
> > use the new interface.
>
> Could you describe that new interface in some details? I don't see how you
> deal with knfsd and with allocation/freeing policy. Another thing to watch
> for: init_once()-type animals.
>

Allocation and freeing can be done in the read_inode, put_inode pairs. The extra data that a filesystem needs is persistent for the lifetime of the inode in the icache, unless a filesystem wished to do otherwise.

What do you mean exactly by an init_once type animal? If anything, read_inode() certainly seems to fall under that category.

> > I also wouldn't mind adding changing the iget code to a new iget2 which
> > takes an opaque, fs dependent parameter (void *). We would also split
>
> Not funny. First of all, you have to deal with callers in knfsd. If we
> kill them (fh<->dentry patches) there is no reason to keep iget() and
> ->read_inode() at all. Otherwise... Where the heck will knfsd get your
> void * cookie?
>

Good point about knfsd. I never really looked at it or compatability issues with my FS. What a horrible idea from the point of view of keeping a good layer of abstraction. The easiest answer is that knfsd passes NULL, and if a filesystem requires the additional cookie, then it doesn't work with knfsd. We should probably have a flag (FS_REQUIRES_COOKIE -- the cookie monster flag? ;-) in the fs_flags so it doesn't try to export it. Wow, talk about spagetti code...

So i guess the goal of the changes was to make NFS work super-fast, at the expense of extensibility? (Not meant sarcastically, it's a real question.)

> > the functions into
> >
> > void read_inode(struct inode *, void *);
> > void get_inode(struct inode *, void *);
> >
> > Read inode is called only on an uninitialized inode structure, and
> > get_inode whenever iget pulls an inode out of the cache. This way, if
> > the FS isn't concerned with additional reads after the first, the
> > pointer can just be NULL and we optimize out the call.
>
> ??? That's something new - which filesystems need to do reread upon the
> icache hit?
>

Not neccicarily a re-read, but possibly a transaction context, or to do a revalidation if the filesystem is acting as a cache, and generally let the fs know that it has one more refferer.

> > Also, i wouldn't mind changing i_ino to be a __u64. Anyone see any
> > major problems with this? We'd have to change the hashing code, (which
> > wouldn't be a bad idea anyway) but i think most everything else will
> > convert fairly cleanly. I have to look at this change a little more
> > closely, but it seems to be a good idea.
>
> No problem, except that you are getting an overhead on comparisons in the
> cache. My gut feeling is that it's not going to be critical, but there's
> only one way to figure out... Keep in mind that x86 is register-starved,
> so anything 64-bit is a strain on compiler. I suspect that it may turn
> into worse overhead than anything coming from separate allocation of
> per-fs part.

Not really. Yes, the x86 is register starved, but my experience with using 64 bits in the kernel is that the overhead isn't too bad as long as you stick to adds and shifts, which is all most fileystems need anyway. Is it also the case that someone has gotten MMX working in the kernel (i believe in the RAID checksumming/parity code)? Using that, we could get '1/2 cycle' 64 bit math + 8 extra registers on the Pentium MMX and on. Not to mention, the benifits of having truly 64-bit filesystems supported without big kludges would be nice for moving linux into the HA/Data Center area. The only problem is that the code would probably have __add_u64(a, b) rather than a+b, and have to add prefix and suffix code around alot of the icache functions. (Or beware the great no emms penalty...) Is it also generally assumend by the compiler that any function call can trash FPU state, yes? (The only problem with this that i can see...)

The dissapearance of the BITS_PER_LONG < 64 ifdef's in my code would be nice also... ;-)

Mike

----------------------------------------------
Michael Zappe <zapman@interlan.net>

Chief Architect, Filesystems
Interlan Communications
111 Corning Drive
Cary, NC 27511

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

Next message: Hiu Fung Eric Tse: "Re: Printing out kernel message"
Previous message: Gary Lawrence Murphy: "Re: linux-kernel-documentation, proposal"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This archive was generated by hypermail 2b29 : Sun Jul 23 2000 - 21:00:10 EST