Windows NT File Systems and Linux File Systems

Jeff Merkey (jmerkey@timpanogas.com)
Tue, 22 Jun 1999 12:15:13 -0600


Ingo,

I've reviewed the code sections you are refering to. This is along the same
lines to what Windows NT is doing, but not identical. From what I've seen,
this could really help Linux. The next step would be to create file runs
for SMP read/write. The I/O model of NT allows these runs to be transacted
in parallel with the I/O subsystem for SMP and "remembers" them. For a good
idea about what NT is looking for, check out NWFILE.C in our code base at
207.109.151.240. There is a function called TrgLookupFileAllocation in
the NT specific section. By way of example, Linux uses the NWReadFile
function while Windows NT uses the TrgLookupFileAllocation function. MS
actually stores the file allocations in the VM Cache, and then transacts
them by completely bypassing the file system driver. In essence, the file
system driver is reduced to simply being an on-disk decode agent for the VM
Cache Manager. In the Windows NT Fenris, only certain requests ever make it
into an IORP request. Files are read and written directly by the VM Cache
Manager and the File System driver only maintains meta-data fior the file
system (FAT's, Dirs, etc.). This is how NTFS, FASTFAT, CDFS, and FENRIS are
all structured on Windows NT. WARNING!!!! There are some very complex
unwind cases associated with the methods used by MS in their file systems.
On SMP, however, their model is unequalled from structly a parallelism
standpoint.

The next step is to build something like MCB lists that will allow your file
cache to bypass the file system drivers. There are some complex concurrency
issues here. NT deals with them, in rather complex ways (the MCB lists in
essence become the file allocation storage method rather than the
proprietary methods each file system driver uses). The advantages here are
intutitive. having the file systems architected this way allows pretty
incredible parallelism on SMP because multiple requesters can have their
requests transacted in parallel by the cache manager because it KNOWS ALL
THE FILE RUN INFO THE EACH DEVICE'S FILES WTHOUT NEEDING TO KNOW OR CARE
ABOUT HOW EACH FILE SYSTEM STORES META DATA FOR A FILE.

What would really make this easy to understand from a Linux standpoint were
if I open sourced the Windows NT portions of FENRIS that were removed so you
could see the differences between FENRIS NT and FENRIS Linux. I will ask MS
if they have no objections (MS worked with us on this, and they have some
ownership rights of the Windows NT portions of FENRIS not released). I do
not think they would object, but I will ask. I think FENRIS is probably
oneof the first file systems that has been single-sourced between NT and
Linux.

Jeff

----- Original Message -----
From: Ingo Molnar <mingo@chiara.csoma.elte.hu>
To: Jeff Merkey <jmerkey@timpanogas.com>
Sent: Monday, June 21, 1999 1:25 PM
Subject: Re: Fw: FENRIS (nwfs) Source Code Now Available.

>
> i think you should compare the 2.3.7 kernel, that has the latest and
> greatest pagecache enhancements. mm/filemap.c is the central point.
>
> Ingo
>
> On Mon, 21 Jun 1999, Jeff Merkey wrote:
>
> > Ingo,
> >
> > I'll look at 2.2 and compare it. Right now I'm working on a code drop
for
> > today. I apologize if I cannot jump right on this for you, but I am
trying
> > to get some stuff Alan Cox asked me to check in finished. As soon as
this
> > is finished then I'll go through the code. Which files is this in in
2.2?
> >
> > Thanks
> >
> > Jeff
> >
> >
> >
> > ----- Original Message -----
> > From: Ingo Molnar <mingo@chiara.csoma.elte.hu>
> > To: Jeff Merkey <jmerkey@timpanogas.com>
> > Cc: <linux-kernel@vger.rutgers.edu>
> > Sent: Monday, June 21, 1999 11:15 AM
> > Subject: Re: Fw: FENRIS (nwfs) Source Code Now Available.
> >
> >
> > >
> > > On Mon, 21 Jun 1999, Jeff Merkey wrote:
> > >
> > > > > [...] Also, Windows NT has some pretty nifty stuff
> > > > > in it's file system architecture for SMP. The way the VM Cache
> > manager is
> > > > > implemented in Windows NT is nothing short of genius. By caching
MCB
> > > > > allocation run information (NT reads and writes files by
constructing
> > MCB
> > > > > lists in memory, which are in-memory descriptors of a file's
> > contiguous
> > > > > sector runs), this gives NT the ability to perform some fairly
> > > > > sophisticated read-ahead and cache predication operations for
random
> > > > > access as well as sequential access files. This architecture is
also
> > > > > superior for SMP operations. MCB lists can basically be sent in a
> > > > > parallel fashion to the IO subsystem and transacted in parallel.
[...]
> > >
> > > would you mind comparing it to what Linux 2.3.7's page cache does?
That
> > > one has pagecache design pretty close to 'genius' as well, and not
only on
> > > paper :) On a 4-ways Xeon we scale read performance ~380%, which is
> > > basically linear:
> > >
> > > single-process reads done on the same 60k file:
> > >
> > > [root@moon fileben]# ./fileben ./FILE 60000 1 1 1
> > > READ: 273.76 MB/sec
> > >
> > > 4-process read()s done on the same 60k file:
> > >
> > > [root@moon fileben]# ./fileben ./FILE 60000 4 1 1
> > > READ: 258.16 MB/sec
> > > READ: 258.09 MB/sec
> > > READ: 258.04 MB/sec
> > > READ: 258.20 MB/sec
> > >
> > > part of the rather dramatic speedup is also a new mechanizm: the new
> > > pagecache now also caches 'filesystem physical block addresses', which
> > > allows us to skip a costy lookup of filesystem metadata for every
block of
> > > file read() or written.
> > >
> > > -- mingo
> > >
> > >
> >
> >
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in
> > the body of a message to majordomo@vger.rutgers.edu
> > Please read the FAQ at http://www.tux.org/lkml/
> >
>
>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/