Re: XFS and journalling filesystems

Larry McVoy (lm@bitmover.com)
Mon, 24 May 1999 11:24:35 -0600


I'm in 100% agreement with Ted's general line of reasoning, but there is
one area that could use some clarification:

: (And there are certain features of XFS, such as the features that allow
: Irix to tell the disk controller to send disk blocks directly to the
: ethernet controller, which then slaps on the TCP header and calculates
: the TCP checksum without the disk data ever hitting memory

This is not quite right; in fact, it is a little unfair to IRIX. There is
no interaction between the file system and/or the block device system
and the networking stack. The way it works is this (I know this code
extremely well, since I'm the guy who originally made NFS use both the
networking and the file system to go at 94MByte/sec - Ethan Solomita is
the guy who made it go at 640MByte/sec over Super HIPPI):

The short summary is that you DMA from disks to user VM and then from
user VM to networking (or the other way). But user VM is the currency.

The longer summary is that in order to get the file system to go fast,
you open up files with O_DIRECT, which tells the file system to
lock down the user pages and DMA directly to/from them, bypassing the
buffer cache completely (there is interaction with the VM layer here,
but it comes in two parts: the locking-down part and the invalidating
part; the latter is so that if there was unflushed data in the buffer
cache, it got flushed out before the direct I/O occurred).
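
To make that concrete, here is a minimal sketch of the O_DIRECT pattern
(hypothetical example, error handling trimmed; the 4KB alignment is an
assumption - the real constraints come from the driver and filesystem,
and on IRIX you could query them with fcntl(F_DIOINFO), if memory serves):

    /* Hypothetical sketch: DMA straight into locked-down user pages,
     * bypassing the buffer cache. */
    #define _GNU_SOURCE                  /* O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const size_t chunk = 1 << 20;    /* big I/Os keep the disks streaming */
        void *buf;
        ssize_t n;
        int fd;

        if (argc < 2)
            return 1;

        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Buffer, offset and length all have to be suitably aligned;
         * 4KB is an assumption, the real numbers come from the device. */
        if (posix_memalign(&buf, 4096, chunk))
            return 1;

        while ((n = read(fd, buf, chunk)) > 0)
            ;                            /* consume buf here */
        if (n < 0)
            perror("read");

        free(buf);
        close(fd);
        return 0;
    }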

For the networking part, that works by page flipping on receive and COW
pages on send. Again, the currency is user VM. So if you were going
disk -> network, then the pages would be in your VM and you would do a
write(sock, buf, some_big_size). The socket layer would get all the way
down to sosend() and decide that this wad of data is a good candidate
for VM tricks. It calls out to the VM layer and asks that these pages
be marked COW. There was a lot of discussion about whether it would be
smart to set up the COW fault handler to sleep the faulting process until
the data had moved out - then naive processes would get slept and smart
processes - those which flip-flopped between two buffers - would stream.
This optimization was never done.
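
The flip-flop part is just double buffering: alternate between two
buffers so the pages you handed to the previous write() - which may
still be marked COW while the driver drains them - are never the ones
you are refilling from disk. A rough sketch (names, sizes and the 4KB
alignment are made up for illustration; the real VM tricks want
page-aligned, page-multiple buffers):

    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)

    ssize_t stream_file_to_socket(int disk_fd, int sock_fd)
    {
        void *a, *b;
        ssize_t n, total = 0;
        int which = 0;

        if (posix_memalign(&a, 4096, CHUNK) ||
            posix_memalign(&b, 4096, CHUNK))
            return -1;

        char *bufs[2] = { a, b };

        while ((n = read(disk_fd, bufs[which], CHUNK)) > 0) {
            if (write(sock_fd, bufs[which], n) != n) {
                n = -1;
                break;
            }
            total += n;
            which ^= 1;          /* flip to the other buffer for the next read */
        }

        free(a);
        free(b);
        return (n < 0) ? -1 : total;
    }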

If you were going network -> disk, then the data would be coming in off
the wire. If the pages were nicely aligned - which can be arranged by
doing something Vernon Schryver calls "tail aligning": you line up the
end of the message with the end of a page, so that if the message is
page sized it is also page aligned - then you could work them up the
stack, and when they hit the top of the stack you "page flip" them into
user space.
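
The tail-aligning arithmetic itself is trivial: pick the offset inside
the first page so the last byte of the message lands on a page boundary.
Something like this (page size hardwired only for illustration):

    #include <stddef.h>

    #define PAGE 4096UL

    /* Offset at which to start DMAing a message of msg_len bytes so
     * that its last byte ends a page; a page-sized message then starts
     * page aligned as well. */
    static size_t tail_align_offset(size_t msg_len)
    {
        return (PAGE - (msg_len % PAGE)) % PAGE;
    }

    /* e.g. 4096 bytes -> offset 0 (page aligned);
     *      1000 bytes -> offset 3096, ending at byte 4095. */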

So while it is true there are VM interactions, they are all pretty much
in the networking stack.

As a result of my experience with this stuff, I got pretty disgusted with
the "design" even though the results were quite good. The obvious problem
was that the "currency" was user virtual memory, not physical memory. So
I wrote up (or started to write up) a design based on physical memory,
which I called splice. There is a short paper about it on my ftp site
somewhere. Stephen Tweedie is currently implementing splice() semantics
for Linux (I'm very happy with his design, as usual - sometimes I think
that guy will solve world peace next but when I asked about that, he said
that was a user space problem :-)
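
For reference, the splice(2) interface that eventually landed in Linux
(years later, and not the same design being discussed here) uses a pipe
as the in-kernel staging buffer; a rough, untested sketch of
file -> socket with it looks like this:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    ssize_t splice_file_to_socket(int file_fd, int sock_fd, size_t len)
    {
        int p[2];
        ssize_t total = 0;

        if (pipe(p) < 0)
            return -1;

        while (len > 0) {
            /* file -> pipe */
            ssize_t in = splice(file_fd, NULL, p[1], NULL, len,
                                SPLICE_F_MOVE | SPLICE_F_MORE);
            if (in <= 0)
                break;
            /* pipe -> socket, draining everything just pushed in */
            while (in > 0) {
                ssize_t out = splice(p[0], NULL, sock_fd, NULL, in,
                                     SPLICE_F_MOVE | SPLICE_F_MORE);
                if (out <= 0) {
                    close(p[0]);
                    close(p[1]);
                    return -1;
                }
                in -= out;
                len -= out;
                total += out;
            }
        }

        close(p[0]);
        close(p[1]);
        return total;
    }

The payload never visits user memory, which was exactly the point of
making physical memory the currency.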

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/