Re: Linux 2.6.17-rc2

From: Linus Torvalds
Date: Wed Apr 19 2006 - 17:50:01 EST

Next message: Matt Mackall: "Re: sata suspend resume ..."
Previous message: Yuichi Nakamura: "Re: [RESEND][RFC][PATCH 2/7] implementation of LSM hooks"
In reply to: Trond Myklebust: "Re: Linux 2.6.17-rc2"
Next in thread: Peter Naulls: "Re: Linux 2.6.17-rc2"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, 19 Apr 2006, Trond Myklebust wrote:
>
> Any chance this could be adapted to work with all those DMA (and RDMA)
> engines that litter our motherboards? I'm thinking in particular of
> stuff like the drm drivers, and userspace rdma.

Absolutely. Especially with "vmsplice()" (the not-yet-implemented "move
these user pages into a kernel buffer") it should be entirely possible to
set up an efficient zero-copy setup that does NOT have any of the problems
with aio and TLB shootdown etc.

Note that a driver would have to support the splice_in() and splice_out()
interfaces (which are basically just given the pipe buffers to do with as
they wish), and perhaps more importantly: note that you need specialized
apps that actually use splice() to do this.

That's the biggest downside by far, and is why I'm not 100% convinced
splice() usage will be all that wide-spread. If you look at sendfile(),
it's been available for a long time, and is actually even almost portable
across different OS's _and_ it is easy to use. But almost nobody actually
does. I suspect the only users are some apache mods, perhaps a ftp deamon
or two, and probably samba. And that's probably largely it.

There's a _huge_ downside to specialized interfaces. Admittedly, splice()
is a lot less specialized (ie it works in a much wider variety of loads),
but it's still very much a "corner-case" thing. You can always do the same
thing splice() does with a read/write pair instead, and be portable.

Also, the genericity of splice() does come at the cost of complexity. For
example, to do a zero-copy from a user space buffer to some RDMA network
interface, you'd have to basically keep track of _two_ buffers:

- keep track of how much of the user space buffer you have moved into
kernel space with "vmsplice()" (or, for that matter, with any other
source of data for the buffer - it might be a file, it might be another
socket, whatever. I say "vmsplice()", but that's just an example for
when you have the data in user space).

The kernel space buffer is - for obvious reasons - size limited in the
way a user-space buffer is not. People are used to doing megabytes of
buffers in user space. The splice buffer, in comparison, is maybe a few
hundred kB at most. For some apps, that's "inifinity". For others, it's
just a few tens of pages of data.

- keep track of how much of the kernel space buffer you have moved to the
RDMA network interface with "splice()".

The splice buffer _is_ another buffer, and you have to feed the data
from that buffer to the RDMA device manually.

In many usage schenarios, this means that you end up having the normal
kind of poll/select loop. Now, that's nothing new: people are used to
them, but people still hate them, and it just means that very few
environments are going to spend the effort on another buffering setup.

So the upside of splice() is that it really can do some things very
efficiently, by "copying" data with just a simple reference counted
pointer. But the downside is that it makes for another level of buffering,
and behind an interface that is in kernel space (for obvious reasons),
which means that it's somewhat harder to wrap your hands and head around
than just a regular user-space buffer.

So I'd expect this to be most useful for perhaps things like some HPC
apps, where you can have specialized libraries for data communication. And
servers, of course (but they might just continue to use the old
"sendfile()" interface, without even knowing that it's not sendfile() any
more, but just a wrapper around splice()).

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Matt Mackall: "Re: sata suspend resume ..."
Previous message: Yuichi Nakamura: "Re: [RESEND][RFC][PATCH 2/7] implementation of LSM hooks"
In reply to: Trond Myklebust: "Re: Linux 2.6.17-rc2"
Next in thread: Peter Naulls: "Re: Linux 2.6.17-rc2"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]