Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
From: Linus Torvalds
Date: Mon Nov 26 2007 - 21:15:55 EST
On Mon, 26 Nov 2007, H. Peter Anvin wrote:
>
> I'm presuming you're not talking about some sort of syslets/fibrils/threadlets
> here (executing an interpreted thread of execution in kernel space). That's a
> whole separate ball of wax.
Indeed.
I'm hoping that just dies. It's too complex. But the "do this single
system call asynchronously" isn't, and has lots of historical
implementations, ranging from VMS to the braindead POSIX "aio" setup.
I do think that more complex threadlets could be useful in theory, I just
doubt they'd be used in practice.
> > So the choice is basically one of:
> >
> > - come up with a totally new interface to system calls, and effectively
> > duplicate the whole system call table.
> >
> > I'd hate to do this. We already have duplicated system call tables due
> > to compat stuff, it's painful.
>
> This would be the right thing to do if we were to redesign the system call
> interface from the ground up, which it doesn't exactly sound like we are
> intending.
Yeah. I'm also not sure it's the right thing even if we did redesign from
scratch.
The current system call interface may look less than regular, but it has
some very solid foundation: it's fast. Passing arguments in registers is
by definition a lot faster *and* *safer* than passing them any other way.
There are no subtle security issues with people playing games with the
argument base pointer (ie usually the stack pointer) and trying to fool
the kernel into accessing kernel memory etc.
Immediately when you do anything but registers, it is much *much* more
costly. The "get_user()" and "copy_from_user()" stuff is not exactly slow,
but it's quite noticeable overhead for simple system calls. It gets worse
if this all is described by some indirect table setup.
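To make the register-versus-memory difference concrete, here is a minimal
user-space sketch, not kernel code. Every name in it (demo_user_mem,
demo_copy_from_user, demo_add_regs, demo_add_block) is hypothetical and
only stands in for the real thing. It assumes that a "user" pointer is
valid only if it falls inside one fixed buffer, which is far simpler than
what get_user()/copy_from_user() actually have to check.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the user-accessible part of the address space. */
long demo_user_mem[512];

struct demo_two_args {
    long a;
    long b;
};

/* Register-style call: the values arrive directly, so there is no pointer
 * to validate and nothing to copy. */
long demo_add_regs(long a, long b)
{
    return a + b;
}

/* Simulated copy_from_user(): refuse any range that is not fully inside
 * demo_user_mem, then copy. Returns 0 on success, -1 on a bad range. */
int demo_copy_from_user(void *dst, const void *src, size_t len)
{
    uintptr_t start = (uintptr_t)src;
    uintptr_t base = (uintptr_t)demo_user_mem;

    if (len > sizeof demo_user_mem)
        return -1;
    if (start < base || start - base > sizeof demo_user_mem - len)
        return -1;
    memcpy(dst, src, len);
    return 0;
}

/* Memory-style call: the arguments live behind a pointer, so they must be
 * range-checked and copied into a private struct before being used. */
int demo_add_block(const struct demo_two_args *uargs, long *result)
{
    struct demo_two_args kargs;

    if (demo_copy_from_user(&kargs, uargs, sizeof kargs) != 0)
        return -1;
    *result = kargs.a + kargs.b;
    return 0;
}
```

The memory-style path needs a range check, a copy and an error return
before it can do the same addition, and a pointer that lies outside the
permitted region has to be rejected rather than dereferenced.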
In the system call path, right now, for some system calls, the biggest two
overheads are
 - the CPU system call overhead itself. We can't do much about this, but
   the CPU designers do seem to be slowly getting it fixed (ie it's slower
   than it should need to be, but it's a hell of a lot faster than a P4
   used to be ;)

 - the cost of just the single indirect - and unpredictable - call.
(The second cost is actually often totally hidden in the trivial system
call benchmarks people run: if the benchmark just does "getppid()" a
million times in a tight loop, the indirect call on the system call number
seems really quite fast, but outside of benchmarks it is generally totally
unpredictable indeed, and a real cost for real-life system call usage).
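For reference, the kind of trivial benchmark described above can be
sketched as follows. This is an assumption-laden illustration, not a
measurement tool: demo_time_getppid is a hypothetical helper, it assumes
Linux with glibc's syscall() wrapper, and it reports wall-clock time per
call, which includes the loop and wrapper overhead.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average wall-clock nanoseconds per getppid() system call, measured over
 * `iterations` back-to-back calls. Returns 0.0 if iterations <= 0. */
double demo_time_getppid(long iterations)
{
    struct timespec start, end;
    double elapsed_ns;
    long i;

    if (iterations <= 0)
        return 0.0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < iterations; i++)
        syscall(SYS_getppid);   /* same system call number every time */
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
               + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Calling demo_time_getppid(1000000) from a main() and printing the result
gives the tight-loop number. Because the loop issues the same system call
number on every iteration, the indirect call is the best case for the
branch predictor, so the figure says little about a workload that mixes
many different system calls.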
Everything else in the system call path is generally as fast as we can
make it. Doing more indirection and conditionals would be really quite
nasty.
Of course, for *most* system calls, the work the kernel actually does
ends up being so big that it doesn't much matter, but I was literally
chasing down why a page fault had slowed down by ~70 cycles two weeks ago.
And it doesn't take more than a couple of unpredictable jumps to do things
like that!
> The 6-word limit is a red herring. There are at least two ways to deal with it
> (and this doesn't mean wiping the legacy stuff we already have):
>
> - Let each architecture pick a calling convention and redefine the
> architecture-independent bits to take an arbitrary number of arguments. This
> is a one-time panarchitectural change.
Not applicable on x86-32.
The six-word limit is effectively a hardware limit there. Once it goes
past that limit, one of the words needs to be a pointer to extended
information that is fundamentally slower to access. Happily, only very
rare system calls do that (and none of them are of the simple variety
where a few cycles are easily noticeable).
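As an illustration of "one of the words needs to be a pointer to extended
information", here is a small sketch of a call that logically takes seven
arguments but only has six words to pass them in. The names
(demo_extra_args, demo_sum7_entry, demo_sum7) are hypothetical and the
code runs entirely in user space; in a real kernel the overflow block
would have to be fetched with copy_from_user() rather than dereferenced
directly.

```c
#include <stddef.h>

/* Hypothetical overflow block for the arguments that do not fit in the
 * six available words. */
struct demo_extra_args {
    long arg5;
    long arg6;
};

/* Six-word entry point: five plain values plus one pointer to the rest.
 * Returns -1 if the overflow pointer is missing. */
long demo_sum7_entry(long a0, long a1, long a2, long a3, long a4,
                     const struct demo_extra_args *extra)
{
    if (extra == NULL)
        return -1;
    /* The last two arguments cost an extra memory access each. */
    return a0 + a1 + a2 + a3 + a4 + extra->arg5 + extra->arg6;
}

/* Caller-side wrapper: packs arguments 5 and 6 into the overflow block
 * and passes its address as the sixth word. */
long demo_sum7(long a0, long a1, long a2, long a3, long a4,
               long a5, long a6)
{
    struct demo_extra_args extra = { a5, a6 };

    return demo_sum7_entry(a0, a1, a2, a3, a4, &extra);
}
```

The first five values travel the fast way; only the tail pays for the
indirection, which matches the point that this is tolerable as long as
the system calls needing it are rare and not of the trivially cheap kind.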
On other architectures, we could more easily just use more registers. But
x86-32 is still the bulk of what matters for most people.
Linus