Re: kernel thread support - LWP's

Jamie Lokier (lkd@tantalophile.demon.co.uk)
Fri, 16 Jul 1999 01:25:51 +0200


Larry McVoy wrote:
> : well, context switches are painful as is any kernel crossing in high
> : performance computing. imagine user level networking on high speed
> : connections that can have round trip times in the ~50us range (this is
> : a software implementation in our lab, SGI's GSN is committed to round
> : trip times of around 7us roundtrip hardware latency), if you
>
> I've (a) spent a great deal of time thinking about this very issue, and
> (b) worked on GSN at SGI, and (c) am under contract with LLNL working
> on exactly this issue, amongst others. I'm pretty in tune with the
> problem space and I don't see that it has any bearing on the discussion
> at all. If you are going to context switch for each packet, you can
> kiss your performance good bye whether you are context switching threads
> or processes. Neither are fast enough to hit the needed 10 usec round
> trip time that all the HPC folks like LLNL want.

Hi Larry, there is someone in our group at CERN also working on
user-level threads. His measurements (benchmarks that fit in the L1
cache, of course) show 0.05 microseconds for a context switch in user
space.

Now you can say that a real app will swamp this with cache misses. But
when everything fits in the cache, ~2-3 microseconds for a kernel
switch vs. 0.05 microseconds in user space is a pretty severe
difference.
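
To make that concrete, here is a minimal sketch of a user-level switch
(not the CERN code!) using POSIX ucontext. Note that glibc's
swapcontext() makes a sigprocmask syscall, so a real sub-microsecond
switcher would use a few hand-written instructions instead; this only
shows the structure:

/* Illustrative only -- not the CERN implementation. */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, thread_ctx;
static char thread_stack[64 * 1024];

static void thread_fn(void)
{
	printf("in user-level thread\n");
	swapcontext(&thread_ctx, &main_ctx);	/* yield: no kernel scheduler involved */
	printf("thread resumed\n");
}

int main(void)
{
	getcontext(&thread_ctx);
	thread_ctx.uc_stack.ss_sp = thread_stack;
	thread_ctx.uc_stack.ss_size = sizeof thread_stack;
	thread_ctx.uc_link = &main_ctx;		/* where to go when thread_fn returns */
	makecontext(&thread_ctx, thread_fn, 0);

	swapcontext(&main_ctx, &thread_ctx);	/* first switch into the thread */
	printf("back in main\n");
	swapcontext(&main_ctx, &thread_ctx);	/* resume it where it yielded */
	return 0;
}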

BTW these context switches are triggered by the arrival of network
packets, using polling rather than interrupt-driven code. It's a bit
unusual, but it's really just a special case of an event being detected
in user space rather than delivered by the kernel. "Normal" threaded
apps have threads sending each other messages, which is very similar.
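
To show the shape I mean, here is a hedged sketch of that receive path
(the ring layout and wake_user_thread() are hypothetical, standing in
for whatever the real system uses):

#include <stddef.h>

#define RING_SIZE 256

struct rx_slot {
	volatile unsigned ready;	/* set by the NIC via DMA */
	unsigned len;
	char payload[2048];
};

/* Hypothetical: a memory-mapped NIC receive ring and a hook into the
 * user-level scheduler.  Neither is a real API. */
extern struct rx_slot rx_ring[RING_SIZE];
extern void wake_user_thread(const char *pkt, unsigned len);

void poll_loop(void)
{
	size_t next = 0;

	for (;;) {
		struct rx_slot *slot = &rx_ring[next];

		if (!slot->ready)
			continue;	/* spin: no interrupt, no syscall */

		/* Hand the packet to a user-level thread; the switch to
		 * it is the cheap user-space switch discussed above. */
		wake_user_thread(slot->payload, slot->len);

		slot->ready = 0;	/* give the slot back to the NIC */
		next = (next + 1) % RING_SIZE;
	}
}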

> Agreed with the first part, couldn't agree with the second part - it ain't
> happening - the context switches will be kernel level context switches
> whether they are "threads" or "processes" since the event generated
> is a kernel level event.

Now you're generalising... the system here responds to events entirely
in user space.

> Yeah, you can deliver the packet into user space directly, but have
> fun getting the kernel to tell your user level scheduler to run a new
> thread. Sure it can be done, and has been done, but an old quote of
> mine is "Architect: someone who knows the difference between what
> could be done and what should be done". My architect hat says this is
> not "a should be done", your view may be different.

I kinda agree that polling a device from code generated by a modified
compiler does not look like the right way at first... but this model is
the only one I know of where an Intel box can saturate a Gigabit
Ethernet link in both directions at once, with 6% CPU load and a
consistently <50 microsecond response latency (min. 25 microseconds).
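
As a guess at what the modified compiler does -- their actual scheme may
well differ -- think of it planting a cheap poll at loop back-edges, so
long computations still notice packets within a bounded number of
iterations:

/* Hypothetical compiler output: rx_pending and dispatch_packet() are
 * stand-ins, not a real interface. */
extern volatile unsigned rx_pending;	/* set when a packet lands */
extern void dispatch_packet(void);
extern void do_work(int i);

void compute(int n)
{
	int i;

	for (i = 0; i < n; i++) {
		do_work(i);
		if (rx_pending)		/* poll point injected by the compiler */
			dispatch_packet();
	}
}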

I'm not advertising it, as it's not my work. I'm just observing that,
as far as I know, no other model comes close to this performance.
Undoubtedly the hardware will get better in time and make up the
difference (deferred interrupts, lower interrupt latency, faster
kernel/user switches, etc.).

> Caches aren't infinite in size. Yeah, it's true that benchmarks fit
> nicely in the L1 cache so we see these nice 1-3 usec context switch
> numbers.

Or 0.05 usec. But I agree, real programs stress the caches. Unless the
new K7's large cache turns out to be large enough?

> And the truth is that context switch times are _not_ represented by
> the 2 process, 0 size case. That's a benchmark. In the real world,
> you switch to that context to do something and that is going to have a
> cost that can exceed the context switch by multiple orders of
> magnitude.

I agree -- now I only wish the benchmark figures here weren't raising
quite so many eyebrows and encouraging the work...

-- Jamie
