Re: kernel thread support - LWP's

Charles K Hardin (chardin+@andrew.cmu.edu)
Thu, 15 Jul 1999 15:26:41 -0400 (EDT)


Excerpts from mail.linux: 15-Jul-99 Re: kernel thread support -.. by
Larry McVoy@bitmover.com
>
>Yes. It's a much nicer model. If you've worked in both, the Linux (aka
>Plan 9) model is trivial - you already know how to do everything. The only
>new parts are the parts that are actually new - such as locking to prevent
>data modification races. In the LWP world, you now get to learn about
>two kinds of processes, you have to invent new tools to monitor, debug,
>and administer these processes, you have to learn about thread scheduling
>versus process scheduling, thread signals versus process signals, etc.
>

Depends on the definition of nicer. clone() is less complicated to implement,
right up until you start dealing with signal delivery to threads and with
building an M-to-N mapping thread library on top of it.
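
For reference, the clone() model from user space looks roughly like this.
This is a minimal sketch, not lifted from any real thread library, and the
flag combination is just one plausible choice:

--------------------------- clone sketch --------------------------------
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static int counter;                     /* shared because of CLONE_VM */

static int
child_fn(void *arg)
{
    counter++;                          /* same address space as the parent */
    return 0;
}

int
main(void)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    int pid;

    if (stack == NULL)
        return 1;

    /* share VM, filesystem info, file descriptors and signal handlers;
       ask for SIGCHLD so a plain waitpid() can reap the child.
       The stack pointer assumes a downward-growing stack (x86). */
    pid = clone(child_fn, stack + stack_size,
                CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                NULL);
    if (pid < 0) {
        perror("clone");
        return 1;
    }

    waitpid(pid, NULL, 0);
    printf("counter = %d\n", counter);  /* prints 1: the write was shared */
    free(stack);
    return 0;
}
--------------------------------------------------------------------------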

>It's all a crock of doo doo that was deemed to be necessary because
>"context switches cost too much".
>

Well, context switches are painful, as is any kernel crossing, in high
performance computing. Imagine user-level networking on high-speed
connections with round trip times in the ~50us range (that is a software
implementation in our lab; SGI's GSN is committed to about 7us of
round-trip hardware latency). If you have several outstanding connections
and the kernel switch between the thread serving each one costs around
20us, it sucks to be you. Of course you could say that an event-loop
programming model would be much more appropriate for this kind of system,
but if you can get a thread model to handle this type of load, then do it.
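
To put numbers on that: with a ~50us round trip and, say, four connections
in flight, paying a ~20us kernel thread switch to service each one is on
the order of 80us of scheduling overhead per round, which already swamps
the wire latency. The event-loop alternative avoids the per-connection
switch and looks roughly like this (just an illustration, not code from our
implementation; handle_ready() here is a stand-in for the real
per-connection protocol code):

------------------------- event loop sketch ------------------------------
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* stand-in for the real per-connection protocol code; here it just
   drains the descriptor and quits on EOF or error */
static void
handle_ready(int fd)
{
    char buf[4096];

    if (read(fd, buf, sizeof(buf)) <= 0)
        exit(0);
}

/* one thread, one loop over all outstanding connections: no kernel
   thread switch per message, just a poll() and a dispatch */
static void
event_loop(struct pollfd *fds, int nfds)
{
    int i;

    for (;;) {
        if (poll(fds, nfds, -1) < 0) {
            perror("poll");
            return;
        }
        for (i = 0; i < nfds; i++) {
            if (fds[i].revents & POLLIN)
                handle_ready(fds[i].fd);
        }
    }
}

int
main(void)
{
    /* toy driver: treat stdin as the single "connection" */
    struct pollfd fds[1];

    fds[0].fd = STDIN_FILENO;
    fds[0].events = POLLIN;
    fds[0].revents = 0;
    event_loop(fds, 1);
    return 0;
}
--------------------------------------------------------------------------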

>First of all, it's extremely rare that anyone who says "context switches
>cost too much" actually knows what they cost, or knows if they are even
>within an order of magnitude of being a critical performance bottleneck.
>
>Second, for those rare cases where they actually do cost too much,
>that's only on crappy operating systems. The last time I checked, Linux
>process context switches were faster than Solaris LWP context switches.
>So much for that argument.
>
>Third, the context switch time on any half way decent OS quickly becomes
>dominated by cache misses caused by rebuilding the real process context.
>Context switch benchmarks typically (lmbench is a notable exception,
>shameless plug) do not measure anything but the OS context switch -
>i.e., how fast can I switch from one thread of execution to the next.
>If you get this code path small enough, you can fit the whole thing in
>the L1 cache like Linux does and get ~1 microsecond context switches.
>And the user level scheduler people just love to brag about numbers that
>are even better (though typically not much). But that's just benchmarking
>crap - people don't context switch to do nothing, they context switch to
>do some work. That work turns into cache misses. If you miss 5 or 6
>times, you're up to the Linux kernel level context. If you are talking
>about those wonderful userlevel schedulers, you are probably at the context
>switch time in 2-3 cache misses. Do you really think you are going to
>context switch in and do less than a half dozen cache misses before you
>go to the next thread? I doubt it. If I'm right, then the real context
>switch time is actually dominated by rebuilding your cache foot print,
>NOT by the context switch time: regardless of whether we are talking about
>kernel level, user level, whatever. So it's a marketing argument that
>the context switches are the issue, not a real argument.

So I am really confused by this: the sample program attached below, run
against both the MIT pthreads user-level package and the glibc
linuxthreads implementation on a 2.2.9 kernel, shows a rather bigger
difference than you suggest.

The cache-miss domination argument doesn't seem to make a lot of sense
unless you want to believe you are flushing the whole cache, which you
aren't, since the threads share the VM space.
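
Back of the envelope (my assumptions, not measurements): the ping-pong
below touches maybe a couple of dozen cache lines per handoff (the mutex,
the condition variable, a timestamp slot, a bit of stack and library
state). Even if every one of those missed all the way to main memory at
something like 150ns per miss on this box, that is only a few
microseconds, well short of the ~14-24us per handoff that linuxthreads
shows below.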

----------------------------- switch.c ---------------------------------
#include <pthread.h>
#include <stdio.h>

#define MHZ 400.0                       /* clock rate of the test box */

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

unsigned long long times[30];
enum {thread1, thread2} owner;

unsigned long long
cycle_counter(void)
{
    unsigned long low, high;

    __asm__ __volatile__("rdtsc;"
                         "movl %%eax, %0;"
                         "movl %%edx, %1"
                         : "=r" (low), "=r" (high)
                         : /* no inputs */
                         : "eax", "edx");
    return (((unsigned long long)high << 32) | (unsigned long long)low);
}

void *
pthread_routine(void *val)
{
    int i;

    pthread_mutex_lock(&lock);
    for (i = 0; i < 10; i++) {
        while (owner != thread2) {
            pthread_cond_wait(&cond, &lock);
        }
        times[i << 1] = cycle_counter();
        owner = thread1;
        pthread_cond_signal(&cond);
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

int
main(void)
{
    pthread_t thread;
    int i;

    owner = thread1;
    pthread_mutex_lock(&lock);

    pthread_create(&thread, NULL, pthread_routine, NULL);

    /* ping-pong ownership between the two threads, timestamping
       each handoff with the cycle counter */
    for (i = 0; i < 10; i++) {
        owner = thread2;
        pthread_cond_signal(&cond);
        while (owner != thread1) {
            pthread_cond_wait(&cond, &lock);
        }
        times[(i << 1) + 1] = cycle_counter();
    }
    pthread_mutex_unlock(&lock);

    pthread_join(thread, NULL);

    for (i = 0; i < 20; i++) {
        fprintf(stderr, "Switch: %#llx\n", times[i]);
    }

    {
        double avg = 0.0, val;

        // Don't show first or last. They are odd.
        for (i = 2; i < 19; i++) {
            val = ((double)(times[i] - times[i - 1])) / MHZ;
            fprintf(stderr, "Diff: %.2fus\n", val);
            avg += val;
        }
        avg /= 17.0;
        fprintf(stderr, "Average time: %.2fus\n", avg);
    }

    return 0;
}
------------------------------------------------------------------------
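
(For reference, the -lpthread numbers below come from building the program
roughly as 'gcc -O2 -Wall -o switch switch.c -lpthread'; for the
user-level package the same source was linked against that package's own
library instead, per its build instructions.)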

-------------------------- Results for -lpthread ----------------------
Diff: 16.32us
Diff: 25.56us
Diff: 14.60us
Diff: 24.50us
Diff: 14.91us
Diff: 23.82us
Diff: 14.09us
Diff: 23.96us
Diff: 14.37us
Diff: 23.93us
Diff: 14.12us
Diff: 24.16us
Diff: 14.24us
Diff: 23.92us
Diff: 14.12us
Diff: 24.14us
Diff: 14.24us
Average time: 19.12us
-------------------------------------------------------------------

----------------- Results for pmpthreads-1.8.9 --------------------
Diff: 3.76us
Diff: 3.57us
Diff: 3.56us
Diff: 3.46us
Diff: 3.48us
Diff: 3.46us
Diff: 3.44us
Diff: 3.43us
Diff: 3.43us
Diff: 3.46us
Diff: 3.44us
Diff: 3.43us
Diff: 3.44us
Diff: 3.46us
Diff: 3.44us
Diff: 3.43us
Diff: 3.44us
Average time: 3.48us
------------------------------------------------------------

Results from a dual PII 400

------------------

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/