I use sched_yield() instead. The effect you mention shouldn't happen,
though, since I use SCHED_FIFO. Once the low-priority thread returns
from the sched_yield() syscall, schedule() is called and that process
will never again get control of the CPU until the SCHED_FIFO processes
block or exit.
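For reference, the setup described above can be sketched as follows. This is a minimal illustration, not my actual test code; the priority value of 10 is an arbitrary choice, and sched_setscheduler() needs root privilege:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Switch the calling process to SCHED_FIFO (requires root), then
 * yield.  Under SCHED_FIFO, sched_yield() moves the caller to the
 * tail of the run queue for its priority; lower-priority processes
 * run only once all SCHED_FIFO processes block or exit.  Returns -1
 * if the scheduling policy could not be set. */
int fifo_yield(int priority)
{
    struct sched_param sp = { .sched_priority = priority };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
        perror("sched_setscheduler (needs root?)");
        return -1;
    }
    return sched_yield();
}
```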
Adding code to do explicit synchronisation with a pipe to ensure the
low-priority processes are running makes no difference.
> The issue is that Richard's variance is so high. It just shouldn't
> be that high. If you walk the code paths that have to happen for a
> process to call sched_yield(), there aren't enough cache misses
> likely to cause that much variance, and even if there were, the
> cache misses should stabilize to some small range. I'm virtually
> positive that he's not measuring the same number of events from run
> to run. One way to prove this would be to have his code eat
> /proc/stat (or whatever it is) before and after and spit out the
> differences. My guess is that for the the runs that don't vary, the
> differences won't vary and vice versa.
Nope, this isn't it. I've added code to check /proc/stat before and
after I do the benchmark, and I get *exactly* the number of context
switches I expect. This is what I expected, since I had a test right
from the start which counted the number of sched_yield() calls
performed in the reader. Obviously I knew how many sched_yield() calls
were being performed in the main loop. Everything checks out.
Even using a pipe to pass a token gives the same results.
Whatever the source of the variance in my measurements, I'm sure it's
not due to a variance in the number of context switches measured.
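The /proc/stat check amounts to something like the sketch below (my helper name is hypothetical; the "ctxt" field is the kernel's count of context switches since boot). Sampling it before and after a run gives the number of switches the run caused:

```c
#include <stdio.h>

/* Return the "ctxt" counter (context switches since boot) from
 * /proc/stat, or -1 on error.  The difference between two samples
 * taken around a benchmark is the number of context switches the
 * benchmark (plus any background activity) performed. */
long long read_ctxt(void)
{
    char line[256];
    long long ctxt = -1;
    FILE *fp = fopen("/proc/stat", "r");

    if (!fp)
        return -1;
    while (fgets(line, sizeof(line), fp)) {
        if (sscanf(line, "ctxt %lld", &ctxt) == 1)
            break;
    }
    fclose(fp);
    return ctxt;
}
```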
I've been looking deeper into the lmbench code, and one difference I
note is that lmbench uses floating point calculations a lot. In
particular, it uses them prior to and during the main benchmark. My
code should not do *any* floating point calculations until I print out
the final results.
I've also noted some variance in the lmbench results. When I launched
10 low-priority processes, I got one result from lmbench of 9.73
us. Later runs gave 7.12 us or similar. On another run I got 6.33 us.
On another set of lmbench runs (no low-priority processes), I got:
"size=0k ovr=6.23
2 5.12
2 3.97
2 3.96
2 3.94
yet a second later I got:
"size=0k ovr=6.23
2 5.12
2 5.12
2 5.13
2 4.35
so it's not hard to find a 30% variance in lmbench either.
Another datapoint is in the comparison between my test and lmbench
with 10 low-priority processes. lmbench then gives:
"size=0k ovr=6.23
2 7.44
2 6.00
2 5.97
2 5.98
so the per-process cost of extra processes on the run queue is up to
0.2 us or so. This is consistent with my own test code. What is
interesting is the discrepancy between the absolute times given by
lmbench (about 5 us) and my test (about 1 us), with no extra processes
on the run queue. My test yields the same absolute context switch time
irrespective of whether I use sched_yield() or passing tokens through
a pipe.
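The pipe token-passing variant is, in outline, the classic ping-pong measurement (a sketch under my assumptions, not lmbench's lat_ctx source: two pipes, one byte bounced back and forth, two context switches per round trip, with syscall overhead to be subtracted separately):

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

/* Measure round-trip token-passing latency through a pair of pipes.
 * Each round trip involves two context switches, so the per-switch
 * cost is roughly half the round-trip time (minus the read/write
 * syscall overhead, which must be measured separately). */
double pipe_pingpong_us(int rounds)
{
    int p1[2], p2[2];
    char tok = 't';
    struct timeval t0, t1;

    if (pipe(p1) == -1 || pipe(p2) == -1)
        return -1.0;

    pid_t pid = fork();
    if (pid < 0)
        return -1.0;
    if (pid == 0) {                      /* child: echo the token */
        for (int i = 0; i < rounds; i++) {
            if (read(p1[0], &tok, 1) != 1) _exit(1);
            if (write(p2[1], &tok, 1) != 1) _exit(1);
        }
        _exit(0);
    }

    gettimeofday(&t0, NULL);
    for (int i = 0; i < rounds; i++) {   /* parent: send, then wait */
        write(p1[1], &tok, 1);
        read(p2[0], &tok, 1);
    }
    gettimeofday(&t1, NULL);
    waitpid(pid, NULL, 0);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    return us / (2.0 * rounds);          /* per context switch */
}
```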
Note: I've only recently been taking syscall overheads into
account. Previously, I wasn't interested in absolute times, as I was
first focussing on the cost of run queue lengths and later tracking
down the variance in my tests.
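A simple way to estimate the syscall overhead that needs subtracting is to time a tight loop of a near-trivial syscall (a sketch; getppid() is chosen here because it does almost no work in the kernel):

```c
#include <unistd.h>
#include <sys/time.h>

/* Estimate minimal syscall cost by timing a tight loop of getppid()
 * calls.  The per-call time approximates pure syscall entry/exit
 * overhead, which can then be subtracted from pipe read/write or
 * sched_yield() timings to get the context switch cost itself. */
double syscall_overhead_us(int iters)
{
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    for (int i = 0; i < iters; i++)
        (void)getppid();
    gettimeofday(&t1, NULL);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    return us / iters;
}
```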
So this raises another question as to why my test gives a much lower
context switch time than lmbench. While it's possible there is some
subtle flaw in my test, it's hard to see how I can come up with a much
smaller value than reality without some fairly obvious flaw.
An alternative hypothesis is that lmbench has some extra, unaccounted-for
overhead. If we speculate that lmbench has a 2 us overhead, then
the corrected times vary from 1.94 us to 3.12 us. This takes the
variance of the lmbench results to over 60%.
Now, I'm not saying there is a flaw in the lmbench code (I wouldn't
leap to such a conclusion without investigating the lmbench code), but
perhaps there is some other subtle effect that yields an effective
overhead or variance. Perhaps it is related to the use by lmbench of
floating point arithmetic. Maybe there is still some remaining problem
with lazy FPU state saving.
Regards,
Richard....
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/