Re: Interesting scheduling times - NOT

Richard Gooch (rgooch@atnf.csiro.au)
Mon, 21 Sep 1998 17:56:30 +1000


Linus Torvalds writes:
> In article <199809182339.JAA06430@vindaloo.atnf.CSIRO.AU>,
> Richard Gooch <rgooch@atnf.csiro.au> wrote:
> >
> >On successive runs of the code (no other changes made, no-one else on
> >the system, no X running), I got 8.2 us and 17.5 us with no extra
> >processes.
>
> Umm, what do you use to make sure that all processes start at the same
> time? Ie, what's your "start" criterion?

The processes don't all need to start at the same time. All that
really matters is that they have all started by the time I do the
measurement. Hence the usleep (20000).
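
In outline, the harness looks something like this (a simplified
sketch, not the actual benchmark code, which is at the URL below):

/* Sketch: spawn the yielder, give it time to reach its loop, then
 * measure.  Error checking omitted.                                  */
#include <sched.h>
#include <unistd.h>

int main (void)
{
    if (fork () == 0)
    {
        /* The yielder: just hands the CPU back, over and over.       */
        for (;;) sched_yield ();
    }
    usleep (20000);   /* by now the yielder is on the run queue       */
    /*  ...timed sched_yield() loop in the parent goes here...        */
    return 0;
}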

> It's entirely possible that a large portion of your run is done only
> with a part of the processes running... As such, your "minimum"
> values in particular are especially suspect, because they may be
> "minima" with no scheduling happening at all, or scheduling between
> just a few of the runs.

I'm pretty sure this doesn't happen: the usleep (20000) gives enough
time for the yielding process/thread to start doing its thing.
Furthermore, I have a sanity check in the code whereby I count the
number of yields in the yielder. For the threaded yielder case, this
should be exactly 100 000, and it is. For the forked yielder case, it
should be a little over 100 000, and again it is.
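
The yielder side is essentially this (again only a sketch):

/* Sketch of the yielder (threaded case): count every yield so the
 * harness can check afterwards that scheduling really alternated as
 * expected.  In the forked case the count has to be reported back to
 * the parent instead of read directly.                               */
#include <sched.h>
#include <stddef.h>

volatile unsigned long yield_count = 0;
volatile int stop_yielder = 0;

void *yielder (void *unused)
{
    while (!stop_yielder)
    {
        sched_yield ();
        ++yield_count;   /* compared against the measuring loop count */
    }
    return NULL;
}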

I assure you I've thought about these issues and I'd know if the
yielder hadn't started yet. I also understand how the scheduler is
supposed to schedule the two SCHED_FIFO processes (both from the
advertised interface and by grokking the actual code).

I am sure that when my main loop goes through 10 calls to
sched_yield(), there have been a total of 20 context switches. You
don't need to pass tokens through a pipe to guarantee a known sequence
of context switches.
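
With both processes at the same SCHED_FIFO priority, sched_yield()
puts the caller behind the other runnable process at that priority,
so each pass of my measuring loop is exactly two switches.
Schematically (illustrative only, not the actual code):

/* Both the measurer and the yielder are set up like this:            */
#include <sched.h>

void make_fifo (void)
{
    struct sched_param param;

    param.sched_priority = sched_get_priority_min (SCHED_FIFO);
    sched_setscheduler (0, SCHED_FIFO, &param);   /* needs root       */
}

/* One pass of the measuring loop:
 *   measurer calls sched_yield ()  ->  switch to the yielder    (1)
 *   yielder  calls sched_yield ()  ->  switch back to measurer  (2)
 * so 10 passes are guaranteed to be 20 context switches.             */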

> >Another pair of successive runs, this time with 12 extra low-priority
> >processes, I got 35.2 us and 51.0 us. Pentium 100 (this is an old
> >Intel Neptune board).
> >
> >I've now modified my test code to compute minimum and maximum
> >latencies (no extra processes):
> >Minimum scheduling latency: 6.2 us
> >Average scheduling latency: 15.0 us
> >Maximum scheduling latency: 40.5 us
> > and on the very next run:
> >Minimum scheduling latency: 6.2 us
> >Average scheduling latency: 8.6 us
> >Maximum scheduling latency: 38.0 us
> > and a few runs later:
> >Minimum scheduling latency: 7.7 us
> >Average scheduling latency: 11.8 us
> >Maximum scheduling latency: 145.9 us
> >
> >Looking at the first pair of results, we can see that the minimum
> >scheduling time is relatively stable and the maximum isn't too bad
> >either. But the average does indeed change a lot.
>
> Which is entirely consistent with the idea that you have a "random"
> amount of processes during the first and last few thousand
> iterations.

And it's also entirely consistent with cache interactions, I hope
you'll agree. Further, there is no reason to think that in the last
few thousand iterations the low-priority processes are gone, and
during the first few thousand iterations I don't see why there should
be any problems either, as I leave sufficient time for them to get
started.

Also, I get variability even *without* the extra processes, even in
the minimum times. I can run the test from one window and get results
that differ fairly consistently from those I get in another window.
Now, one difference I can think of between those two windows is that
each sits in a different place in the list of tasks. That points to
cache effects.

> This is why using a "packet" based approach is so stable, and why using
> "sched_yield()" is so unstable. "lmbench" makes _sure_ that it always
> gets exactly the expected re-scheduling behaviour by passing along a
> packet in a pair of pipes. You have no such synchronization as far as I
> can tell.
>
> > This must mean that
> >the distribution of scheduling times has changed from run to run, and
> >I suspect memory subsystem/cache interactions.
>
> There could certainly be cache effects, but if I were you I'd look very
> very closely at synchronization issues in your program.

Yes, I have, and I'm confident synchronisation isn't a problem.

> This could easily explain why you get different results the first time
> you run the program: you get page-faults in one or more of the threads,
> so part of the yield() loop won't result in any schedule at all simply
> because there is only one process runnable, for example.

That's why I run through 20 context switches before I start measuring:
that way the yielder has gone through 10 calls to sched_yield(). It
doesn't matter if it took 1 day to page in the code, since I don't
start measuring until later.
And once pages are in, they will stay in.
Furthermore, I do all these tests with a warm buffer/page cache (the
PPro 180 has 192 MBytes of RAM and right now has about 110 MBytes
free, and yet I still get variability).
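
In outline the measuring side looks like this (a simplified sketch:
the real code, at the URL below, also times each pass individually to
get the minimum and maximum):

/* Simplified measuring loop: warm up first, then time the yields.    */
#include <sched.h>
#include <stdio.h>
#include <sys/time.h>

#define WARMUP_YIELDS  10       /* 20 context switches before timing  */
#define NUM_YIELDS     100000

static void measure (void)
{
    int i;
    struct timeval start, stop;
    double elapsed_us;

    /* Warm up: page in the code paths before the clock starts.       */
    for (i = 0; i < WARMUP_YIELDS; ++i) sched_yield ();

    gettimeofday (&start, NULL);
    for (i = 0; i < NUM_YIELDS; ++i) sched_yield ();
    gettimeofday (&stop, NULL);

    elapsed_us = (stop.tv_sec - start.tv_sec) * 1e6 +
                 (stop.tv_usec - start.tv_usec);
    /* Each pass is two context switches.                             */
    printf ("Average scheduling latency: %.1f us\n",
            elapsed_us / (2.0 * NUM_YIELDS));
}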

> You even have a "usleep()" somewhere in your benchmark, and it's
> fairly unrealistic to expect the different yield loops to be
> synchronized to each other..

Nope, that's not what the usleep() is there for. It's just there to
ensure that the low-priority processes are really on the run queue
before I start measuring. It doesn't matter for the yielder process,
since I've got a 100% reliable mechanism to ensure it's running.
And, as I said before, I get variability even without the low-priority
processes.
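
The idea is simply to put runnable, low-priority processes on the run
queue before measuring; roughly (sketch):

/* Sketch: put num_extra runnable but low-priority processes on the
 * run queue before measuring.  They never get the CPU while the
 * SCHED_FIFO pair is runnable, but they do lengthen the run queue.   */
#include <unistd.h>

static void spawn_background (int num_extra)
{
    int i;

    for (i = 0; i < num_extra; ++i)
    {
        if (fork () == 0)
        {
            nice (19);      /* lowest SCHED_OTHER priority            */
            for (;;) ;      /* spin, so the process stays runnable    */
        }
    }
    usleep (20000);        /* make sure they are all on the run queue */
}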

Finally, I've even added an option to pass tokens via pipes and I get
the same results (modulo a few extra us because of pipe
overheads). This shows that my use of sched_yield() is sound: I still
see the variability and the extra cost for each process on the run
queue.
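
The pipe variant is the usual token ping-pong, roughly (sketch):

/* Sketch of the pipe-token variant: each side blocks in read() until
 * the other side writes, which forces the context switches
 * explicitly (the lmbench approach).                                 */
#include <unistd.h>

static void pipe_pingpong (int num_passes)
{
    int to_child[2], to_parent[2];
    char token = 't';
    int i;

    pipe (to_child);
    pipe (to_parent);
    if (fork () == 0)
    {
        for (;;)    /* child: echo the token straight back            */
        {
            if (read (to_child[0], &token, 1) != 1) _exit (0);
            write (to_parent[1], &token, 1);
        }
    }
    for (i = 0; i < num_passes; ++i)
    {
        write (to_child[1], &token, 1);    /* wake the child ...      */
        read (to_parent[0], &token, 1);    /* ... and wait for it back */
    }
}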

The good news is that with the FPU saving fix you posted for 2.1.122
the variability has been reduced, and the cost of extra processes on
the run queue also appears to have been reduced (hard to say for sure
because of the variability).

On a PPro 180 each extra process on the run queue increases the
scheduling overhead by 0.2 us. For a Pentium 100 it's 0.91 us.
Latest results/code at:
http://www.atnf.csiro.au/~rgooch/benchmarks.html

Regards,

Richard....
