Re: Interesting scheduling times

Larry McVoy (lm@bitmover.com)
Tue, 22 Sep 1998 00:06:09 -0600


: You make it sound like I don't know what I'm measuring. This is
: incorrect.

Umm, err, if that is true, then you should have an answer as to why
your tests have such wild variance. My claim that you don't know what
you are measuring is based on your varying results being dismissed
with hand waving. I'm not sure if you understand how most people go
about benchmarking, but I believe that they all pretty much know what
the answer is going to be ahead of time. When the results come back,
they either confirm that you knew what you were talking about or tell
you that you don't understand what is going on. I respectfully suggest
that you fall into the latter camp and will stay there until you can
explain your variance and prove your explanation.

I don't really care what you do, personally. But if you show up on this
list with that benchmark and expect people to change the kernel based
on results you can't explain, then I'm gonna speak up and point out the
flaws.

I'm not saying you are right or that you are wrong. I'm saying your
results are essentially uninteresting until you can explain what they
mean, all of them, not just the ones that you want to present.

: Using these values I can then look at where time can be
: saved. Further, my tests have shown variability, which has led me to
: investigate the cause of that variability. If the variability is due
: to caching problems, I can look at how to minimise the effects.

Come on, think about it. What's in the cache that could cause this
much variance? You keep saying caching problems and I keep telling you
that that can't be it; and I can prove it by demonstrating a benchmark
that measures what you claim to measure and doesn't have the variance.
Not only that, you can think about how much code and data is involved
here and actually work out exactly what the number should be. Your mins
are close, but your averages and maxes are way out of line. I can't
think of an explanation that would account for those out-of-line
numbers, and neither have you. So go back, do the homework, and figure
out what is going on when things vary that much. Something must be
getting weird somewhere; what is it? Why don't other benchmarks see it?
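
To make the point concrete, here is roughly the shape of benchmark I
mean. This is a minimal sketch of a yield ping-pong, my guess at what
such a test looks like (lmbench's lat_ctx does something similar with
pipes); the point is that so little code and data is involved that you
can count the cache lines and predict the per-iteration cost before
you run it:

    /*
     * Minimal sketch of a yield ping-pong: a child and the parent
     * alternating via sched_yield().  With two runnable processes,
     * each yield is (ideally) one trip through schedule().
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sched.h>
    #include <sys/time.h>

    static double usecs(void)
    {
            struct timeval tv;

            gettimeofday(&tv, 0);
            return tv.tv_sec * 1e6 + tv.tv_usec;
    }

    int main(void)
    {
            int i, iters = 100000;
            double start;

            if (fork() == 0) {
                    for (i = 0; i < iters; i++)
                            sched_yield();
                    exit(0);
            }
            start = usecs();
            for (i = 0; i < iters; i++)
                    sched_yield();
            printf("%.2f usecs per yield\n", (usecs() - start) / iters);
            return 0;
    }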

It's a useful exercise, by the way. Every time I'm in your shoes (and
I've been there a lot), I gain a great deal of insight by figuring out
what is going on. Who knows, maybe you'll bump into some great discovery
in the process. Stranger things have happened.

: I've been able
: to reorder struct task_struct and reduce the context switch latency as
: a result of my analysis (from 0.2 us per run queue process to 0.15
: us).

I believe that if you go back and read what I wrote in one of the first
postings on this, I suggested exactly that, and also suggested that you
could get it down to one cache line miss.
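
For anyone following along, the idea is just to pack the fields that
schedule() and goodness() read for every task on the run queue into
the same cache line. A hypothetical sketch; the field names are
loosely modeled on task_struct, and the grouping is the point, not the
specific members:

    struct task_struct {
            /*
             * Hot: schedule()/goodness() read these for every task on
             * the run queue, so packed together they cost one cache
             * miss per task instead of several.
             */
            long                    counter;        /* time slice left */
            long                    priority;
            struct task_struct      *next_run;
            /* ... rest of the hot line ... */

            /*
             * Cold: not touched on the scheduling fast path, so it
             * can live anywhere below: signal state, fd tables,
             * rlimits, accounting, and so on.
             */
    };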

: Part of this is the interrupt latency. I've seen people looking at
: this on the list, but I've not seen much attention being paid to how
: long it takes to wake a process up and have it start running
: (i.e. switch out whatever is running now and switch in my RT task).

The reason nobody is looking at it is that it isn't a significant
problem. People are looking at interrupt latency because it needs to
be faster.
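
And for what it's worth, that path is already easy to measure: a pipe
ping-pong exercises exactly the sleep/wakeup/switch sequence. Here's a
rough sketch along the lines of lmbench's lat_pipe; each round trip
includes two sleeps, two wakeups, and two context switches:

    /*
     * Rough sketch of a wakeup-latency measurement: parent and child
     * ping-pong one byte over a pair of pipes.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(void)
    {
            int p1[2], p2[2], i, iters = 10000;
            char c = 0;
            struct timeval s, e;
            double total;

            pipe(p1);
            pipe(p2);
            if (fork() == 0) {
                    for (i = 0; i < iters; i++) {
                            read(p1[0], &c, 1);
                            write(p2[1], &c, 1);
                    }
                    exit(0);
            }
            gettimeofday(&s, 0);
            for (i = 0; i < iters; i++) {
                    write(p1[1], &c, 1);    /* wake the child ... */
                    read(p2[0], &c, 1);     /* ... sleep until it answers */
            }
            gettimeofday(&e, 0);
            total = (e.tv_sec - s.tv_sec) * 1e6 + (e.tv_usec - s.tv_usec);
            printf("%.2f usecs per round trip\n", total / iters);
            return 0;
    }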

: What matters is that when it has
: to wake up, it does so as quickly as possible. One of the overheads
: here is the length of the run queue. Having separate run queues will
: mean you are one step closer to safely putting more "normal" load on a
: machine without worrying about what it does to your RT latencies.

You are worried about the wrong problem. You're dealing with a 3rd or
4th or 5th order term while ignoring the first-order terms.

I have a question for you. What do you think would happen if you took
that benchmark you have and actually touched some data between each
sched_yield()?
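
Concretely, I mean something like this: sweep a buffer between yields
so each process arrives at the switch with its own data hot in the
cache and everyone else's evicted. A rough sketch; the working set
size and line stride are arbitrary assumptions, so pick values to
match the machine:

    /*
     * Same yield loop, but sweep a working set between yields so
     * each process arrives at the switch with a dirty cache.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sched.h>
    #include <sys/time.h>

    #define SIZE    (64 * 1024)     /* assumed: bigger than L1 */

    static volatile char buf[SIZE];

    static void touch(void)
    {
            int j;

            for (j = 0; j < SIZE; j += 32)  /* one byte per cache line */
                    buf[j]++;
    }

    int main(void)
    {
            int i, iters = 100000;
            struct timeval s, e;
            double total;

            if (fork() == 0) {
                    for (i = 0; i < iters; i++) {
                            touch();
                            sched_yield();
                    }
                    exit(0);
            }
            gettimeofday(&s, 0);
            for (i = 0; i < iters; i++) {
                    touch();
                    sched_yield();
            }
            gettimeofday(&e, 0);
            total = (e.tv_sec - s.tv_sec) * 1e6 + (e.tv_usec - s.tv_usec);
            printf("%.2f usecs per touch+yield\n", total / iters);
            return 0;
    }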

I have another question for you. How about adding an RT run queue and
running your system (not your benchmark) on the modified kernel? If
what you think is true, you ought to be able to measure a throughput
difference, right? So let's get some numbers from a real application,
not a flawed benchmark.
