Re: Scheduling Times --- Revisited

Richard Gooch (rgooch@atnf.csiro.au)
Sun, 27 Sep 1998 16:40:10 +1000


Larry McVoy writes:
> teamwork@freemail.c3.hu:
> [a bunch of really reasonable stuff - thanks]
>
> I'd like to start out by saying I feel crappy about my part in this
> whole discussion. I'm good at proving I have more to learn. I
> don't want to discourage people like Richard - that isn't the right
> answer. The right answer is to help him do better work and
> encourage better thinking, in a supportive way, and I haven't done
> that. So I'm sorry for screwing that up, that's my problem.

Well, thanks. This sounds more positive. However, I should still
point out that a phrase like "encourage better thinking" isn't all
that tactful.

> : "Have you ever considered that maybe you're
> : worrying about a problem that doesn't exist?"
> :
> : How do you know that, Larry?
>
> I don't, I just have a strong hunch. It's easy to prove me wrong -
> add the changes and measure an application (not a benchmark) and see
> if it changes.
>
> I think what is getting lost here is that we are talking about tiny
> units of time. Unless you have a single process application with
> nothing else going on, to think that you are actually going to tune
> the system to the point that you'll notice the extra cache
> misse[s]/ctx switch is pretty unlikely. Possible yes, but also
> quite unlikely.

I don't know why you're focussing on the cache issues. Yes, I raised
these issues, but essentially as a side issue (regarding a possible
cause of variance).

My main focus is on the overhead of a unified run queue. While I did
post a patch to reorder task_struct to minimise cache effects, that
was really a second-order effect that I was investigating. The main
game is still the time it takes to process the run queue, even if you
have a warm cache and no aliasing or other problems.
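
To make this concrete, the kind of scan I mean looks roughly like the
sketch below (simplified, with made-up names; this is not the actual
scheduler code):

/* A minimal sketch of a unified run queue scan, assuming a singly-
 * linked list of runnable tasks and a precomputed goodness()-style
 * weight.  Names and layout are illustrative only. */
struct task {
    struct task *next_run;   /* next task on the run queue */
    int          weight;     /* "goodness" of this task     */
};

struct task *pick_next(struct task *runqueue_head)
{
    struct task *p, *best = NULL;
    int best_weight = -1;

    /* The cost grows linearly with run queue length, even with a
     * warm cache: every runnable task is examined on every pass
     * through the scheduler. */
    for (p = runqueue_head; p != NULL; p = p->next_run) {
        if (p->weight > best_weight) {
            best_weight = p->weight;
            best = p;
        }
    }
    return best;
}

Every extra runnable process adds to that loop, and the loop runs on
every schedule, whether the woken process is RT or not.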

> : The problem _may_ not be staring at our collective faces right now,
> : but there is no proof it doesn't exist.
>
> Nor is there evidence that it is a problem in the real world and
> there is quite a bit of past evidence that it isn't. That's the
> issue in a nutshell. We've been discussing this ad nauseum with no
> justification in the form of an application. It's just Richard's
> claim that this will be a problem. Maybe yes, maybe no - isn't it
> reasonable to motivate things from applications?

A few points here. Firstly, you said a number of times that "real" or
"correct" applications don't have a large run queue. I've measured a
"real" application that can generate a run queue length of 10
processes. You've then gone on to say the application is badly
designed and should be changed. For us, and other institutions like
us, that is *simply not an option*. Our resources are limited (hence
the worldwide collaborative effort in the first place), and after 50+
man-years of effort, it *isn't* going to be thrown away. Ignoring
technical issues, the powers-that-be in this arena aren't going to
make a radical change. This software has to work. And it does
work. The use of this software for online data reduction on our new
instrument (a world leader, BTW) demonstrates that it *does* work.

Secondly, you have asserted that it won't be a problem for "realistic"
applications which only have a few processes on the run queue. My
tests on a Pentium 100 show that a mere 3 extra processes on the run
queue doubles the scheduling/wakeup latency of an RT process.
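
For anyone who wants to check this independently, the technique is
nothing fancy. The sketch below gives the flavour; a proper test would
also put the measuring process under SCHED_FIFO with
sched_setscheduler() and repeat the measurement many times to average
out the noise.

/* Sketch only, not my actual test program: put some spinning processes
 * on the run queue, then time how long a blocked process takes to wake
 * up when another process prods it through a pipe. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int i, nbusy = (argc > 1) ? atoi(argv[1]) : 3;
    int fds[2];
    pid_t busy[64];
    struct timeval t0, t1;

    for (i = 0; i < nbusy && i < 64; i++) {     /* extra runnable processes */
        busy[i] = fork();
        if (busy[i] == 0)
            for (;;) ;                          /* child just spins */
    }

    pipe(fds);
    if (fork() == 0) {                          /* the "waker" process */
        sleep(1);                               /* let the parent block first */
        gettimeofday(&t0, NULL);                /* timestamp just before wakeup */
        write(fds[1], &t0, sizeof t0);
        _exit(0);
    }

    read(fds[0], &t0, sizeof t0);               /* parent blocks here ... */
    gettimeofday(&t1, NULL);                    /* ... and timestamps on wakeup */
    printf("wakeup latency: %ld us\n",
           (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec));

    for (i = 0; i < nbusy && i < 64; i++)       /* clean up the spinners */
        kill(busy[i], SIGKILL);
    return 0;
}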

Thirdly, you say that the latency will be lost in the noise. I don't
think this is true. While a cold cache will of course increase the RT
wakeup latency, from my task_struct reordering experiment, this appears
to be a second-order effect. The greatest cost is scanning the run
queue. This is expensive even with a warm cache. The other
consideration you seem to have is how long the RT process will run for
compared to the wakeup latency. This will of course vary with the
application, but some of our RT applications have threads which run
for a very short time (read from blocking device, compute a new value
and write, place the recently read value into SHM, unlock a semaphore
and read again (block)). These are high-priority threads that are a
small step above a device driver. Latency matters here, since we want
to write out the new value ASAP.
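
In outline, such a thread is nothing more than the loop sketched below
(illustrative names and POSIX-style calls; the real code runs under
pSOS+ with its own IPC primitives):

#include <semaphore.h>
#include <unistd.h>

/* Illustrative declarations: in the real systems these come from the
 * device driver and a shared-memory segment set up elsewhere. */
extern int    dev_fd;                       /* blocking device */
extern struct { double latest; } *shm;      /* shared-memory area */
extern sem_t  data_ready;
extern double compute(double sample);

void control_loop(void)
{
    double sample, value;

    for (;;) {
        read(dev_fd, &sample, sizeof sample);   /* block on the device       */
        value = compute(sample);                /* small amount of work      */
        write(dev_fd, &value, sizeof value);    /* write the new value ASAP  */
        shm->latest = sample;                   /* publish to shared memory  */
        sem_post(&data_ready);                  /* wake low-priority readers */
    }
}

The time spent in compute() is tiny; what matters is how quickly the
thread gets back onto the CPU after the read() completes.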

Finally, you want me to show you how the existing system "hurts" a
real application. This is hard for me to do, because the existing
system is running under pSOS+. I'm working from the position of
convincing people to do their new developments (or upgrading old
systems) using Linux and not pSOS+. As such, I've got to convince them
that Linux can hack it in the real world. While RT-Linux has some
appeal, it also has disadvantages. It is a difficult programming
environment (RT tasks are really kernel modules) and doesn't have the
familiar IPC facilities (mutexes, semaphores, message queues and
such).

One of the advantages to using Linux for RT applications (compared to
pSOS+) is that your development environment is the same as your
execution environment. Being able to edit, compile and link your new
RT application, and then send SIGHUP to the running RT application and
have it restart the new version of the code is one of the strong
appeals for using Linux, and one I don't hesitate to mention ;-)
Compare this with the pSOS+ development cycle, which is primitive in
comparison. RT-Linux lies somewhere in between.
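
For what it's worth, the SIGHUP trick is nothing exotic. The skeleton
is roughly as below (a hypothetical example, not code from any of our
applications):

#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t restart_requested = 0;
static char **saved_argv;

static void on_sighup(int sig)
{
    (void) sig;
    restart_requested = 1;
}

int main(int argc, char **argv)
{
    (void) argc;
    saved_argv = argv;
    signal(SIGHUP, on_sighup);

    for (;;) {
        sleep(1);                  /* stand-in for the real RT processing */
        if (restart_requested)     /* re-exec the freshly rebuilt binary  */
            execv(saved_argv[0], saved_argv);  /* assumes argv[0] is a usable path */
    }
}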

Another contender is OSF/1 from Digital. It will have a development
environment similar to that of Linux, and is already being used for some
RT work here. Those of us who would rather see Linux running here want
to see the gap between Linux and anything else large, and ever
widening (with Linux being in front). It sometimes takes several
technical arguments to overcome a single political/emotional
argument. Ask yourself this: where did all the VMS dinosaurs go? Some
went to VMS++ (NT) and some went to OSF/1 from Digital. There are
non-technical pressures to overcome.

So, Larry, in the end I can't (at least not now) point to a running
application that suffers because of the unified run queue in
Linux. But I've done careful measurements and analysis to understand
the issues and put up numbers showing potential problems. I've talked
about existing and new applications we have here and shown how they
can suffer due to the current behaviour. I think I've put forward a
quite reasonable case.
Please bear in mind I'm also faced with a chicken-and-egg problem: to
get a real RT application running under Linux requires that people
take the plunge. However, people first want to be convinced that it's
safe to take the plunge.

Please also note that my interest in this has been on the basis of
trying to answer people's questions. Questions like "what's the
latency" and "is Linux pre-emptive". I didn't even go looking for the
effects of run queue length: I stumbled across them while doing some
basic measurements.

I probably would never have noticed this effect if I'd used lmbench
(no, I'm not denigrating lmbench: I'm just pointing out the benefit of
diversity), since the FPU save/restore time dominates the effect at
small run queue lengths (< ~5). It's because someone queried the
variance when running under X (a mere 2 extra processes on the run
queue) that I got suspicious.

> I'm sorry if it sounds like I don't want to change anything - that
> is not my position. I just want to see some solid engineering when
> it comes to important parts of the system. I'm probably especially
> touchy about the scheduler because (a) I think the Linux scheduler
> is pretty good, and (b) I had a lot to do with why it is so good, I
> was able to demonstrate how bad the old one was (remember the one
> queue for all processes scheduler?).

Don't get me wrong: I'm not saying that the current scheduler is
crap. I think my measurements have shown how damn good Linux is. I just
think there is room for improvement in an area where every little bit
counts.
I've been pretty careful in my technique. I'm not doing "voodoo
engineering". My arguments are well thought out. It's OK if you aren't
convinced by them, but that doesn't mean that it's voodoo.

> So look at the proposed changes by Richard. (A) none of these affect
> regular systems in any measurable way other than a toy benchmark
> (lmbench or his, I don't care, the 0 sized process is a toy
> benchmark). (B) The tests he is using vary way more than the effect
> of the change. (C) He hasn't gone from application to tests, he
> just thinks the application will see the effects.

A: I'm not talking about regular systems. I'm talking about shared
RT/user systems. And the benchmark is absolutely *not* a toy, just
because it doesn't manipulate large blocks of data. A properly
designed RT system leaves large data manipulation to low-priority
processes where latency doesn't matter.

B: The variance is now less than the effect of a large (~10) run
queue, since I've tracked down some "uninteresting" effects.

C: See above for why I don't have an application ready to run and
test: existing ones are for pSOS+. However, from an understanding of
what our existing applications do, the latency is *not* lost in the
noise.

> It's really the last one that is an issue. It's pretty obvious to
> any systems person that we are talking about a very small effect.
> It's very questionable to claim that that effect will be seen by
> applications. I'm not saying it won't, but I /am/ saying that it
> isn't very likely. And I am qualified to say that, I'm a systems
> person with a fair bit of experience in this area. That /doesn't/
> mean I'm right, it means that it is an area that is not obvious.
> Given that, and given that other systems people share that opinion,
> the burden for justification starts to fall on the person that is
> proposing the change.

We *do* have applications under pSOS+ where latency matters. It's
reasonable to assume that under Linux latency will still matter.

> No, we don't wait for problems to arise. When there is an /obvious/
> problem and someone has a reasonable fix, Linus just applies it
> because both parts are obvious.

This isn't true. I've seen many performance tuning patches on this
list. Many have not been based on any real applications that have had
problems, and a fair number have been in minor code paths. That hasn't
stopped them being included. This happens all the time in Linux. If
the change is simple and doesn't hurt other things, it tends to go
in.

The separate run queues idea isn't going to bloat the kernel either,
before you raise that argument. Some code paths will in fact be
simplified and made more robust. The bug in goodness() is a classic
case of this.
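
To make the idea concrete, the selection logic with a separate RT queue
could look something like the sketch below (made-up names; a sketch of
the idea rather than a patch against the real scheduler):

struct task_struct;                     /* as in the kernel */

/* Illustrative helpers, not existing kernel functions. */
extern int  rt_queue_empty(void);
extern struct task_struct *rt_queue_head(void);
extern struct task_struct *scan_for_best(void);

struct task_struct *pick_next_task(void)
{
    if (!rt_queue_empty())
        return rt_queue_head();     /* RT queue kept sorted by rt_priority,
                                       so no scan of ordinary user tasks   */
    return scan_for_best();         /* existing goodness()-style linear scan
                                       over the SCHED_OTHER tasks          */
}

The RT path gets shorter and simpler, and the SCHED_OTHER path is left
untouched.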

> In my opinion, this is neither an obvious problem nor an obvious
> fix. I think the whole area of RT scheduling could stand with some
> serious thinking and this is just a bandaid fix. Am I the only
> person that noticed that some of the context switch benchmarks got
> slower when we used SCHED_FIFO? Seems like a possible problem,
> doesn't it?

I explained that days ago. What was happening was that my shell,
xterm and X server were on the run queue and stayed there for the
duration of the benchmark, and hence slowed down context switching
(which just demonstrates my point). When I added a 0.2 s sleep to give
them time to get off the run queue, that effect went away.

Also, from looking at the scheduler code, I don't see how SCHED_FIFO
can give larger context switch times compared to SCHED_OTHER (when you
have the same run queue length for both). The !SCHED_OTHER code paths
are in fact shorter.
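
From memory, the relevant branch is roughly the following (a
paraphrase, not a verbatim quote of kernel/sched.c):

#include <linux/sched.h>   /* struct task_struct, SCHED_OTHER */

/* Paraphrased from memory: RT tasks take an early, fixed-cost branch,
 * while SCHED_OTHER tasks go on to the counter/priority calculation. */
static inline int goodness_sketch(struct task_struct *p)
{
    if (p->policy != SCHED_OTHER)
        return 1000 + p->rt_priority;   /* always beats any SCHED_OTHER task */

    return p->counter;                  /* plus further adjustments, omitted here */
}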

> Anyway, it's not at all obvious to me that Richard's changes will
> make any difference to an application. Richard has repeatedly
> complained that I don't show him enough respect and he's right. So
> I'm trying to fix that. On the other hand, he's not exactly showing
> a lot of respect to the systems people (not just me) that are not
> exactly jumping up and down with enthusiasm for his changes.

So you're saying that if I make some measurements, and see what I
think is a problem and post that, it's disrespectful? I'm entitled to
disagree with you and say so. As are you. Respect is about being
polite, not denigrating people or being abusive.

> I think Richard could do some great work here, he could take on the
> scheduler and make Linux have the best behaviour and performance
> possible. If he wants to really do the job right, I'd love to help
> out (maybe he doesn't want my help, that's fine too). But so far,
> it's my opinion that this whole process has been voodoo engineering
> and I don't think that's a basis for positive change.

It's not fair to call it voodoo engineering. It seems to me that you
don't accept that the small, latency-sensitive applications we have
here (currently implemented with pSOS+) actually exist.
That doesn't sound very respectful. Are you calling me a liar?

I didn't make this stuff up. I just walked down the corridor and asked
about the applications and what they do. This all started because the
RT application authors asked questions about Linux and RT performance.
That in itself should say something about "real applications".

To get to your other point: general improvements to the scheduler, I'm
not sure what you have in mind. It looks pretty good already. Apart
from the RT/long run queue latency issue, the only thing that
occurred to me is the recalculation of counters. This walks the entire
process table. It would be nice to be able to avoid that.
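
The recalculation I mean is the one that fires when every runnable
process has exhausted its counter; from memory it is essentially a walk
over every task in the system, along the lines of the sketch below.

#include <linux/sched.h>   /* for_each_task(), struct task_struct */

/* Roughly what happens (from memory) when all runnable processes have
 * used up their time slices: *every* task in the system is visited,
 * not just the runnable ones. */
static void recalculate_counters(void)
{
    struct task_struct *p;

    for_each_task(p)
        p->counter = (p->counter >> 1) + p->priority;
}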

Regards,

Richard....

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/