Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3

From: Martin J. Bligh
Date: Tue Mar 30 2004 - 10:04:01 EST

--Erich Focht <efocht@xxxxxxxxxxxx> wrote (on Tuesday, March 30, 2004 00:30:25 +0200):

> On Thursday 25 March 2004 23:28, Martin J. Bligh wrote:
>> Can we hold off on changing the fork/exec time balancing until we've
>> come to a plan as to what should actually be done with it? Unless we're
>> giving it some hint from userspace, it's frigging hard to be sure if
>> it's going to exec or not - and the vast majority of things do.
> After more than a year (or two?) of discussion, there's still no better
> idea than giving a userspace hint. The default should be to balance at
> exec(), and maybe use a syscall to say: balance all children this
> particular process is going to fork/clone at creation time. Everybody
> has reached the conclusion that we can't foresee what's optimal, so
> there is only one solution: make the behavior controllable. Give the
> user a tool to improve performance. A small inheritable variable in the
> task structure is enough. Whether you give the hint at run-time, before
> run-time, or even at compile-time is not really the point...

Agreed ... absolutely.
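
To make the shape of that concrete, here's a rough model in plain C of
what the hint could look like -- every name in it (balance_hint,
BALANCE_AT_EXEC, BALANCE_AT_FORK, do_fork, do_exec) is hypothetical,
nothing like this is in the tree. It just shows the flag being copied to
the child at fork and consulted to decide whether the child gets spread
at creation time or left alone until exec():

/*
 * Hypothetical sketch only: models the "small inheritable variable"
 * idea in plain C.  None of these names exist in the kernel.
 */
#include <stdio.h>

enum balance_hint {
	BALANCE_AT_EXEC = 0,	/* default: keep child local, migrate at exec() */
	BALANCE_AT_FORK		/* HPC hint: spread children at fork/clone time */
};

struct task {
	enum balance_hint hint;	/* the inheritable per-task hint */
	int node;		/* node the task runs on */
};

/* fork/clone path: the hint is inherited; only the hinted case picks
 * a remote node right away (round-robin stands in for "least loaded"). */
static struct task do_fork(const struct task *parent, int nr_nodes)
{
	static int next;
	struct task child = *parent;	/* hint inherited by the child */

	if (child.hint == BALANCE_AT_FORK)
		child.node = next++ % nr_nodes;
	return child;
}

/* exec path: the default policy balances here instead. */
static void do_exec(struct task *t, int nr_nodes)
{
	static int next;

	if (t->hint == BALANCE_AT_EXEC)
		t->node = next++ % nr_nodes;
}

int main(void)
{
	struct task shell = { BALANCE_AT_EXEC, 0 };
	struct task mpi_launcher = { BALANCE_AT_FORK, 0 };
	struct task cmd;
	int i;

	/* a shell command: stays put at fork, gets balanced at exec() */
	cmd = do_fork(&shell, 4);
	do_exec(&cmd, 4);
	printf("shell command on node %d\n", cmd.node);

	/* an HPC launcher with the hint set: workers spread at fork */
	for (i = 0; i < 4; i++) {
		struct task worker = do_fork(&mpi_launcher, 4);
		printf("worker %d on node %d\n", i, worker.node);
	}
	return 0;
}

Whatever the interface ends up being (syscall, prctl-style knob, or
something else), it only has to flip that one field in the parent; the
children pick it up for free because it's inherited.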

> I don't think it's worth waiting and hoping that somebody will show up
> with a magic algorithm that balances every kind of job optimally.

Especially as I don't believe that exists ;-) It's not deterministic.

>> Clone is a much more interesting case, though at the time, I consciously
>> decided NOT to do that, as we really mostly want threads on the same
>> node.
> That is not true in the case of HPC applications. And anyone using
> OpenMP is doing exactly that kind of thing. I consider STREAM a good
> benchmark because it shows exactly the problem of HPC applications:
> they need a lot of memory bandwidth, they don't fit in cache, and the
> tasks live a really long time. Spreading those tasks across the nodes
> gives me more bandwidth per task, and the benefit accumulates because
> the tasks run for hours or days. It's a simple and clear case where
> the scheduler should be improved.
> Benchmarks simulating "user work" like SPECsdet, kernel compile, and
> AIM7 are not relevant for HPC. In a compute center it doesn't really
> matter whether some shell command returns 10% faster; it just
> shouldn't disturb my big simulation code for which I bought an
> expensive NUMA box.
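
(For concreteness: the kind of loop Erich is talking about is basically
the STREAM triad. The sketch below is just an illustration of that
workload shape, not his code: arrays far bigger than cache, threads
that live for the whole job, and nothing limiting it but memory
bandwidth.)

/*
 * Illustration only: a STREAM-style triad with OpenMP.  No cache
 * reuse, long-lived threads, limited purely by memory bandwidth --
 * exactly the case where spreading one thread per node pays off.
 * Build with an OpenMP-capable compiler.
 */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)		/* far larger than any cache */

int main(void)
{
	double *a = malloc(N * sizeof(*a));
	double *b = malloc(N * sizeof(*b));
	double *c = malloc(N * sizeof(*c));
	long i, iter;

	if (!a || !b || !c)
		return 1;
	for (i = 0; i < N; i++) {
		b[i] = 1.0;
		c[i] = 2.0;
	}

	/* the whole job is this loop, repeated for hours in real codes */
	for (iter = 0; iter < 100; iter++) {
#pragma omp parallel for
		for (i = 0; i < N; i++)
			a[i] = b[i] + 3.0 * c[i];
	}

	printf("a[0] = %f\n", a[0]);
	free(a); free(b); free(c);
	return 0;
}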

OK, but the scheduler can't know the difference automatically, I don't
think ... and whether we should tune the scheduler for "user work" or
HPC is going to be a hotly contested point ;-) We need to try to find
something that works for both. And suppose you have a 4-node system
with 4 HPC apps running? Surely you want each app to have one node to
itself? That's more the case I'm worried about than "user work" vs HPC,
to be honest.
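
FWIW, until there's a proper hint, that "one node per app" placement is
something a job launcher can already do by hand with libnuma. Rough
sketch (assumes libnuma is installed, link with -lnuma; "./hpc_app" is
just a stand-in for the real binary):

/*
 * Rough userspace sketch: fork one copy of the app per node and pin
 * each copy's CPUs and preferred memory to its own node via libnuma.
 * "./hpc_app" and the one-process-per-node layout are just examples.
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
	int node, nodes;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support here\n");
		return 1;
	}
	nodes = numa_max_node() + 1;	/* e.g. 4 on a 4-node box */

	for (node = 0; node < nodes; node++) {
		pid_t pid = fork();
		if (pid == 0) {
			numa_run_on_node(node);		/* CPUs on this node only */
			numa_set_preferred(node);	/* allocate memory locally */
			execlp("./hpc_app", "hpc_app", (char *)NULL);
			_exit(127);			/* exec failed */
		}
	}

	while (wait(NULL) > 0)
		;				/* wait for all the apps */
	return 0;
}

Crude, but it's roughly what I'd expect people to do until the
scheduler can actually be told what they want.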
