Re: [RFC] cpuset: remove sched domain hooks from cpusets
From: Paul Jackson
Date: Fri Oct 20 2006 - 15:01:11 EST
> The patch I posted previously should (modulo bugs) only do partitioning
> in the top-most cpuset. I still need clarification from Paul as to why
> this is unacceptable, though.
That patch partitioned on the children of the top cpuset, not the
top cpuset itself.
There is only one top cpuset - and that covers the entire system.
Consider the following example:
    /dev/cpuset      cpu_exclusive=1, cpus=0-7, task A
    /dev/cpuset/a    cpu_exclusive=1, cpus=0-3, task B
    /dev/cpuset/b    cpu_exclusive=1, cpus=4-7, task C
We have three cpusets - the top cpuset and two children, 'a' and 'b'.
We have three tasks, A, B and C. Task A is running in the top cpuset,
with access to all 8 cpus on the system. Tasks B and C are each in
a child cpuset, with access to just 4 cpus.
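For concreteness, here is a rough userspace sketch of how that layout
could be set up through the cpuset pseudo-filesystem (assuming it is
already mounted at /dev/cpuset, and writing node 0 to 'mems' purely so
that tasks could later be attached; this is just illustration, not code
from any patch under discussion):

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f || fputs(val, f) == EOF) {
                perror(path);
                exit(1);
        }
        /* cpuset writes are checked by the kernel when flushed */
        if (fclose(f) == EOF) {
                perror(path);
                exit(1);
        }
}

int main(void)
{
        /*
         * Child 'a': cpus 0-3, cpu_exclusive.  Writing "0" to mems is
         * only an assumption here, so tasks could later be attached.
         */
        mkdir("/dev/cpuset/a", 0755);
        write_str("/dev/cpuset/a/cpus", "0-3");
        write_str("/dev/cpuset/a/mems", "0");
        write_str("/dev/cpuset/a/cpu_exclusive", "1");

        /* Child 'b': cpus 4-7, cpu_exclusive. */
        mkdir("/dev/cpuset/b", 0755);
        write_str("/dev/cpuset/b/cpus", "4-7");
        write_str("/dev/cpuset/b/mems", "0");
        write_str("/dev/cpuset/b/cpu_exclusive", "1");

        /*
         * Tasks B and C would then be attached by writing their pids
         * to /dev/cpuset/a/tasks and /dev/cpuset/b/tasks; task A just
         * stays in the top cpuset, which covers all of cpus 0-7.
         */
        return 0;
}
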
By your patch, the cpu_exclusive cpusets 'a' and 'b' partition the
sched domains into two halves, each covering 4 of the system's 8 cpus.
(That, or I'm still a sched domain idiot - quite possible.)
As a result, task A is screwed. If it happens to be on any of cpus
0-3 when the above is set up and the sched domains become partitioned,
it will never be considered for load balancing on any of cpus 4-7.
Or vice versa, if it is on any of cpus 4-7, it has no chance of
subsequently running on cpus 0-3.
If your patch had been just an implicit optimization, benefiting
sched domains, by optimizing for smaller domains when it could do so
without any noticeable harm, then it would at least be neutral, and
we could continue the discussion of that patch to ask if it provided
an optimization that helped enough to be worth doing.
But that's not the case, as the above example shows.
I do not see any way to harmlessly optimize sched domain partitioning
based on a system's cpuset configuration.
I am not aware of any possible cpuset configuration that, by itself,
defines a partitioning of the system's cpus. In particular, the top cpuset
always covers all online cpus, and any task in that top cpuset can
run anywhere, so far as cpusets is concerned.
So ... what can we do?
What -would- be a useful partitioning of sched domains?
Not being a sched domain wizard, I can only hazard a guess, but I'd
guess it would be a partitioning that significantly reduced the typical
size of a sched domain below the full size of the system (apparently it
is quicker to balance several smaller domains than one big one), while
not cutting off any legitimate load balancing possibilities.
The static cpuset configuration doesn't tell us this (see the top
cpuset in the example above), but if one combined that with knowledge
of which cpusets had actively running jobs that should be load
balanced, then that could work.
I doubt we could detect this (which cpusets did or did not need to be
load balanced) automatically. We probably need to have user code tell
us this. That was the point of my patch that started this discussion
several days ago, adding explicit 'sched_domain' flag files to each
cpuset, so user code could mark the cpusets needing to be balanced.
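To make the intent of that earlier proposal concrete, user code might
have marked cpuset 'a' from the example above as needing balancing with
something like the following; the 'sched_domain' flag file name comes
from that (unmerged) proposal, so treat the whole snippet as
hypothetical rather than an existing interface:

#include <stdio.h>

int main(void)
{
        /*
         * Hypothetical: ask for cpuset 'a' to be actively load
         * balanced, via the proposed per-cpuset 'sched_domain' flag
         * file.  The file name is from the earlier proposal, not from
         * a merged interface.
         */
        FILE *f = fopen("/dev/cpuset/a/sched_domain", "w");

        if (!f)
                return 1;
        fputs("1", f);
        return fclose(f) == EOF ? 1 : 0;
}
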
Since proposing that patch, I've changed my recommendation. Instead
of using cpusets to drive sched domain partitioning, better to just
provide a separate API, specific to the needs of sched domains, by
which user code can partition sched domains. That, or make the
balancing fast enough, even on very large domains, that we don't need
to partition. If we do have to partition, it would basically be for
performance reasons, and since I don't see any automatic way to
correctly partition sched domains, I think it would require some
explicit kernel-user API by which user space code can define the
partitioning.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.925.600.0401