Re: [RFC] cpuset: add interface to isolated cpus
From: Paul Jackson
Date: Thu Oct 19 2006 - 23:38:27 EST
> > 1) If your other patch to manipulate sched domains
> > has code that belongs in kernel/cpuset.c, and
> > special files that belong in /dev/cpuset, why
> > shouldn't this one naturally go in the same places?
>
> Because they are cpuset specific. This is not.
Bizarre. That other patch of yours, that had cpu_exclusive cpusets
only at the top level defining sched domain partitions, is cpuset
specific only because you're hijacking the cpu_exclusive flag to add
new semantics to it, and then claiming you're not adding new semantics
to it, and then claiming we should be doing this all implicitly without
additional kernel-user API apparatus, and then claiming that if your
new abuse of cpuset flags intrudes on other usage patterns of cpusets
or if they have need of a particular sched domain partitioning that
they have to explicitly set up top level cpusets to get a desired sched
domain partitioning ...
Stop already ... this is getting insane.
Where are we?
By now, I think everyone (that is still following this thread ;)
seems to agree that the current code linking cpu_exclusive and
sched domain partitions is borked.
I responded to that realization earlier this week by proposing
a patch to add new sched_domain flags to each cpuset.
Suresh responded to this realization by writing:
> ...I don't know much about how job manager interacts with
> cpusets but this behavior sounds bad to me.
You responded by proposing a patch to use just the cpu_exclusive
flags in the top level cpusets to define sched domain partitions.
The current code linking cpu_exclusive and sched domain partitions
is borked, and serving almost no purpose. One of its goals was to
provide a way to workaround the scaling issues with the scheduler
load balancer code on humongous CPU count systems. This current code
hasn't helped us a bit there. We're starting to address these scaling
issues in various other ways.
So far as I can tell, the only actual successful application of this
current code to manipulate sched domain partitions has been by a real
time group, who have managed to isolate CPUs from load balancing with
this.
Everyone rejects the current code, and so far, for every replacement
suggested, at least one person rejects it. You don't like my patches,
and I don't like yours.
Sounds to me like we don't have concensus yet ;). Agreed?
And its not just the code that would try to define sched domains
via some implicit or explicit cpuset hooks that is in question here.
The sched domain related code in kernel/sched.c is overly complicated
and ifdef'd. I can't personally critique that code, because my brain
is too small. After repeated efforts to make sense of it, I still don't
understand it. But I have it on good authority from colleagues whose
brains are bigger than mine that this code could use an overhaul.
So ... not only do we not have concensus on the specific patch to control
the definition of sched domain partitions, this is moreover part of a larger
ball of yarn that should be unraveled.
We'd be fools to try to settle on the specifics of the (possibly cpuset
related) API to define sched domain partitions in isolation. We should
do that as part of such an overhaul of the rest of the sched domain code.
The current implicit hooks between cpu_exclusive cpusets and sched domain
partitions should be removed, as proposed in my patch:
[RFC] cpuset: remove sched domain hooks from cpusets
I will submit this patch to Andrew for inclusion in *-mm. It doesn't
change any explicit kernel API's - just removes the sched domain
side affects of the cpu_exclusive flag, which are currently borked
and more dangerous than useful. So long as any rework we do provides
some way to isolate CPUs from load balancing sometime in the next
several months, before whenever the one real time group doing this
needs to see it, no one will miss these current sched domain hooks in
cpusets.
I will agree to NAK your (Nick's) patch to overload the cpu_exclusive
flags in the top level cpusets with sched domain partitioning side
affects, if you agree to NAK my patches to create an 'isolated_cpus'
file in the top cpuset.
My colleague Christoph Lameter will be looking into overhauling some of
this the sched domain code. I am firmly convinced that any kernel
API's, implicit or explicit, cpuset related or not, involving these
sched domains and their partitioning should arise naturally out of that
effort, and not be band-aided on ahead of time.
Let's agree to give Christoph a clean shot at this, and join him in
this effort where we can.
Thanks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/