Re: exclusive cpusets broken with cpu hotplug

From: Paul Jackson
Date: Wed Oct 18 2006 - 17:08:34 EST


> You do, however, hopefully have enough information to create the
> calls you would make to partition_sched_domain if each had their
> cpu_exclusive flags cleared. Essentially, what I am proposing is
> making all the calls as if the user had cleared each flag as the
> remove/add starts, and then behaving as if each was set again.

Yes - hopefully we have enough information to rebuild the sched domains
each time, consistently. And your proposal is probably an improvement
for that reason.

However, I'm afraid that only solves half the problem. It makes the
sched domains more repeatable and predictable. But I'm worried that
the cpuset control over sched domains is still broken; see the
example below.

I've half a mind to prepare a patch to just rip out the sched domain
defining code from kernel/cpuset.c, completely uncoupling the
cpu_exclusive flag, and any other cpuset flags, from sched domains.

Example:

As best as I can tell (which is not very far ;), if some hapless
user does the following:

/dev/cpuset cpu_exclusive == 1; cpus == 0-7
/dev/cpuset/a cpu_exclusive == 1; cpus == 0-3
/dev/cpuset/b cpu_exclusive == 1; cpus == 4-7

and then runs a big job in the top cpuset (/dev/cpuset), then that
big job will not load balance correctly: whatever threads in the
big job land on cpus 0-3 will be isolated from whatever threads
land on cpus 4-7.
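For concreteness, the hapless user's setup above could be reproduced
roughly as follows. This is only a sketch: it assumes root, a kernel
with CONFIG_CPUSETS, and that the cpuset pseudo-filesystem is mounted
at /dev/cpuset (the mount line is included in case it is not).

```shell
# Mount the cpuset filesystem, if not already mounted.
mount -t cpuset none /dev/cpuset

# Top cpuset: all 8 cpus, marked cpu_exclusive.
echo 1 > /dev/cpuset/cpu_exclusive

# Child 'a': cpus 0-3, cpu_exclusive.
mkdir /dev/cpuset/a
echo 0-3 > /dev/cpuset/a/cpus
echo 1   > /dev/cpuset/a/cpu_exclusive

# Child 'b': cpus 4-7, cpu_exclusive.
mkdir /dev/cpuset/b
echo 4-7 > /dev/cpuset/b/cpus
echo 1   > /dev/cpuset/b/cpu_exclusive

# Any multi-threaded job left running in the top cpuset now spans
# two separate sched domains (0-3 and 4-7), so the scheduler will
# not balance its threads across that boundary.
```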

Is this correct?

If so, there is no practical way that I can see on a production system
for the system admin to realize they have messed up their system this
way.

If we can't make this work properly automatically, then we either need
to provide users the visibility and control to make it work by explicit
manual control (meaning my 'sched_domain' flag patch, plus some way of
exporting the sched domain topology in /sys), or we need to stop doing
this.

If the above example is not correct, then I'm afraid my education in
sched domains is in need of another lesson.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.925.600.0401