Re: [PATCH v8 3/6] cpuset: Add cpuset.sched.load_balance flag to v2
From: Waiman Long
Date: Mon May 28 2018 - 14:31:58 EST
On 05/28/2018 08:45 AM, Peter Zijlstra wrote:
> On Thu, May 24, 2018 at 02:55:25PM -0400, Waiman Long wrote:
>> On 05/24/2018 11:43 AM, Peter Zijlstra wrote:
>>> I'm confused... why exactly do we have both domain and load_balance ?
>> The domain is for partitioning the CPUs only. It doesn't change the load
>> balancing state. So the load_balance flag is still needed to turn on and
>> off load balancing.
> OK, so we have two boolean flags, giving 4 possible states. Let's just
> go through them one by one:
>
> A) domain:0 load_balance:0 -- we have no exclusive domain, but have
> load-balancing disabled across them. AFAICT this should be an invalid
> state.
>
> B) domain:0 load_balance:1 -- we have no exclusive domain, but have
> load-balancing enabled. AFAICT this is the default state and is a
> no-op.
>
> C) domain:1 load_balance:0 -- we have an exclusive domain, and have
> load-balancing disabled across it. This is, AFAICT, identical to
> having a bunch of sub/sibling groups each with a single CPU domain.
>
> D) domain:1 load_balance:1 -- we have an exclusive domain, and have
> load-balancing enabled. This is a partition.
>
> Now, I think I've overlooked the fact that load_balance==1 only really
> means something when the parent's load_balance==0, but I'm not sure that
> really changes anything.
>
> So, afaict, the above only have two useful states: B and D. Which again
> raises the question, why two knobs? What useful configurations does it
> allow?
I am working on the v9 patch, and below is the current draft of the
documentation. Hopefully that will clarify some of the concepts that we
are discussing here.
cpuset.sched.domain_root
A read-write single value file which exists on non-root
cpuset-enabled cgroups. It is a binary value flag that accepts
either "0" (off) or "1" (on). This flag is set by the parent
and is not delegatable.
If set, it indicates that the current cgroup is the root of a
new scheduling domain or partition that comprises itself and
all its descendants except those that are scheduling domain
roots themselves and their descendants. The root cgroup is
always a scheduling domain root.
There are constraints on where this flag can be set. It can
only be set in a cgroup if all the following conditions are true.
1) The "cpuset.cpus" is not empty and its list of CPUs is
exclusive, i.e. they are not shared by any of its siblings.
2) The parent cgroup is also a scheduling domain root.
3) There are no child cgroups with cpuset enabled. This is
for eliminating corner cases that have to be handled if such
a condition is allowed.
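As a rough illustration only, the three conditions above could be checked along these lines. This is a toy Python model, not kernel code; the Cpuset class and can_set_domain_root() are names made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Cpuset:
    """Toy stand-in for a cpuset-enabled cgroup (hypothetical model)."""
    name: str
    cpus: Set[int] = field(default_factory=set)
    domain_root: bool = False
    parent: Optional["Cpuset"] = None
    children: List["Cpuset"] = field(default_factory=list)

    def add_child(self, child: "Cpuset") -> "Cpuset":
        child.parent = self
        self.children.append(child)
        return child


def can_set_domain_root(cs: Cpuset) -> bool:
    """Return True if all three documented conditions hold for cs."""
    if cs.parent is None:
        return False  # the root cgroup is always a domain root already
    siblings = [s for s in cs.parent.children if s is not cs]
    # 1) cpuset.cpus is non-empty and not shared with any sibling
    exclusive = bool(cs.cpus) and all(cs.cpus.isdisjoint(s.cpus) for s in siblings)
    # 2) the parent is itself a scheduling domain root
    parent_ok = cs.parent.domain_root
    # 3) there are no cpuset-enabled children yet
    no_children = not cs.children
    return exclusive and parent_ok and no_children


root = Cpuset("root", cpus={0, 1, 2, 3}, domain_root=True)
a = root.add_child(Cpuset("a", cpus={0, 1}))
b = root.add_child(Cpuset("b", cpus={1, 2}))
print(can_set_domain_root(a))  # False: CPU 1 is shared with sibling b
b.cpus = {2, 3}
print(can_set_domain_root(a))  # True: a's CPUs are now exclusive
```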
Setting this flag will take the CPUs away from the effective
CPUs of the parent cgroup. Once it is set, this flag cannot
be cleared if there are any child cgroups with cpuset enabled.
Further changes made to "cpuset.cpus" are allowed as long as
the first condition above is still true.
A parent scheduling domain root cgroup cannot distribute all
its CPUs to its child scheduling domain root cgroups unless
its load balancing flag is turned off.
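To make the CPU accounting concrete, here is a minimal sketch under my reading of the two paragraphs above: child domain roots take their CPUs out of the parent's effective CPUs, and the parent may only end up with none left if its load balancing is off. The function names are hypothetical and this is not how the kernel code is structured.

```python
from typing import Dict, Set


def parent_effective_cpus(parent_cpus: Set[int],
                          child_domain_roots: Dict[str, Set[int]]) -> Set[int]:
    """CPUs left to the parent after its child domain roots take theirs."""
    taken = set().union(*child_domain_roots.values())
    return parent_cpus - taken


def distribution_allowed(parent_cpus: Set[int],
                         child_domain_roots: Dict[str, Set[int]],
                         parent_load_balance: bool) -> bool:
    """Giving away every CPU is only acceptable with load balancing off."""
    remaining = parent_effective_cpus(parent_cpus, child_domain_roots)
    return bool(remaining) or not parent_load_balance


parent = {0, 1, 2, 3}
print(parent_effective_cpus(parent, {"a": {0, 1}}))  # {2, 3}

everything = {"a": {0, 1}, "b": {2, 3}}
print(distribution_allowed(parent, everything, parent_load_balance=True))   # False
print(distribution_allowed(parent, everything, parent_load_balance=False))  # True
```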
cpuset.sched.load_balance
A read-write single value file which exists on non-root
cpuset-enabled cgroups. It is a binary value flag that accepts
either "0" (off) or "1" (on). This flag is set by the parent
and is not delegatable. It is on by default in the root cgroup.
When it is on, tasks within this cpuset will be load-balanced
by the kernel scheduler. Tasks will periodically be moved from
CPUs with high load to other CPUs with less load within the
same cpuset.
When it is off, there will be no load balancing among the CPUs
in this cgroup. Tasks will stay on the CPUs they are running on
and will not be moved to other CPUs.
The load balancing state of a cgroup can only be changed on a
scheduling domain root cgroup with no cpuset-enabled children.
All cgroups within a scheduling domain or partition must have
the same load balancing state. As descendant cgroups of a
scheduling domain root are created, they inherit the load
balancing state of their root.
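A small sketch of the rules in the last paragraph, again as a toy Python model rather than kernel code. The Cgroup class and its methods are invented for the example, and it simply assumes the domain_root conditions were already satisfied.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cgroup:
    """Toy model of a cpuset-enabled cgroup (hypothetical)."""
    name: str
    domain_root: bool = False
    load_balance: bool = True
    parent: Optional["Cgroup"] = None
    children: List["Cgroup"] = field(default_factory=list)

    def create_child(self, name: str) -> "Cgroup":
        # A new descendant inherits the load balancing state of its partition.
        child = Cgroup(name, load_balance=self.load_balance, parent=self)
        self.children.append(child)
        return child

    def set_load_balance(self, value: bool) -> None:
        # Only a domain root with no cpuset-enabled children may change it.
        if not self.domain_root:
            raise ValueError(f"{self.name}: not a scheduling domain root")
        if self.children:
            raise ValueError(f"{self.name}: has cpuset-enabled children")
        self.load_balance = value


root = Cgroup("root", domain_root=True)
part = root.create_child("part")
part.domain_root = True        # assume the domain_root conditions were met
part.set_load_balance(False)   # allowed: domain root, no children yet
leaf = part.create_child("leaf")
print(leaf.load_balance)       # False, inherited from the partition

try:
    part.set_load_balance(True)
except ValueError as err:
    print("rejected:", err)    # part now has a cpuset-enabled child
```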
The main purpose of using a new domain_root flag is to enable users to
create new partitions without the trick of disabling load_balance in the
parent and enabling it in the child. Now, we can create as many
partitions as we want without ever turning off load balancing in any of
the cpusets. I find it to be more straightforward and easier to
understand than using the load_balance trick.
Of course, turning off load balancing is still useful in some use cases,
so it is supported. To simplify things, it is mandated that all the
cpusets within a partition must have the same load balancing state. This
is to ensure that we can't use the load_balance trick to create
additional partitions underneath it. The domain_root flag is the only way
to create a partition.
A) domain_root: 0, load_balance: 0 -- a non-domain root cpuset within a
no load balancing partition.
B) domain_root: 0, load_balance: 1 -- a non-domain root cpuset within a
load balancing partition.
C) domain_root: 1, load_balance: 0 -- a domain root cpuset of a no load
balancing partition.
D) domain_root: 1, load_balance: 1 -- a domain root cpuset of a load
balancing partition.
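The same four combinations, spelled out as a tiny Python helper purely for illustration (describe() is a made-up name, not something in the patch):

```python
def describe(domain_root: bool, load_balance: bool) -> str:
    """Map the two flags to the meaning listed in A) through D)."""
    role = ("domain root cpuset of" if domain_root
            else "non-domain root cpuset within")
    kind = ("a load balancing partition" if load_balance
            else "a no load balancing partition")
    return f"{role} {kind}"


for dr in (0, 1):
    for lb in (0, 1):
        print(f"domain_root: {dr}, load_balance: {lb} -> "
              f"{describe(bool(dr), bool(lb))}")
```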
Hope this helps.
Cheers,
Longman