Re: [PATCH v3 5/5] cpusets, suspend: Save and restore cpusets during suspend/resume

From: David Rientjes
Date: Tue May 15 2012 - 14:31:20 EST

On Mon, 14 May 2012, Nishanth Aravamudan wrote:

> > If you do set_mempolicy(MPOL_BIND, 2-3) to bind a thread to nodes 2-3
> > that is attached to a cpuset whereas cpuset.mems == 2-3, and then
> > cpuset.mems changes to 0-1, what is the expected behavior? Do we
> > immediately oom on the next allocation? If cpuset.mems is set again
> > to 2-3, what's the desired behavior?
> "expected [or desired] behavior" always makes me cringe. It's usually
> some insane user-level expectations that don't really make sense :).

Yeah, and I think we should be moving in a direction where this behavior
is defined so that nobody can expect anything else.

> Cpusets are integrated with the sched_setaffinity(2) scheduling
> affinity mechanism and the mbind(2) and set_mempolicy(2)
> memory-placement mechanisms in the kernel. Neither of these
> mechanisms let a process make use of a CPU or memory node that
> is not allowed by that process's cpuset. If changes to a
> process's cpuset placement conflict with these other mechanisms,
> then cpuset placement is enforced even if it means overriding
> these other mechanisms.

This makes perfect sense because an admin wants to be able to move the
cpuset placement of a thread regardless of whether that thread did
sched_setaffinity() or mbind() itself so that it is running on a set of
isolated nodes that have affinity to its cpus. I agree that cpusets
should always take precedence.

However, if a thread did set_mempolicy(MPOL_BIND, 2-3) where cpuset.mems
== node_online_map, cpuset.mems changes to 0-1, then cpuset.mems changes
back to node_online_map, then I believe (and I implemented this in the
mempolicy code and specified it in the man page) that the thread should
once again be bound to nodes 2-3.
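
To make that sequence concrete, here is a minimal sketch in plain userspace C.
It is a toy model, not kernel code: toy_mempolicy, toy_rebind() and the other
toy_* names are hypothetical and exist only for this illustration. The idea it
shows is the one described above: keep the nodemask the thread asked for, and
recompute the usable nodes from it every time cpuset.mems changes. What
happens when the two masks do not intersect is an assumption of the sketch
(it falls back to the cpuset's nodes), not a statement about what the kernel
does.

```c
#include <assert.h>
#include <stdint.h>

/* Toy nodemask: bit n set means NUMA node n is included (at most 64 nodes). */
typedef uint64_t toy_nodemask_t;

/* Hypothetical stand-in for a thread's memory policy (not struct mempolicy). */
struct toy_mempolicy {
    toy_nodemask_t user_nodes;      /* what the thread asked for; never overwritten by rebinds */
    toy_nodemask_t effective_nodes; /* what allocations may use right now */
};

/* Build a mask covering nodes lo..hi inclusive. */
static toy_nodemask_t toy_node_range(unsigned lo, unsigned hi)
{
    assert(lo <= hi && hi < 64);
    toy_nodemask_t mask = 0;
    for (unsigned n = lo; n <= hi; n++)
        mask |= (toy_nodemask_t)1 << n;
    return mask;
}

/* Recompute the effective nodes after cpuset.mems changed. */
static void toy_rebind(struct toy_mempolicy *pol, toy_nodemask_t cpuset_mems)
{
    toy_nodemask_t allowed = pol->user_nodes & cpuset_mems;

    /*
     * ASSUMPTION of this sketch: if nothing the thread asked for is
     * currently allowed, use whatever the cpuset allows, so the cpuset
     * always wins.
     */
    pol->effective_nodes = allowed ? allowed : cpuset_mems;
}

/* Toy analogue of set_mempolicy(MPOL_BIND, nodes) inside a given cpuset. */
static void toy_set_mempolicy_bind(struct toy_mempolicy *pol,
                                   toy_nodemask_t nodes,
                                   toy_nodemask_t cpuset_mems)
{
    pol->user_nodes = nodes;
    toy_rebind(pol, cpuset_mems);
}
```

With a four-node machine, binding to 2-3 while cpuset.mems is 0-3, then
calling toy_rebind() with 0-1 and again with 0-3, leaves user_nodes at 2-3
throughout and brings effective_nodes back to 2-3 at the end.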

> > I fixed this problem by introducing MPOL_F_* flags in set_mempolicy(2)
> > by saving the user intended nodemask passed by set_mempolicy() and
> > respecting it whenever allowed by cpusets.
> So, if you read that thread, this is what (in essence) Srivatsa proposed
> in v2. We store the user-defined cpumask and keep it regardless of
> kernel decisions. We intersect the user-defined cpumask with the kernel
> (which is really reflecting the administrator's hotplug decisions)
> topology and run tasks in constrained cpusets on the result. We reflect
> this decision in a new read-only file in each cpuset that indicates the
> "actual" cpus that a task in a given cpuset may be scheduled on.

I don't think we need a new read-only file that exposes the stored
cpumask; I think it should be stored and respected when possible, and the
set of allowed cpus should be exported in the way it always has been, through

> But PeterZ nack-ed it and his reasoning was sound -- CPU (and memory, I
> would think) hotplug is a necessarily destructive behavior.

From a thread perspective, how is hot-removing a node different from
clearing the node's bit in cpuset.mems?

How is hot-adding a node different from setting the node's bit in

> > Right now, the behavior of what happens for a cpuset where cpuset.cpus ==
> > 2-3 and then cpus 2-3 go offline and then are brought back online is
> > undefined.
> Erm, no it's rather clearly defined by what actually happens. It may not
> be "specified" in a formal document, but behavior is a heckuva thing.

"Undefined" in the sense that there's no formal specification for what the
behavior is; of course it has a current behavior just like gcc compiles
1-bit int fields to be signed although its behavior is undefined. You'll
be defining the behavior with this patchset.

> What happens is that the offlining process pushes the tasks in that
> constrained cpuset up into the parent cpuset (actually moves them). In a
> suspend case, since we're offlining all CPUs, this results in all tasks
> being pushed up to the root cpuset.
> I would also quote `man cpuset` here to actually say the behavior is
> "specified", technically:
> If hot-plug functionality is used to remove all the CPUs that
> are currently assigned to a cpuset, then the kernel will
> automatically update the cpus_allowed of all processes attached
> to CPUs in that cpuset to allow all CPUs.

Right, and that's consistent because the root cpuset requires all cpus.

> > The same is true of cpuset.cpus during resume. So if you're going to
> > add a cpumask to struct cpuset, then why not respect it for all
> > offline events and get rid of all this specialized suspend-only stuff?
> > It's very simple to make this consistent across all cpu hotplug events
> > and build suspend on top of it from a cpuset perspective.
> "simple" -- sure. Read v2 of the patchset, as I said. But then read all
> the discussion that follows and I think you will see that this has been
> hashed out before with similar reasoning on both sides, and that the
> policy side of things is not obviously simple. The resulting decision
> was to special-case suspend, but not "remember" state across other
> hotplug actions, which is more of an "unintentional hotplug" (and from
> what Paul McKenney mentions in that thread, sounds like tglx is working
> on patches to remove the full hotplug usage from s/r).

We're talking about two very different things. The suspend case is
special in this regard _only_ because it moves all threads to the root
cpuset and obviously you can't have a user-specified cpumask for the root
cpuset. That's irrelevant to my question about why we aren't storing the
user-specified cpumask in all non-root cpusets, which certainly remains
consistent even with suspend since these non-root cpusets cease to exist.

If a cpuset is defined to have cpuset.cpus == 2-3, cpu 3 is offlined, and
then cpu 3 is onlined, the behavior is currently undefined. You could
make the argument that cpusets is purely about NUMA and that cpu 3 may no
longer have affinity to cpuset.mems in which case I would agree that we
should not reset cpuset.cpus to 2-3 in this case. But that doesn't seem
to be the motivation here because we keep talking about suspend.
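
As a rough illustration of the alternative I am asking about, here is a
minimal userspace C sketch of a non-root cpuset that stores the
user-specified cpumask and derives the usable cpus from it on every hotplug
event. Nothing here is kernel code: toy_cpuset and the toy_* helpers are
hypothetical names, and the sketch deliberately ignores the NUMA affinity
question raised above as well as what to do when no cpus remain.

```c
#include <assert.h>
#include <stdint.h>

/* Toy cpumask: bit n set means CPU n is included (at most 64 CPUs). */
typedef uint64_t toy_cpumask_t;

/* Hypothetical stand-in for a non-root cpuset (not struct cpuset). */
struct toy_cpuset {
    toy_cpumask_t user_cpus;      /* what was written to cpuset.cpus; kept across hotplug */
    toy_cpumask_t effective_cpus; /* user_cpus restricted to CPUs that are online */
};

/* Single-CPU mask. */
static toy_cpumask_t toy_cpu_bit(unsigned cpu)
{
    assert(cpu < 64);
    return (toy_cpumask_t)1 << cpu;
}

/* Recompute the usable CPUs from the stored mask and the online map. */
static void toy_cpuset_update(struct toy_cpuset *cs, toy_cpumask_t online)
{
    /*
     * An empty result is left empty here; handling that case (for
     * example, moving tasks elsewhere) is outside this sketch.
     */
    cs->effective_cpus = cs->user_cpus & online;
}

/* Take a CPU offline and return the new online map. */
static toy_cpumask_t toy_cpu_offline(struct toy_cpuset *cs,
                                     toy_cpumask_t online, unsigned cpu)
{
    online &= ~toy_cpu_bit(cpu);
    toy_cpuset_update(cs, online);
    return online;
}

/* Bring a CPU online and return the new online map. */
static toy_cpumask_t toy_cpu_online(struct toy_cpuset *cs,
                                    toy_cpumask_t online, unsigned cpu)
{
    online |= toy_cpu_bit(cpu);
    toy_cpuset_update(cs, online);
    return online;
}
```

For a cpuset with user_cpus == 2-3 on a four-cpu machine, offlining cpu 3
shrinks effective_cpus to cpu 2 only, and onlining cpu 3 again restores 2-3
without anyone having to rewrite cpuset.cpus, because user_cpus was never
modified.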