Re: [PATCH 01/10] cgroup/cpuset: Fix race between newly created partition and dying one
From: Waiman Long
Date: Mon Mar 31 2025 - 23:12:23 EST
On 3/31/25 7:13 PM, Tejun Heo wrote:
> Hello,
>
> On Sun, Mar 30, 2025 at 05:52:39PM -0400, Waiman Long wrote:
> ...
>> One possible way to fix this is to iterate the dying cpusets as well and
>> avoid using the exclusive CPUs in those dying cpusets. However, this
>> can still cause random partition creation failures or other anomalies
>> due to racing. A better way to fix this race is to reset the partition
>> state at the moment when a cpuset is being killed.
>
> I'm not a big fan of adding another method call in the destruction path.
> css_offline() is where the kill can be seen from all CPUs and notified to
> the controller and I'm not sure why bringing it sooner would be necessary to
> close the race window. Can't the creation side drain the cgroups that are
> going down if the asynchronous part is a problem? e.g. We already have
> cgroup_lock_and_drain_offline() which isn't the most scalable thing but
> partition operations aren't very frequent, right? And if that's a problem,
> there should be a way to make it reasonably quicker.
The problem is the RCU delay between the time a cgroup is killed and
enters the dying state, and the time cpuset_css_offline() is called to
deactivate its partition. That delay can be rather lengthy depending on
the current workload.
Another alternative I can think of is to scan the remote partition
list (for a remote partition) or the sibling cpusets (for a local
partition) whenever a conflict is detected while enabling a partition.
When a dying cpuset partition is found, deactivate it immediately to
resolve the conflict. Otherwise, the dying partition will still be
deactivated at cpuset_css_offline() time.
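A rough sketch of that conflict-resolution scan for the local-partition
case might look like the following. This is only an illustration, not the
actual patch: cpuset_for_each_child(), css_is_dying() and
is_partition_valid() are existing kernel helpers, while
reset_dying_partition() is a hypothetical placeholder for whatever
deactivation logic the real patch would use:

```c
/*
 * Sketch only -- not the actual patch. When enabling a local
 * partition conflicts with a sibling's exclusive CPUs, check
 * whether that sibling is a dying partition root and, if so,
 * deactivate it right away instead of waiting for the RCU-delayed
 * cpuset_css_offline() call.
 */
static void resolve_dying_partition_conflict(struct cpuset *parent)
{
	struct cpuset *sibling;
	struct cgroup_subsys_state *pos_css;

	rcu_read_lock();
	cpuset_for_each_child(sibling, pos_css, parent) {
		if (!is_partition_valid(sibling))
			continue;
		if (css_is_dying(&sibling->css)) {
			/*
			 * reset_dying_partition() is hypothetical;
			 * it stands in for code that clears the
			 * sibling's partition state and returns its
			 * exclusive CPUs to the parent.
			 */
			reset_dying_partition(sibling);
		}
	}
	rcu_read_unlock();
}
```

The remote-partition case would do a similar walk over the remote
partition list instead of the siblings.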
That will be a bit more complex, but I think it can still solve the
problem without adding a new method. What do you think? If you are OK
with that, I will send out a new patch later this week.
Thanks,
Longman