Re: [PATCH 01/10] cgroup/cpuset: Fix race between newly created partition and dying one

From: Waiman Long
Date: Tue Apr 01 2025 - 16:56:57 EST



On 4/1/25 4:41 PM, Waiman Long wrote:

On 4/1/25 3:59 PM, Tejun Heo wrote:
Hello, Waiman.

On Mon, Mar 31, 2025 at 11:12:06PM -0400, Waiman Long wrote:
The problem is the RCU delay between the time a cgroup is killed and is in a
dying state and when the partition is deactivated when cpuset_css_offline()
is called. That delay can be rather lengthy depending on the current
workload.

If we don't have to do it too often, synchronize_rcu_expedited() may be
workable too. What do you think?

I don't think we ever call synchronize_rcu() in the cgroup code except for rstat flush. In fact, we didn't use to have an easy way to know if there were dying cpusets hanging around. Now we can probably use the root cgroup's nr_dying_subsys[cpuset_cgrp_id] to know whether we need a synchronize_rcu*() call to wait for them. However, I still need to check whether there is any race window that would cause us to miss one.
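
As a rough illustration of the check described above, here is a minimal userspace sketch, not kernel code. The struct model_root type, the MODEL_* constants and model_need_rcu_wait() are hypothetical names made up for this example; only the nr_dying_subsys[] counter idea comes from the discussion. It does not address the possible race window mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the cpuset subsystem index. */
enum { MODEL_CPUSET_ID = 0, MODEL_SUBSYS_COUNT = 1 };

/* Hypothetical model of the root cgroup's per-subsystem dying counters. */
struct model_root {
    int nr_dying_subsys[MODEL_SUBSYS_COUNT];
};

/*
 * Return true only if dying cpusets are still around, i.e. only then
 * would a synchronize_rcu*()-style wait be needed at all.
 */
static bool model_need_rcu_wait(const struct model_root *root)
{
    assert(root != NULL);
    return root->nr_dying_subsys[MODEL_CPUSET_ID] > 0;
}
```

The point of the sketch is only that the wait can be skipped entirely in the common case where the counter is zero.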

Sorry, I don't think I can use synchronize_rcu_expedited(), as the use case that I see most often is the creation of isolated partitions running latency-sensitive applications like DPDK. Using synchronize_rcu_expedited() will send IPIs to all the CPUs, which may break the latency guarantee those applications require. Just using synchronize_rcu(), however, has unpredictable latency, which hurts the user experience.



Another alternative that I can think of is to scan the remote partition list
(for a remote partition) or the sibling cpusets (for a local partition)
whenever a conflict is detected while enabling a partition. When a dying
cpuset partition is found, deactivate it immediately to resolve the
conflict. Otherwise, the dying partition will still be deactivated at
cpuset_css_offline() time.
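
The scan-on-conflict idea can be sketched as a small userspace model. This is not the kernel implementation: struct model_cpuset, model_deactivate() and model_enable_partition() are hypothetical names for illustration, CPUs are modelled as a plain bitmask, and only the sibling (local partition) case is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's struct cpuset. */
struct model_cpuset {
    unsigned long excl_cpus; /* bitmask of CPUs claimed exclusively */
    bool is_partition;       /* currently an active partition root */
    bool dying;              /* killed, but offline callback not yet run */
};

/* Deactivate a partition and release its exclusive CPUs. */
static void model_deactivate(struct model_cpuset *cs)
{
    assert(cs != NULL);
    cs->is_partition = false;
    cs->excl_cpus = 0;
}

/*
 * Try to enable a new partition on the CPUs in `want` among `n` siblings.
 * A conflict with a dying partition is resolved by deactivating it right
 * away; a conflict with a live partition fails the request. Dying
 * partitions deactivated before a failure stay deactivated, which is
 * harmless because they would be torn down later anyway.
 */
static bool model_enable_partition(struct model_cpuset *sibs, size_t n,
                                   unsigned long want)
{
    for (size_t i = 0; i < n; i++) {
        struct model_cpuset *s = &sibs[i];

        if (!s->is_partition || !(s->excl_cpus & want))
            continue;        /* no conflict with this sibling */
        if (!s->dying)
            return false;    /* genuine conflict with a live partition */
        model_deactivate(s); /* stale conflict: resolve it now */
    }
    return true;
}
```

In this model the normal enable path pays nothing extra; the scan only does work when a conflict is actually found.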

That will be a bit more complex, but I think it can still solve the problem
without adding a new method. What do you think? If you are OK with that, I
will send out a new patch later this week.

If synchronize_rcu_expedited() won't do, let's go with the original patch.
The operation does make general sense in that it's for a distinctive step in
the destruction process although I'm a bit curious why it's called before
DYING is set.

Because of the above, I still prefer either using the original patch or scanning for dying cpuset partitions when a conflict is detected. Please let me know what you think.

Thanks,
Longman