Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads

From: escape
Date: Thu Sep 04 2025 - 23:45:20 EST


Hello,

在 2025/9/5 10:27, Tejun Heo 写道:
Hello,

On Fri, Sep 05, 2025 at 10:16:30AM +0800, escape wrote:
+ if (have_favordynmods)
+ up_read(&tsk->signal->group_rwsem);
percpu_up_read(&cgroup_threadgroup_rwsem);
Hmm... I wonder whether turning on/off the flag is racy. ie. what prevents
have_favordynmods flipping while a task is between change_begin and end?
have_favordynmods is read-only after initialization and will not change
during runtime.
I don't think that's necessarily true. favordynmods can also be specified as
a mount option and mount can race against forks, execs and exits. Also,
IIRC, cgroup2 doesn't allow remounts but there's nothing preventing someone
from unmounting and mounting it again with different options.
There are two ways to set favordynmods. The first is through kernel config or
the kernel command line parameter cgroup_favordynmods, which sets the value of
the variable have_favordynmods and automatically sets the CGRP_ROOT_FAVOR_DYNMODS
flag on cgroup_root->flags at mount time. The second is by passing a mount option,
which only sets the CGRP_ROOT_FAVOR_DYNMODS flag on cgroup_root->flags during mount.

In this patch, the decision whether to use the per-threadgroup rwsem is based on
have_favordynmods, not on cgroup_root->flags. An umount & mount affects cgroup_root->flags,
but does not change have_favordynmods.

Indeed, there is a minor flaw here: if favordynmods is enabled via a mount option rather
than the kernel command line, the per-threadgroup rwsem will not take effect.

To avoid the complexity of runtime changes to favordynmods, perhaps a better approach
would be to introduce a separate control variable, configured via a kernel command line
parameter such as cgroup_migration_lock=[global|thread_group] to explicitly govern
this behavior. What do you think about this approach?
@@ -3010,15 +3008,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
*/
if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
tsk = ERR_PTR(-EINVAL);
- goto out_unlock_threadgroup;
+ goto out_unlock_rcu;
}
-
get_task_struct(tsk);
- goto out_unlock_rcu;
-out_unlock_threadgroup:
- cgroup_attach_unlock(*threadgroup_locked);
- *threadgroup_locked = false;
+ rcu_read_unlock();
+
+ /*
+ * If we migrate a single thread, we don't care about threadgroup
+ * stability. If the thread is `current`, it won't exit(2) under our
+ * hands or change PID through exec(2). We exclude
+ * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
+ * callers by cgroup_mutex.
+ * Therefore, we can skip the global lock.
+ */
+ lockdep_assert_held(&cgroup_mutex);
+ *threadgroup_locked = pid || threadgroup;
+
+ cgroup_attach_lock(tsk, *threadgroup_locked);
I'm not sure this relocation is safe. What prevents e.g. @tsk changing its
group leader or signal struct before lock is grabbed?
When a non-leader thread in a thread group executes the exec system call,
the thread group leader is updated, but the signal_struct remains unchanged,
so this part is safe.
But the leader can change, right? So, we can end up in a situation where
threadgroup is set but the task is not the leader which I think can lead to
really subtle incorrect behaviors like write succeeding but nothing
happening when racing against exec.

Thanks.

You are right. If the thread group leader changes, the task passed to
cgroup_attach_task may not be the leader, which could lead to errors when
iterating through all threads in while_each_thread.


Similar to commit b78949ebfb56 ("cgroup: simplify double-check locking in
cgroup_attach_proc"), it should check whether the leader has changed after
acquiring the lock, to determine if a retry is needed. I will fix this in
the next version patch.

Thanks.