Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
From: Mathieu Desnoyers
Date: Tue Nov 11 2025 - 11:42:33 EST
On 2025-11-10 09:23, Mathieu Desnoyers wrote:
> On 2025-11-06 12:28, Prakash Sangappa wrote:
> [...]
>> Hit this watchdog panic.
>> Using following tree. Assume this is the latest:
>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>> Appears to be spinning in mm_get_cid(). Must be the mm_cid changes.
>> https://lore.kernel.org/all/20251029123717.886619142@xxxxxxxxxxxxx/
> When this happened during the development of the "complex" mm_cid
> scheme, this was typically caused by a stale mm_cid being kept around
> by a task even though it was not actually scheduled, thus causing
> over-reservation of concurrency IDs beyond the max_cids threshold. This
> ends up looping in:
>
> static inline unsigned int mm_get_cid(struct mm_struct *mm)
> {
>         unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
>
>         while (cid == MM_CID_UNSET) {
>                 cpu_relax();
>                 cid = __mm_get_cid(mm, num_possible_cpus());
>         }
>         return cid;
> }
> Based on the stack trace you provided, it seems to happen in
> sched_mm_cid_fork() within copy_process(), so perhaps it's simply an
> initialization issue in fork, or an issue when cloning a new thread?
I've spent some time digging through Thomas' implementation of
mm_cid management. I've spotted something which may explain
the watchdog panic. Here is the scenario:
1) A process is constrained to a subset of the possible CPUs,
and has enough threads to swap from per-thread to per-cpu mm_cid
mode. It runs happily in that per-cpu mode.
2) The number of allowed CPUs is increased for the process, thus invoking
mm_update_cpus_allowed(). This switches the mode back to per-thread,
but delays invocation of mm_cid_work_fn() to some point in the future,
in thread context, through irq_work + schedule_work().
At that point, because mm_update_cpus_allowed() only called
__mm_update_max_cids(), mc->max_cids is updated, but mc->transit
is still zero.
Also, until mm_cid_fixup_cpus_to_tasks() is invoked (by the scheduled
work, near the end of sched_mm_cid_fork(), or by sched_mm_cid_exit()),
we are in a state where mm_cids are still owned by CPUs, even though
we are now in per-thread mm_cid mode, which means that the
mc->max_cids value depends on the number of threads.
3) At that point, a new thread is cloned, thus invoking
sched_mm_cid_fork(). Calling sched_mm_cid_add_user() increases the
user count and invokes mm_update_max_cids(), which updates the
mc->max_cids limit but does not set the mc->transit flag, because
this call does not swap from per-cpu to per-task mode (the mode is
already per-task).
Immediately after the call to sched_mm_cid_add_user(),
sched_mm_cid_fork() attempts to call mm_get_cid() while the mm_cid
mutex and mm_cid lock are held, and loops forever: all max_cids IDs
in the mm_cid mask are still reserved by the stale per-cpu CIDs.
I see two possible issues here:
A) mm_update_cpus_allowed can transition from per-cpu to per-task mm_cid
mode without setting the mc->transit flag.
B) sched_mm_cid_fork() calls mm_get_cid() before invoking
mm_cid_fixup_cpus_to_tasks(), which would reclaim stale per-cpu
mm_cids and make them available to mm_get_cid().
Thoughts?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com