Re: [patch V3 00/12] rseq: Implement time slice extension mechanism

From: Mathieu Desnoyers

Date: Tue Nov 11 2025 - 11:42:33 EST


On 2025-11-10 09:23, Mathieu Desnoyers wrote:
> On 2025-11-06 12:28, Prakash Sangappa wrote:
>> [...]
>> Hit this watchdog panic.
>>
>> Using the following tree. Assume this is the latest.
>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>>
>> Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
>> https://lore.kernel.org/all/20251029123717.886619142@xxxxxxxxxxxxx/

> When this happened during the development of the "complex" mm_cid
> scheme, this was typically caused by a stale "mm_cid" being kept around
> by a task even though it was not actually scheduled, thus causing
> over-reservation of concurrency IDs beyond the max_cids threshold. This
> ends up looping in:

> static inline unsigned int mm_get_cid(struct mm_struct *mm)
> {
>         unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
>
>         while (cid == MM_CID_UNSET) {
>                 cpu_relax();
>                 cid = __mm_get_cid(mm, num_possible_cpus());
>         }
>         return cid;
> }
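For what it's worth, the reservation scan behind that loop can be modeled in plain userspace C. This is a hypothetical simplification, not the kernel code: the `_sim` names are mine, `NR_CIDS_SIM` stands in for num_possible_cpus(), and a plain bool array stands in for the CID bitmap. It only illustrates why mm_get_cid() spins once every CID below the limit is reserved:

```c
/*
 * Hypothetical userspace model of the CID reservation scan,
 * for illustration only; names suffixed _sim are stand-ins.
 */
#include <stdbool.h>

#define MM_CID_UNSET_SIM (~0U)
#define NR_CIDS_SIM 8 /* stand-in for num_possible_cpus() */

struct mm_cid_sim {
	bool reserved[NR_CIDS_SIM]; /* stand-in for the CID bitmap */
	unsigned int max_cids;      /* current reservation limit */
};

/* Reserve the first free CID below @limit, as __mm_get_cid() would. */
static unsigned int __mm_get_cid_sim(struct mm_cid_sim *mm, unsigned int limit)
{
	for (unsigned int cid = 0; cid < limit && cid < NR_CIDS_SIM; cid++) {
		if (!mm->reserved[cid]) {
			mm->reserved[cid] = true;
			return cid;
		}
	}
	/*
	 * Every CID below the limit is taken. The real mm_get_cid()
	 * would cpu_relax() and retry; if the reserved CIDs are stale
	 * and never released, the retry loop never terminates.
	 */
	return MM_CID_UNSET_SIM;
}
```

With all slots below max_cids held by stale per-cpu reservations, every retry returns the UNSET value, which is exactly the spin the watchdog flagged.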

> Based on the stack trace you provided, it seems to happen within
> sched_mm_cid_fork() within copy_process(), so perhaps it's simply an
> initialization issue in fork, or an issue when cloning a new thread?

I've spent some time digging through Thomas' implementation of
mm_cid management. I've spotted something which may explain
the watchdog panic. Here is the scenario:

1) A process is constrained to a subset of the possible CPUs,
and has enough threads to swap from per-thread to per-cpu mm_cid
mode. It runs happily in that per-cpu mode.

2) The number of allowed CPUs is increased for the process, thus invoking
mm_update_cpus_allowed. This switches the mode back to per-thread,
but defers the invocation of mm_cid_work_fn to some point in the
future, in thread context, through irq_work + schedule_work.

At that point, because mm_update_cpus_allowed only called
__mm_update_max_cids, max_cids is updated, but mc->transit
is still zero.

Also, until mm_cid_fixup_cpus_to_tasks is invoked, either by the
scheduled work, near the end of sched_mm_cid_fork, or by
sched_mm_cid_exit, we are in a state where mm_cids are still
owned by CPUs, even though we are now in per-thread mm_cid mode,
which means that the mc->max_cids value depends on the number of
threads.
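To make that window concrete, here is a hypothetical C sketch of the step 2 transition. All names are illustrative stand-ins for the kernel structures, under the assumption that max_cids tracks the allowed-CPU count in per-cpu mode and the thread count in per-task mode:

```c
/* Hypothetical model of the mode-transition window; _sim names are mine. */
#include <stdbool.h>

enum cid_mode_sim { CID_PER_CPU_SIM, CID_PER_TASK_SIM };

struct mm_cid_sim2 {
	enum cid_mode_sim mode;
	unsigned int nr_users;        /* threads sharing the mm */
	unsigned int nr_cpus_allowed; /* size of the allowed-CPU mask */
	unsigned int max_cids;        /* reservation limit */
	unsigned int cpu_owned;       /* CIDs still held by CPUs */
	bool transit;                 /* per-cpu -> per-task fixup pending */
};

/* Rough stand-in for __mm_update_max_cids(): recompute the limit
 * from the current mode; ownership and transit are untouched. */
static void update_max_cids_sim(struct mm_cid_sim2 *mm)
{
	mm->max_cids = (mm->mode == CID_PER_CPU_SIM) ?
		       mm->nr_cpus_allowed : mm->nr_users;
}

/* Stand-in for mm_update_cpus_allowed() when the mask grows: flip
 * back to per-task mode and recompute max_cids, but leave transit
 * unset and the CPU-owned CIDs in place, since the fixup is
 * deferred to scheduled work (issue A below). */
static void cpus_allowed_grew_sim(struct mm_cid_sim2 *mm, unsigned int ncpus)
{
	mm->nr_cpus_allowed = ncpus;
	mm->mode = CID_PER_TASK_SIM;
	update_max_cids_sim(mm);
	/* mm->transit and mm->cpu_owned deliberately untouched */
}
```

After this transition, every one of the max_cids slots can still be held by a CPU while the limit is now derived from the thread count, which is the over-reservation described above.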

3) At that point, a new thread is cloned, thus invoking
sched_mm_cid_fork. Calling sched_mm_cid_add_user increases the user
count and invokes mm_update_max_cids, which updates the mc->max_cids
limit, but does not set the mc->transit flag because this call does not
swap from per-cpu to per-task mode (the mode is already per-task).

Immediately after the call to sched_mm_cid_add_user, sched_mm_cid_fork()
calls mm_get_cid while the mm_cid mutex and mm_cid lock are held,
and loops forever: the stale per-cpu CIDs leave all of the max_cids
IDs reserved in the mm_cid mask.

I see two possible issues here:

A) mm_update_cpus_allowed can transition from per-cpu to per-task mm_cid
mode without setting the mc->transit flag.

B) sched_mm_cid_fork calls mm_get_cid() before invoking
mm_cid_fixup_cpus_to_tasks(), which would reclaim the stale per-cpu
mm_cids and make them available to mm_get_cid().
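The ordering behind B can be sketched in the same hypothetical model: fixup_cpus_to_tasks_sim() stands in for mm_cid_fixup_cpus_to_tasks(), and reclaiming the CPU-owned CIDs before the allocation is what restores forward progress:

```c
/* Hypothetical sketch of the issue B ordering; _sim names are mine. */
#include <stdbool.h>

#define NR_SIM 4
#define UNSET_SIM (~0U)

struct cid_pool_sim {
	bool reserved[NR_SIM];  /* the CID bitmap */
	bool cpu_owned[NR_SIM]; /* CIDs still held by (now stale) CPUs */
};

/* Stand-in for mm_cid_fixup_cpus_to_tasks(): release CPU-owned CIDs. */
static void fixup_cpus_to_tasks_sim(struct cid_pool_sim *p)
{
	for (unsigned int cid = 0; cid < NR_SIM; cid++) {
		if (p->cpu_owned[cid]) {
			p->cpu_owned[cid] = false;
			p->reserved[cid] = false; /* back to the pool */
		}
	}
}

/* Allocate the first free CID, or UNSET_SIM if none is available. */
static unsigned int get_cid_sim(struct cid_pool_sim *p)
{
	for (unsigned int cid = 0; cid < NR_SIM; cid++) {
		if (!p->reserved[cid]) {
			p->reserved[cid] = true;
			return cid;
		}
	}
	return UNSET_SIM;
}
```

Allocating before the fixup fails (and in the kernel would spin); running the fixup first releases the stale reservations so the fork path can make progress.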

Thoughts?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com