Re: [PATCH] sched/mmcid: fix OOB clear_bit when CID is MM_CID_UNSET in fixup path

From: Mathieu Desnoyers

Date: Tue Jun 16 2026 - 12:22:38 EST

On 2026-06-16 10:53, Rik van Riel wrote:

In mm_cid_fixup_cpus_to_tasks(), when rq->curr has the target mm and
mm_cid.active is set, the CID is checked with cid_in_transit() before
setting the transition bit. In per-CPU mode a newly forked or exec'd
task can be running with mm_cid.cid == MM_CID_UNSET because CIDs are
assigned lazily on schedule-in. With cid_in_transit() the guard passes
for MM_CID_UNSET (no transit bit), converts it to MM_CID_UNSET |
MM_CID_TRANSIT and stores it back; later mm_cid_schedout() feeds this
to clear_bit() with MM_CID_UNSET as the bit number, triggering an
out-of-bounds write.

Symptoms: this is genuine memory corruption, but a bounded out-of-bounds
write, not an arbitrary one. MM_CID_UNSET is the fixed sentinel BIT(31),
so once the bad value reaches mm_cid_schedout() the cid_from_transit_cid()
strip leaves MM_CID_UNSET, which fails the "cid < max_cids" convergence
test and falls into mm_drop_cid() -> clear_bit(MM_CID_UNSET,
mm_cidmask(mm)). The cid bitmap is embedded in the mm_struct slab object
(after cpu_bitmap and mm_cpus_allowed) and is only num_possible_cpus()
bits wide, so clearing bit number 2^31 (the value of BIT(31), not bit 31)
is a deterministic OOB bit-clear at a fixed offset of 2^31 / 8 == 256 MiB
past the bitmap base. The address is
not attacker-influenced (fixed sentinel -> fixed offset) and the op only
clears a single bit; what sits 256 MiB further along the direct map is
whatever kernel object happens to live there, so this corrupts one bit of
unpredictable kernel memory -- it is not an arbitrary-address or
arbitrary-value write.
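
The path described above can be modeled in user space. This is a minimal
sketch, not the kernel code: MM_CID_UNSET == BIT(31) comes from the
description, while the MM_CID_TRANSIT bit position, the 64-bit word size,
and the helper names schedout_would_drop() and clear_bit_byte_offset() are
assumptions made for illustration only.

```c
#include <assert.h>   /* only needed for runtime checks by the caller */
#include <stdbool.h>
#include <stdint.h>

#define MM_CID_UNSET   (UINT32_C(1) << 31)  /* BIT(31), per the description */
#define MM_CID_TRANSIT (UINT32_C(1) << 29)  /* ASSUMED bit position */

/* User-space models of the transit helpers named in the description. */
static inline uint32_t cid_to_transit_cid(uint32_t cid)
{
	return cid | MM_CID_TRANSIT;
}

static inline uint32_t cid_from_transit_cid(uint32_t cid)
{
	return cid & ~MM_CID_TRANSIT;
}

/*
 * Hypothetical helper: models the "cid < max_cids" convergence test.
 * Returns true when the stripped value fails the test, i.e. when the
 * real code would fall through to mm_drop_cid() and clear_bit().
 */
static inline bool schedout_would_drop(uint32_t stored, uint32_t max_cids)
{
	uint32_t cid = cid_from_transit_cid(stored);

	return !(cid < max_cids);
}

/*
 * Hypothetical helper: byte offset, relative to the bitmap base, of the
 * word that clear_bit(nr, base) touches, ASSUMING 64-bit longs.
 */
static inline uint64_t clear_bit_byte_offset(uint32_t nr)
{
	return ((uint64_t)nr / 64) * 8;
}

/* The arithmetic from the description: 2^31 bits / 8 == 256 MiB. */
_Static_assert(((UINT64_C(1) << 31) / 8) == (UINT64_C(256) << 20),
	       "bit 2^31 lies 256 MiB past the bitmap base");
```

Under these assumptions, cid_to_transit_cid(MM_CID_UNSET) yields
MM_CID_UNSET | MM_CID_TRANSIT. Stripping the transit bit gives back
MM_CID_UNSET, which fails the convergence test for any realistic max_cids,
and clear_bit_byte_offset(MM_CID_UNSET) evaluates to 0x10000000.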

It triggers only in per-CPU CID mode, when a CPU is running an active
task of the target mm whose cid is still MM_CID_UNSET -- the
fork()/execve() window before that task's next schedule-in assigns it a
real CID -- and a per-CPU -> per-task fixup walks over it (the mode
fallback driven by a thread exit, sched_mm_cid_exit(), or by the deferred
max_cids recompute in mm_cid_work_fn()).

In practice syzkaller surfaced it as a KASAN use-after-free reported in
__schedule -> mm_cid_switch_to, where the offending clear_bit() is inlined
via mm_cid_schedout() -> mm_drop_cid().

Switch to cid_on_task() which excludes MM_CID_UNSET, MM_CID_ONCPU, and
MM_CID_TRANSIT, so we only set the transition bit on a genuine
task-owned CID.

[...]

---
kernel/sched/core.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b791e9e9f67..4c8b6ca254ce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10909,8 +10909,19 @@ static void mm_cid_fixup_cpus_to_tasks(struct mm_struct *mm)
} else if (rq->curr->mm == mm && rq->curr->mm_cid.active) {
unsigned int cid = rq->curr->mm_cid.cid;
- /* Ensure it has the transition bit set */
- if (!cid_in_transit(cid)) {
+ /*
+ * Only a genuine task-owned CID needs the transition
+ * bit. A running active task can legitimately have
+ * MM_CID_UNSET here: in per-CPU mode CIDs are assigned
+ * lazily on schedule-in, so fork()/execve() leave the
+ * task active with no owned CID until its next
+ * schedule-in. cid_on_task() excludes the
+ * MM_CID_UNSET/ONCPU/TRANSIT bits, so we never turn
+ * e.g. MM_CID_UNSET into MM_CID_UNSET|MM_CID_TRANSIT,
+ * which mm_cid_schedout() would later feed to
+ * clear_bit() as an out-of-bounds bit number.
+ */
+ if (cid_on_task(cid)) {

I agree that something is wrong with the "unset" bit handling here, but
I'm not sure I fully understand why we would gate the
"fixup_cpus_to_tasks" on "cid_on_task()".

AFAIU there are the following relevant states:

- cid_on_task: True if none of the MM_CID_ONCPU, MM_CID_TRANSIT, MM_CID_UNSET bits is set

- MM_CID_TRANSIT: what this fixup aims to set if not already set.

- MM_CID_UNSET: tag indicating that the mm_cid is unset. This triggers the issue here
because the "!cid_in_transit(cid)" check does not cover it.

- MM_CID_ONCPU: tag indicating that the cid is on cpu. Technically the "origin"
state we are trying to transition from.

The cid_on_task() check will eliminate the "unset" case, but will also eliminate
the "oncpu" case, which I suspect is the initial state we want to transition from.

Did you try changing this to the following (completely untested) check instead:

if (!cid_in_transit(cid) && !(cid & MM_CID_UNSET)) { ?
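
To make the difference between the three guards concrete, here is a small
user-space sketch (equally untested against the real scheduler). It assumes
MM_CID_ONCPU and MM_CID_TRANSIT are single flag bits at the positions shown;
only MM_CID_UNSET == BIT(31) is taken from the patch description. The
guard_*() and check_guards() names are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MM_CID_UNSET   (UINT32_C(1) << 31)  /* BIT(31), per the description */
#define MM_CID_ONCPU   (UINT32_C(1) << 30)  /* ASSUMED bit position */
#define MM_CID_TRANSIT (UINT32_C(1) << 29)  /* ASSUMED bit position */

/* User-space models of the predicates discussed in this thread. */
static inline bool cid_in_transit(uint32_t cid)
{
	return cid & MM_CID_TRANSIT;
}

static inline bool cid_on_task(uint32_t cid)
{
	/* True if none of the ONCPU, TRANSIT, UNSET bits is set. */
	return !(cid & (MM_CID_UNSET | MM_CID_ONCPU | MM_CID_TRANSIT));
}

/* Guard currently in the tree. */
static inline bool guard_original(uint32_t cid)
{
	return !cid_in_transit(cid);
}

/* Guard proposed by the patch. */
static inline bool guard_patch(uint32_t cid)
{
	return cid_on_task(cid);
}

/* Guard suggested in this reply. */
static inline bool guard_suggested(uint32_t cid)
{
	return !cid_in_transit(cid) && !(cid & MM_CID_UNSET);
}

/* Hypothetical self-check walking the four states listed above. */
static inline void check_guards(void)
{
	const uint32_t task_owned = 3;
	const uint32_t cpu_owned  = 3 | MM_CID_ONCPU;
	const uint32_t in_transit = 3 | MM_CID_TRANSIT;
	const uint32_t unset      = MM_CID_UNSET;

	/* A plain task-owned CID passes all three guards. */
	assert(guard_original(task_owned));
	assert(guard_patch(task_owned));
	assert(guard_suggested(task_owned));

	/* An already-transitioning CID is skipped by all three. */
	assert(!guard_original(in_transit));
	assert(!guard_patch(in_transit));
	assert(!guard_suggested(in_transit));

	/* The bug: only the original guard lets MM_CID_UNSET through. */
	assert(guard_original(unset));
	assert(!guard_patch(unset));
	assert(!guard_suggested(unset));

	/* The open question: the ONCPU case is where the two fixes differ. */
	assert(guard_original(cpu_owned));
	assert(!guard_patch(cpu_owned));
	assert(guard_suggested(cpu_owned));
}
```

Both fixes reject MM_CID_UNSET. They differ only for a value carrying
MM_CID_ONCPU, and the sketch does not settle whether such a value can
actually be observed in rq->curr->mm_cid.cid at this point.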

Thanks,

Mathieu

cid = cid_to_transit_cid(cid);
rq->curr->mm_cid.cid = cid;
pcp->cid = cid;

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com