Re: [patch V3 17/20] sched/mmcid: Provide CID ownership mode fixup functions

From: Mathieu Desnoyers

Date: Thu Oct 30 2025 - 11:51:08 EST


On 2025-10-29 09:09, Thomas Gleixner wrote:

At the point of switching to per CPU mode the new user is not yet visible
in the system, so the task which initiated the fork() runs the fixup
function: mm_cid_fixup_tasks_to_cpu() walks the thread list and either
transfers each task's owned CID to the CPU the task runs on or drops it into
the CID pool if a task is not on a CPU at that point in time. Tasks which
schedule in before the task walk reaches them do the handover in
mm_cid_schedin(). When mm_cid_fixup_tasks_to_cpus() completes it's
guaranteed that no task related to that MM owns a CID anymore.

Switching back to task mode happens when the user count goes below the
threshold which was recorded on the per CPU mode switch:

pcpu_thrs = min(opt_cids - (opt_cids / 4), nr_cpu_ids / 2);


AFAIU this provides a hysteresis so we don't switch back and
forth between modes if a single thread is forked/exited repeatedly,
right?


did not cover yet do the handover themself.

themselves


This transition from CPU to per task ownership happens in two phases:

1) mm:mm_cid.transit contains MM_CID_TRANSIT. This is OR'ed on the task
CID and denotes that the CID is only temporarily owned by the
task. When it schedules out the task drops the CID back into the
pool if this bit is set.

OK, so the mm_drop_cid() on sched out only happens due to a transition
from per-cpu back to per-task. This answers my question in the previous
patch.



2) The initiating context walks the per CPU space and after completion
clears mm:mm_cid.transit. After that point the CIDs are strictly
task owned again.

This two phase transition is required to prevent CID space exhaustion
during the transition as a direct transfer of ownership would fail if
two tasks are scheduled in on the same CPU before the fixup freed per
CPU CIDs.

Clever. :-)


+ * Switching to per CPU mode happens when the user count becomes greater
+ * than the maximum number of CIDs, which is calculated by:
+ *
+ * opt_cids = min(mm_cid::nr_cpus_allowed, mm_cid::users);
+ * max_cids = min(1.25 * opt_cids, num_possible_cpus());
[...]
+ * Switching back to task mode happens when the user count goes below the
+ * threshold which was recorded on the per CPU mode switch:
+ *
+ * pcpu_thrs = min(opt_cids - (opt_cids / 4), num_possible_cpus() / 2);

I notice that mm_update_cpus_allowed() calls __mm_update_max_cids() before updating the pcpu_thrs threshold.

sched_mm_cid_{add,remove}_user() only invoke mm_update_max_cids(mm)
without updating pcpu_thrs first.

Are those done on purpose?

Thanks,

Mathieu



--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com