Re: [PATCH 2/2] rseq/selftests: Add test for mm_cid compaction

From: Gabriele Monaco
Date: Tue Feb 18 2025 - 03:14:19 EST

On Mon, 2025-02-17 at 14:59 -0500, Mathieu Desnoyers wrote:
> On 2025-02-17 06:23, Gabriele Monaco wrote:
> > A task in the kernel (task_mm_cid_work) runs somewhat periodically
> > to compact the mm_cid for each process. Add a test to validate that
> > it runs correctly and timely.
> >
> > The test spawns 1 thread pinned to each CPU, then each thread,
> > including the main one, runs in short bursts for some time. During
> > this period, the mm_cids should be spanning all numbers between 0
> > and nproc.
> >
> > At the end of this phase, a thread with a high enough mm_cid
> > (>= nproc/2) is selected to be the new leader; all other threads
> > terminate.
> >
> > After some time, the only running thread should see 0 as its
> > mm_cid; if that doesn't happen, the compaction mechanism didn't
> > work and the test fails.
> >
> > Since mm_cid compaction is less likely for tasks running in short
> > bursts, we increase the likelihood by just running a busy loop at
> > every iteration. This compaction is a best-effort mechanism and
> > this behaviour is currently acceptable.
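
For reference, the core of the test looks roughly like this (a
simplified sketch: the real test picks a leader thread with
mm_cid >= nproc/2, while here the main thread just checks compaction
itself; rseq_register_current_thread() and rseq_current_mm_cid() are
assumed to come from the selftests rseq.h harness):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>
#include "rseq.h"	/* selftests harness, assumed API */

static atomic_int stop;

/* One thread pinned per CPU, running in short periodic bursts. */
static void *burst_thread(void *arg)
{
	int cpu = (int)(long)arg;
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
	rseq_register_current_thread();

	while (!atomic_load(&stop)) {
		for (volatile long i = 0; i < 1000000; i++)
			;		/* short busy burst */
		usleep(10000);		/* then sleep: periodic workload */
	}
	return NULL;
}

int main(void)
{
	int i, ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	pthread_t threads[ncpus];

	rseq_register_current_thread();
	for (i = 0; i < ncpus; i++)
		pthread_create(&threads[i], NULL, burst_thread,
			       (void *)(long)i);
	sleep(5);			/* phase 1: mm_cids span 0..nproc */
	atomic_store(&stop, 1);
	for (i = 0; i < ncpus; i++)
		pthread_join(threads[i], NULL);

	/* Phase 2: running alone, our mm_cid should compact to 0. */
	for (i = 0; i < 100; i++) {
		if (rseq_current_mm_cid() == 0)
			return 0;
		for (volatile long j = 0; j < 10000000; j++)
			;		/* busy loop to keep getting ticks */
	}
	fprintf(stderr, "mm_cid compaction did not happen\n");
	return 1;
}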
>
> I'm wondering what we can do to make this compaction scheme more
> predictable.
>
> The situation here is caused by the fact that the CID compaction
> only happens on the scheduler tick. If the workload is periodic and
> runs in short bursts, chances are that the scheduler tick never
> issues task_tick_mm_cid() for a given process, so no compaction.
>
> So task_tick_mm_cid() basically does:
>
> void task_tick_mm_cid(struct rq *rq, struct task_struct *curr)
> {
>          struct callback_head *work = &curr->cid_work;
>          unsigned long now = jiffies;
>
>          if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
>              work->next != work)
>                  return;
>          if (time_before(now, READ_ONCE(curr->mm->mm_cid_next_scan)))
>                  return;
>
>          /* No page allocation under rq lock */
>          task_work_add(curr, work, TWA_RESUME | TWAF_NO_ALLOC);
> }
>
> So typically we have a "time_before()" check that is hit and
> paces the execution of this task_work every 100ms or so.
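
(For context on that pacing: the scan window is re-armed by the task
work itself. Roughly, simplified from task_mm_cid_work() in
kernel/sched/core.c, where MM_CID_SCAN_DELAY is 100ms:)

	old_scan = READ_ONCE(mm->mm_cid_next_scan);
	next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
	if (time_before(now, old_scan))
		return;			/* window not elapsed yet */
	if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan, next_scan))
		return;			/* another thread won the race */
	/* ... scan this mm's cids and compact ... */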
>
> If we have periodic tasks, that means those tasks are necessarily
> preempted so they are not current when the tick happens. If the
> task cares about compaction of mm_cid, it means it has returned
> to userspace after that preemption.
>
> Sooo, we happen to have code in kernel/rseq.c called exactly at
> that point:
>
> __rseq_handle_notify_resume()
>
> I wonder if we could perhaps just call task_tick_mm_cid() (or a
> version of it renamed to something more meaningful) from
> __rseq_handle_notify_resume()? By combining the time_before() checks
> from the scheduler tick and at return to userspace after preemption,
> AFAIU we'd be handling the periodic workload correctly, and therefore
> this test for mm_cid compaction could check for more robust
> guarantees.
>
> Thoughts ?

Alright, that seems better: the task work already runs there
(resume_user_mode_work) and is only set as pending once we get the
tick, so I agree that step seems a bit redundant.

In this case I'd call task_mm_cid_work where
rseq_handle_notify_resume is called, rather than task_tick_mm_cid.
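
Roughly something like this (a sketch only, names hypothetical):

/* Hypothetical sketch, not an actual patch: run the paced mm_cid
 * compaction check on the way back to userspace, next to the rseq
 * fixup, instead of arming a task_work from the tick.
 * task_mm_cid_work_check() is an illustrative name that would keep
 * the existing time_before(mm_cid_next_scan) pacing. */
static inline void resume_user_mode_work(struct pt_regs *regs)
{
	clear_thread_flag(TIF_NOTIFY_RESUME);
	/* ... existing notify-resume work ... */
	rseq_handle_notify_resume(NULL, regs);
	task_mm_cid_work_check(current);	/* hypothetical */
}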

The way I see it, rseq_handle_notify_resume behaves essentially like
a task_work with TWA_RESUME (setting TIF_NOTIFY_RESUME on its own at
syscalls, task switches and migrations). task_mm_cid_work, instead,
sets TIF_NOTIFY_RESUME on ticks (via the task_work API). This last
bit could change, conceptually I mean: we probably don't want to use
task_work at all in such contexts.
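
To spell out the two paths as I understand them:

/*
 * Current (tick-driven) path:
 *   scheduler tick
 *     -> task_tick_mm_cid()
 *          -> task_work_add(curr, work, TWA_RESUME)  // sets TIF_NOTIFY_RESUME
 *   return to userspace
 *     -> resume_user_mode_work()
 *          -> task_work_run() -> task_mm_cid_work()
 *
 * Proposed (resume-driven) path:
 *   return to userspace with TIF_NOTIFY_RESUME set
 *     -> resume_user_mode_work()
 *          -> __rseq_handle_notify_resume() plus the mm_cid scan
 *             check (still paced by its time_before() test)
 */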

Does this make sense to you?


However, I'm still not particularly fond of running stuff there at all.
If a periodic task needs to run now, it preempts everything else and
should be on its way as soon as possible. A task work is always going
to delay this, if only by a tiny bit.

Again, for now I cannot think of a better way without bringing
workqueues into the picture, and in this specific case we have a valid
workaround to reduce the latency.

Thanks,
Gabriele