Re: [PATCH 2/2] x86/intel_rdt: Plug task_work vs task_struct {rmid,closid} update race

From: James Morse
Date: Fri Nov 20 2020 - 09:53:49 EST


Hi Valentin,

On 18/11/2020 18:00, Valentin Schneider wrote:
> Upon moving a task to a new control / monitor group, said task's {closid,
> rmid} fields are updated *after* triggering the move_myself() task_work
> callback. This can cause said callback to miss the update, e.g. if the
> triggering thread got preempted before fiddling with task_struct, or if the
> targeted task was already on its way to return to userspace.

So, if move_myself() runs after task_work_add() but before tsk is written to.
Sounds fun!


> Update the task_struct's {closid, rmid} tuple *before* invoking
> task_work_add(). As they can happen concurrently, wrap {closid, rmid}
> accesses with READ_ONCE() and WRITE_ONCE(). Highlight the required ordering
> with a pair of comments.

... and this one is if move_myself() or __resctrl_sched_in() runs while tsk is being
written to on another CPU. It might get torn values, or multiple-reads get different values.

The READ_ONCE/WRITE_ONCEry would have been easier to read as a separate patch as you touch
all sites, and move/change some of them.

Regardless:
Reviewed-by: James Morse <james.morse@xxxxxxx>


I don't 'get' memory-ordering, so one curiosity below:

> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index b6b5b95df833..135a51529f70 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -524,11 +524,13 @@ static void move_myself(struct callback_head *head)
> * If resource group was deleted before this task work callback
> * was invoked, then assign the task to root group and free the
> * resource group.
> + *
> + * See pairing atomic_inc() in __rdtgroup_move_task()
> */
> if (atomic_dec_and_test(&rdtgrp->waitcount) &&
> (rdtgrp->flags & RDT_DELETED)) {
> - current->closid = 0;
> - current->rmid = 0;
> + WRITE_ONCE(current->closid, 0);
> + WRITE_ONCE(current->rmid, 0);
> kfree(rdtgrp);
> }
>
> @@ -553,14 +555,32 @@ static int __rdtgroup_move_task(struct task_struct *tsk,

> /*
> * Take a refcount, so rdtgrp cannot be freed before the
> * callback has been invoked.
> + *
> + * Also ensures above {closid, rmid} writes are observed by
> + * move_myself(), as it can run immediately after task_work_add().
> + * Otherwise old values may be loaded, and the move will only actually
> + * happen at the next context switch.

But __resctrl_sched_in() can still occur at anytime and READ_ONCE() a pair of values that
don't go together?
I don't think this is a problem for RDT as with old-rmid the task was a member of that
monitor-group previously, and 'freed' rmid are kept in limbo for a while after.
(old-closid is the same as the task having not schedule()d since the change, which is fine).

For MPAM, this is more annoying as changing just the closid may put the task in a
monitoring group that never existed, meaning its surprise dirty later.

If this all makes sense, I guess the fix (for much later) is to union closid/rmid, and
WRITE_ONCE() them together where necessary.
(I've made a note for when I next pass that part of the MPAM tree)


> + *
> + * Pairs with atomic_dec() in move_myself().
> */
> atomic_inc(&rdtgrp->waitcount);
> +
> ret = task_work_add(tsk, &callback->work, TWA_RESUME);
> if (ret) {
> /*


Thanks!

James