Re: [PATCH 2/2] x86/intel_rdt: Plug task_work vs task_struct {rmid,closid} update race

From: Valentin Schneider
Date: Fri Nov 20 2020 - 10:54:59 EST



Hi James,

On 20/11/20 14:53, James Morse wrote:
> Hi Valentin,
>
> On 18/11/2020 18:00, Valentin Schneider wrote:
>> Upon moving a task to a new control / monitor group, said task's {closid,
>> rmid} fields are updated *after* triggering the move_myself() task_work
>> callback. This can cause said callback to miss the update, e.g. if the
>> triggering thread got preempted before fiddling with task_struct, or if the
>> targeted task was already on its way to return to userspace.
>
> So, if move_myself() runs after task_work_add() but before tsk is written to.
> Sounds fun!
>
>
>> Update the task_struct's {closid, rmid} tuple *before* invoking
>> task_work_add(). As they can happen concurrently, wrap {closid, rmid}
>> accesses with READ_ONCE() and WRITE_ONCE(). Highlight the required ordering
>> with a pair of comments.
>
> ... and this one is if move_myself() or __resctrl_sched_in() runs while tsk is being
> written to on another CPU. It might see torn values, or multiple reads may return different values.
>
> The READ_ONCE/WRITE_ONCEry would have been easier to read as a separate patch, since you
> touch all sites, and move/change some of them.
>

True, I initially only fixed up the reads/writes involved in
__rdtgroup_move_task(), but ended up coccinelle'ing the whole lot - which I
should then have split out into a dedicated patch. Thanks for powering
through it; I'll send a v2 with a neater split.

> Regardless:
> Reviewed-by: James Morse <james.morse@xxxxxxx>
>

Thanks!

>
> I don't 'get' memory-ordering, so one curiosity below:
>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index b6b5b95df833..135a51529f70 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -524,11 +524,13 @@ static void move_myself(struct callback_head *head)
>> * If resource group was deleted before this task work callback
>> * was invoked, then assign the task to root group and free the
>> * resource group.
>> + *
>> + * See pairing atomic_inc() in __rdtgroup_move_task()
>> */
>> if (atomic_dec_and_test(&rdtgrp->waitcount) &&
>> (rdtgrp->flags & RDT_DELETED)) {
>> - current->closid = 0;
>> - current->rmid = 0;
>> + WRITE_ONCE(current->closid, 0);
>> + WRITE_ONCE(current->rmid, 0);
>> kfree(rdtgrp);
>> }
>>
>> @@ -553,14 +555,32 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
>
>> /*
>> * Take a refcount, so rdtgrp cannot be freed before the
>> * callback has been invoked.
>> + *
>> + * Also ensures above {closid, rmid} writes are observed by
>> + * move_myself(), as it can run immediately after task_work_add().
>> + * Otherwise old values may be loaded, and the move will only actually
>> + * happen at the next context switch.
>
> But __resctrl_sched_in() can still occur at anytime and READ_ONCE() a pair of values that
> don't go together?

Yes, the thought did cross my mind...

> I don't think this is a problem for RDT, as with the old rmid the task was a member of that
> monitor-group previously, and 'freed' rmids are kept in limbo for a while after.
> (old-closid is the same as the task having not schedule()d since the change, which is fine).
>
> For MPAM, this is more annoying as changing just the closid may put the task in a
> monitoring group that never existed, meaning it's surprise dirty later.
>
> If this all makes sense, I guess the fix (for much later) is to union closid/rmid, and
> WRITE_ONCE() them together where necessary.
> (I've made a note for when I next pass that part of the MPAM tree)
>

It does make sense to me - one more question back to you: can RDT exist on
an X86_32 system? A u64 WRITE_ONCE() wouldn't be single-copy atomic there,
so it shouldn't be a stopper, but it would be an inconvenience.
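
For the record, here's roughly what I pictured for the union'd approach -
entirely hypothetical, field names made up, and it leans on a u64
WRITE_ONCE() being single-copy atomic (hence the X86_32 question):

  /*
   * Hypothetical union letting {closid, rmid} be published as a
   * single 64-bit value; not what this patch does.
   */
  union rdt_task_ids {
          u64 val;
          struct {
                  u32 closid;
                  u32 rmid;
          };
  };

  /* Writer: publish both IDs with a single store */
  WRITE_ONCE(tsk->rdt_ids.val, new_ids.val);

  /* Reader, e.g. __resctrl_sched_in(): one load gets a coherent pair */
  union rdt_task_ids ids = { .val = READ_ONCE(tsk->rdt_ids.val) };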

FWIW kernel/sched/fair.c uses two synced u64s for this on 32-bit; see

  struct cfs_rq { .min_vruntime, .min_vruntime_copy }

and

  kernel/sched/fair.c:update_min_vruntime()
  kernel/sched/fair.c:migrate_task_rq_fair()
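
i.e. the !CONFIG_64BIT pattern there is roughly (paraphrased, #ifdefs
omitted):

  /* Writer side, update_min_vruntime(): store value, then its copy */
  cfs_rq->min_vruntime = vruntime;
  smp_wmb();
  cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;

  /* Reader side, migrate_task_rq_fair(): retry until a stable pair */
  do {
          min_vruntime_copy = cfs_rq->min_vruntime_copy;
          smp_rmb();
          min_vruntime = cfs_rq->min_vruntime;
  } while (min_vruntime != min_vruntime_copy);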
>
>> + *
>> + * Pairs with atomic_dec() in move_myself().
>> */
>> atomic_inc(&rdtgrp->waitcount);
>> +
>> ret = task_work_add(tsk, &callback->work, TWA_RESUME);
>> if (ret) {
>> /*
>
>
> Thanks!
>
> James