Re: [RFC PATCH 1/2] Fix: sched/membarrier: p->mm->membarrier_state racy load

From: Mathieu Desnoyers
Date: Thu Sep 05 2019 - 20:51:41 EST


----- On Sep 4, 2019, at 2:26 PM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:

> On Wed, Sep 04, 2019 at 01:12:53PM -0400, Mathieu Desnoyers wrote:
>> ----- On Sep 4, 2019, at 12:09 PM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:
>>
>> > On Wed, Sep 04, 2019 at 11:19:00AM -0400, Mathieu Desnoyers wrote:
>> >> ----- On Sep 3, 2019, at 4:36 PM, Linus Torvalds torvalds@xxxxxxxxxxxxxxxxxxxx
>> >> wrote:
>> >
>> >> > I wonder if the easiest model might be to just use a percpu variable
>> >> > instead for the membarrier stuff? It's not like it has to be in
>> >> > 'struct task_struct' at all, I think. We only care about the current
>> >> > runqueues, and those are percpu anyway.
>> >>
>> >> One issue here is that membarrier iterates over all runqueues without
>> >> grabbing any runqueue lock. If we copy that state from mm to rq on
>> >> sched switch prepare, we would need to ensure we have the proper
>> >> memory barriers between:
>> >>
>> >> prior user-space memory accesses / setting the runqueue membarrier state
>> >>
>> >> and
>> >>
>> >> setting the runqueue membarrier state / following user-space memory accesses
>> >>
>> >> Copying the membarrier state into the task struct leverages the fact that
>> >> we have documented and guaranteed those barriers around the rq->curr update
>> >> in the scheduler.
>> >
>> > Should be the same as the barriers we already rely on for rq->curr, no?
>> > That is, if we put this before switch_mm() then we have
>> > smp_mb__after_spinlock() and switch_mm() itself.
>>
>> Yes, I think we can piggy-back on the already documented barriers documented
>> around
>> rq->curr store.
>>
>> > Also, if we place mm->membarrier_state in the same cacheline as mm->pgd
>> > (which switch_mm() is bound to load) then we should be fine, I think.
>>
>> Yes, if we make sure membarrier_prepare_task_switch only updates the
>> rq->membarrier_state if prev->mm != next->mm, we should be able to avoid
>> loading next->mm->membarrier_state when switch_mm() is not invoked.
>>
>> I'll prepare RFC patch implementing this approach.
>
> Thinking about this a bit; switching it 'on' still requires some
> thinking. Consider register on an already threaded process of which
> multiple threads are 'current' on multiple CPUs. In that case none of
> the rq bits will be set.
>
> Not even synchronize_rcu() is sufficient to force it either, since we
> only update on switch_mm() and nothing guarantees we pass that.
>
> One possible approach would be to IPI broadcast (after setting the
> ->mm->membarrier_State) and having the IPI update the rq state from
> 'current->mm'.
>
> But possible I'm just confusing evryone again. I'm not having a good day
> today.

Indeed, switching 'on' requires some care. I implemented the IPI-based
approach as per your suggestion,

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com