Re: [RFC PATCH 1/2] Fix: sched/membarrier: p->mm->membarrier_state racy load
From: Peter Zijlstra
Date: Wed Sep 04 2019 - 14:26:38 EST
On Wed, Sep 04, 2019 at 01:12:53PM -0400, Mathieu Desnoyers wrote:
> ----- On Sep 4, 2019, at 12:09 PM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:
>
> > On Wed, Sep 04, 2019 at 11:19:00AM -0400, Mathieu Desnoyers wrote:
> >> ----- On Sep 3, 2019, at 4:36 PM, Linus Torvalds torvalds@xxxxxxxxxxxxxxxxxxxx
> >> wrote:
> >
> >> > I wonder if the easiest model might be to just use a percpu variable
> >> > instead for the membarrier stuff? It's not like it has to be in
> >> > 'struct task_struct' at all, I think. We only care about the current
> >> > runqueues, and those are percpu anyway.
> >>
> >> One issue here is that membarrier iterates over all runqueues without
> >> grabbing any runqueue lock. If we copy that state from mm to rq on
> >> sched switch prepare, we would need to ensure we have the proper
> >> memory barriers between:
> >>
> >> prior user-space memory accesses / setting the runqueue membarrier state
> >>
> >> and
> >>
> >> setting the runqueue membarrier state / following user-space memory accesses
> >>
> >> Copying the membarrier state into the task struct leverages the fact that
> >> we have documented and guaranteed those barriers around the rq->curr update
> >> in the scheduler.
> >
> > Should be the same as the barriers we already rely on for rq->curr, no?
> > That is, if we put this before switch_mm() then we have
> > smp_mb__after_spinlock() and switch_mm() itself.
>
> Yes, I think we can piggy-back on the already documented barriers documented around
> rq->curr store.
>
> > Also, if we place mm->membarrier_state in the same cacheline as mm->pgd
> > (which switch_mm() is bound to load) then we should be fine, I think.
>
> Yes, if we make sure membarrier_prepare_task_switch only updates the
> rq->membarrier_state if prev->mm != next->mm, we should be able to avoid
> loading next->mm->membarrier_state when switch_mm() is not invoked.
>
> I'll prepare RFC patch implementing this approach.
Thinking about this a bit; switching it 'on' still requires some
thinking. Consider register on an already threaded process of which
multiple threads are 'current' on multiple CPUs. In that case none of
the rq bits will be set.
Not even synchronize_rcu() is sufficient to force it either, since we
only update on switch_mm() and nothing guarantees we pass that.
One possible approach would be to IPI broadcast (after setting the
->mm->membarrier_State) and having the IPI update the rq state from
'current->mm'.
But possible I'm just confusing evryone again. I'm not having a good day
today.