Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
From: Boqun Feng
Date: Thu Nov 12 2015 - 10:19:16 EST
On Thu, Nov 12, 2015 at 04:00:58PM +0100, Oleg Nesterov wrote:
> On 11/12, Boqun Feng wrote:
[snip]
> >
> > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> > only guarantees that the memory operations following spin_lock() can't
> > be reordered before the *LOAD* part of spin_lock(), not the *STORE* part,
> > i.e. the case below can happen (assuming spin_lock() is implemented
> > as an ll/sc loop):
> >
> > spin_lock(&lock):
> > r1 = *lock; // LL, r1 == 0
> > o = READ_ONCE(object); // could be reordered here.
> > *lock = 1; // SC
> >
> > This could happen because of the ACQUIRE semantics of spin_lock(), and
> > the current implementation of spin_lock() on PPC allows this to happen.
> >
> > (Cc PPC maintainers for their opinions on this one)
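To make the reordering concrete, here is a minimal C11 sketch (hypothetical names, not the kernel's PPC implementation) of a lock whose acquire ordering attaches only to the LOAD half of the lock operation, which is exactly what permits a later load to slip between the LL and the SC:

```c
#include <stdatomic.h>

/* Hypothetical sketch: an ll/sc-style lock modeled with a C11
 * compare-exchange. memory_order_acquire on the RMW orders later
 * accesses after the LOAD that observes the lock free, but nothing
 * orders them after the STORE that writes 1 -- so a subsequent
 * READ_ONCE(object) may be satisfied before the SC becomes visible. */
static atomic_int lock_word;  /* 0 == unlocked */

static void weak_spin_lock(atomic_int *l)
{
    int expected;
    do {
        expected = 0;  /* LL: expect the lock to be free */
    } while (!atomic_compare_exchange_weak_explicit(
                 l, &expected, 1,        /* SC: store 1 on success  */
                 memory_order_acquire,   /* ordering on the LOAD    */
                 memory_order_relaxed)); /* failed attempt: no order */
}

static void weak_spin_unlock(atomic_int *l)
{
    atomic_store_explicit(l, 0, memory_order_release);
}
```

This is only a model of the ordering under discussion; the actual PPC code uses lwarx/stwcx. with an acquire barrier after the loop.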
>
> In this case the code above is obviously wrong. And I do not understand
> how we can rely on spin_unlock_wait() then.
>
> And afaics do_exit() is buggy too then, see below.
>
> > I think it's OK for it as an ACQUIRE (with a proper barrier) or even just
> > a control dependency to pair with spin_unlock(), for example, the
> > following snippet in do_exit() is OK, except the smp_mb() is redundant,
> > unless I'm missing something subtle:
> >
> > /*
> > * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> > * when the following two conditions become true:
> > * - there is a race condition with mmap_sem (it is acquired by
> > *   exit_mm()), and
> > * - an SMI occurs before setting TASK_RUNNING
> > *   (or the hypervisor of a virtual machine switches to another guest).
> > * As a result, we may become TASK_RUNNING after becoming TASK_DEAD.
> > *
> > * To avoid this, we have to wait for the release of tsk->pi_lock, which
> > * is held by try_to_wake_up().
> > */
> > smp_mb();
> > raw_spin_unlock_wait(&tsk->pi_lock);
>
> Perhaps it is me who missed something. But I don't think we can remove
> this mb(). And at the same time it can't help on PPC if I understand
> your explanation above correctly.

You are right, we need this smp_mb() to order the previous STORE of
->state with the LOAD of ->pi_lock. I missed that because all the
explicit STOREs of ->state in do_exit() that I saw are done via
set_current_state(), which has an smp_mb() following the STORE.
>
> To simplify, lets ignore exit_mm/down_read/etc. The exiting task does
>
>
> current->state = TASK_UNINTERRUPTIBLE;
> // without schedule() in between
> current->state = TASK_RUNNING;
>
> smp_mb();
> spin_unlock_wait(pi_lock);
>
> current->state = TASK_DEAD;
> schedule();
>
> and we need to ensure that if we race with try_to_wake_up(TASK_UNINTERRUPTIBLE)
> it can't change TASK_DEAD back to RUNNING.
>
> Without smp_mb() this can be reordered, spin_unlock_wait(pi_lock) can
> read the old "unlocked" state of pi_lock before we set UNINTERRUPTIBLE,
> so in fact we could have
>
> current->state = TASK_UNINTERRUPTIBLE;
>
> spin_unlock_wait(pi_lock);
>
> current->state = TASK_RUNNING;
>
> current->state = TASK_DEAD;
>
> and this can obviously race with ttwu() which can take pi_lock and see
> state == TASK_UNINTERRUPTIBLE after spin_unlock_wait().
>
Yep, my mistake ;-)
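For concreteness, the exit-side ordering you describe can be sketched in C11, with atomic_thread_fence(memory_order_seq_cst) standing in for smp_mb(); the state values and names here are hypothetical, not the kernel code:

```c
#include <stdatomic.h>

#define TASK_RUNNING  0   /* hypothetical values, for illustration */
#define TASK_DEAD     64

static atomic_int task_state = TASK_RUNNING;
static atomic_int pi_lock;  /* 0 == unlocked */

/* Sketch of the do_exit() tail: the full fence keeps the ->state
 * store ordered before the pi_lock load inside spin_unlock_wait(),
 * so a ttwu() that already holds pi_lock cannot have read a stale
 * state while we proceed to set TASK_DEAD. */
static void exit_side(void)
{
    atomic_store_explicit(&task_state, TASK_RUNNING,
                          memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* smp_mb()        */
    /* spin_unlock_wait(): spin until the lock is observed free    */
    while (atomic_load_explicit(&pi_lock, memory_order_acquire))
        ;
    atomic_store_explicit(&task_state, TASK_DEAD,
                          memory_order_relaxed);
}
```

Without the fence, nothing stops the pi_lock load from being satisfied before the TASK_RUNNING store, which is exactly the reordering shown above.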
> And, if I understand you correctly, this smp_mb() can't help on PPC.
> try_to_wake_up() can read task->state before it writes to *pi_lock.
> To me this doesn't really differ from the code above,
>
> CPU 1 (do_exit)                        CPU 2 (ttwu)
>
>                                        spin_lock(pi_lock):
>                                          r1 = *pi_lock; // r1 == 0;
> p->state = TASK_UNINTERRUPTIBLE;
>                                        state = p->state;
> p->state = TASK_RUNNING;
> mb();
> spin_unlock_wait();
>                                        *pi_lock = 1;
>
> p->state = TASK_DEAD;
>                                        if (state & TASK_UNINTERRUPTIBLE) // true
>                                          p->state = RUNNING;
>
> No?
>
do_exit() is surely buggy if spin_lock() could work in this way.
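The lost TASK_DEAD can be replayed deterministically by executing the diagram's interleaving step by step in a single thread (plain C; the TASK_* values are hypothetical, chosen only so the flag test works):

```c
enum { TASK_RUNNING = 0, TASK_UNINTERRUPTIBLE = 2, TASK_DEAD = 64 };

static int p_state = TASK_RUNNING;
static int pi_lock;  /* 0 == unlocked */

/* Single-threaded replay of the diagram above, in the order the
 * memory system is assumed to perform the accesses. Returns the
 * final p_state: TASK_DEAD is overwritten by the stale wakeup. */
static int replay_race(void)
{
    int r1, ttwu_state;

    r1 = pi_lock;                    /* CPU 2: LL, r1 == 0           */
    p_state = TASK_UNINTERRUPTIBLE;  /* CPU 1                        */
    ttwu_state = p_state;            /* CPU 2: hoisted above the SC  */
    p_state = TASK_RUNNING;          /* CPU 1                        */
    /* CPU 1: mb(); spin_unlock_wait() still sees the lock free,
     * because CPU 2's SC has not become visible yet.               */
    pi_lock = 1;                     /* CPU 2: SC becomes visible    */
    p_state = TASK_DEAD;             /* CPU 1                        */
    if (ttwu_state & TASK_UNINTERRUPTIBLE)
        p_state = TASK_RUNNING;      /* CPU 2: wakes the dead task   */
    (void)r1;
    return p_state;
}
```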
> And smp_mb__before_spinlock() looks wrong too then.
>
Maybe not? smp_mb__before_spinlock() is used before a LOCK operation,
which has both a LOAD part and a STORE part, unlike spin_unlock_wait().
> Oleg.
>