Re: sched: softlockups in multi_cpu_stop

From: Jason Low
Date: Fri Mar 06 2015 - 13:58:10 EST


On Fri, 2015-03-06 at 09:19 -0800, Davidlohr Bueso wrote:
> On Fri, 2015-03-06 at 13:32 +0100, Ingo Molnar wrote:
> > * Sasha Levin <sasha.levin@xxxxxxxxxx> wrote:
> >
> > > I've bisected this to "locking/rwsem: Check for active lock before bailing on spinning". Relevant parties Cc'ed.
> >
> > That would be:
> >
> > 1a99367023f6 ("locking/rwsem: Check for active lock before bailing on spinning")
>
> > diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> > index 1c0d11e8ce34..e4ad019e23f5 100644
> > --- a/kernel/locking/rwsem-xadd.c
> > +++ b/kernel/locking/rwsem-xadd.c
> > @@ -298,23 +298,30 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
> >  static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
> >  {
> >  	struct task_struct *owner;
> > -	bool on_cpu = false;
> > +	bool ret = true;
> >
> >  	if (need_resched())
> >  		return false;
> >
> >  	rcu_read_lock();
> >  	owner = ACCESS_ONCE(sem->owner);
> > -	if (owner)
> > -		on_cpu = owner->on_cpu;
> > -	rcu_read_unlock();
> > +	if (!owner) {
> > +		long count = ACCESS_ONCE(sem->count);
> > +		/*
> > +		 * If sem->owner is not set, yet we have just recently entered the
> > +		 * slowpath with the lock being active, then there is a possibility
> > +		 * reader(s) may have the lock. To be safe, bail spinning in these
> > +		 * situations.
> > +		 */
> > +		if (count & RWSEM_ACTIVE_MASK)
> > +			ret = false;
> > +		goto done;
>
> Hmmm so the lockup would be due to this (when owner is non-nil the patch
> has no effect), telling users to spin instead of sleep -- _except_ for
> this condition. And when spinning we're always checking for need_resched
> to be safe. So even if this function was completely bogus, we'd end up
> needlessly spinning but I'm surprised about the lockup. Maybe coffee
> will make things clearer.

Right, can_spin_on_owner() was originally added to the mutex
spinning code as an optimization, so that we avoid adding the spinner
to the OSQ only to find that it doesn't need to spin. Whether this
function returns the correct value should really only affect
performance, so yes, lockups caused by it seem surprising.
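
For reference, here is a minimal userspace sketch of the decision the
patched check makes, as I read the quoted hunk. It is not the kernel
code: struct model_task, struct model_rwsem, MODEL_ACTIVE_MASK and
model_can_spin_on_owner() are hypothetical stand-ins, the mask value is
only illustrative, and the owner-is-set branch is assumed to keep the
pre-patch behavior (use owner->on_cpu), since the quoted diff is cut off
before that point.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative only: stands in for RWSEM_ACTIVE_MASK, whose real value
 * depends on the architecture. */
#define MODEL_ACTIVE_MASK 0xffffL

/* Hypothetical stand-ins for task_struct and rw_semaphore. */
struct model_task {
	bool on_cpu;		/* is the owner currently running? */
};

struct model_rwsem {
	long count;			/* active/waiting bits */
	struct model_task *owner;	/* writer owner, NULL if not set */
};

/*
 * Model of the pre-spin check. need_resched is passed in as a plain
 * flag here instead of calling the scheduler.
 */
static bool model_can_spin_on_owner(const struct model_rwsem *sem,
				    bool need_resched)
{
	const struct model_task *owner;

	if (need_resched)
		return false;

	owner = sem->owner;
	if (!owner) {
		/*
		 * No owner recorded. If the lock is still active, readers
		 * may hold it, so do not spin. Otherwise allow spinning.
		 */
		return (sem->count & MODEL_ACTIVE_MASK) == 0;
	}

	/* Assumed unchanged from the pre-patch code: spin only while the
	 * owner is running on a CPU. */
	return owner->on_cpu;
}
```

In this model the only case the patch changes is owner == NULL: it now
says "spin" when no active bits are set and "don't spin" otherwise. A
wrong answer here would only send a task down the spin path when it
should have slept, or the other way around.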
