Re: [bisected] system hang after boot

From: Martin Schwidefsky
Date: Mon Nov 27 2017 - 07:50:27 EST


On Mon, 27 Nov 2017 11:49:48 +0000
Will Deacon <will.deacon@xxxxxxx> wrote:

> Hi Peter,
>
> On Wed, Nov 22, 2017 at 09:22:17PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 22, 2017 at 06:26:59PM +0000, Will Deacon wrote:
> >
> > > Now, I can't see what the break_lock is doing here other than causing
> > > problems. Is there a good reason for it, or can you just try removing it
> > > altogether? Patch below.
> >
> > The main use is spin_is_contended(), which in turn ends up used in
> > __cond_resched_lock() through spin_needbreak().
> >
> > This allows better lock wait times for PREEMPT kernels on platforms
> > where the lock implementation itself cannot provide 'contended' state.
> >
> > In that capacity the write-write race shouldn't be a problem though.
>
> I'm not sure why it isn't a problem: given that the break_lock variable
> can read as 1 for a lock that is no longer contended and 0 for a lock that
> is currently contended, then the __cond_resched_lock is likely to see a
> value of 0 (i.e. spin_needbreak always return false) more often than no
> since it's checked by the lock holder.

Grepping for 'break_lock' the two locking blueprints are the only places
where the field is written to. Unless I am blind, the associated unlock
functions do *not* reset 'break_lock'.

Without the raw_##op##_can_lock(lock) check the first of the blueprints
now looks like this:

void __lockfunc __raw_##op##_lock(locktype##_t *lock) \
{ \
for (;;) { \
preempt_disable(); \
if (likely(do_raw_##op##_trylock(lock))) \
break; \
preempt_enable(); \
\
if (!(lock)->break_lock) \
(lock)->break_lock = 1; \
while ((lock)->break_lock) \
arch_##op##_relax(&lock->raw_lock); \
} \
(lock)->break_lock = 0; \
} \

All it takes to create an endless loop is two CPUs, the first acquired the
lock and the second tries to get the lock. After the unsuccessful trylock
of the second CPU, the first CPU releases the lock and never tries to take
it again. The second CPU will be stuck in an endless loop.

I guess my best course of action is to remove GENERIC_LOCKBREAK from
arch/s390/Kconfig to avoid this construct altogether. Let us see what
breaks if I do that ..

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.