Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
From: Peter Zijlstra
Date: Mon Oct 26 2020 - 11:23:09 EST
On Mon, Oct 26, 2020 at 01:55:24PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote:
> > > That smells like the same issue reported here:
> > >
> > > https://lkml.kernel.org/r/20201022111700.GZ2651@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > >
> > > Make sure you have commit:
> > >
> > > f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
> > >
> > > (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?
> >
> > Yes, CONFIG_DEBUG_PREEMPT is enabled.
>
> Bummer :/
>
> > I'll try with that commit and let you know, however it's gonna take a
> > few hours to build a kernel and run all fstests (on that test box it
> > takes over 3 hours) to confirm that fixes the issue.
>
> *ouch*, 3 hours is painful. How long to make it sick with the current
> kernel? quicker I would hope?
>
> > Thanks for the quick reply!
>
> Anyway, I don't think that commit can actually explain the issue :/
>
> The false positive on lockdep_assert_held() happens when the recursion
> count is !0, however we _should_ be having IRQs disabled when
> lockdep_recursion > 0, so that should never be observable.
>
> My hope was that DEBUG_PREEMPT would trigger on one of the
> __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would
> then be a clear violation.
>
> And you're seeing this on x86, right?
>
> Let me puzzle moar..

So I might have an explanation for the Sparc64 fail, but that can't
explain x86 :/

I initially thought raw_cpu_read() was OK, since if it is !0 we have
IRQs disabled and can't get migrated, so if we get migrated both CPUs
must have 0 and it doesn't matter which 0 we read.

And while that is true, it isn't the whole story: on pretty much all
architectures (except x86) this can result in computing the address for
one CPU, getting migrated, the old CPU continuing execution with another
task (possibly setting recursion) and then the new CPU reading the value
of the old CPU, which is no longer 0.
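
To make that interleaving concrete, here is a rough user-space model of
it. This is not kernel code and only a sketch: the per-CPU variable is
modelled as a plain array, the migration is a simple assignment, and all
names (lockdep_recursion_model, task_cpu, percpu_slot(), racy_read(),
pinned_read(), reset_model()) are made up for illustration.

```c
#include <assert.h>

#define NR_CPUS 2

/* Hypothetical stand-in for the per-CPU lockdep_recursion counter. */
static unsigned int lockdep_recursion_model[NR_CPUS];

/* CPU the modelled task is currently running on. */
static int task_cpu;

/* Put the model back into its initial state: all counters 0, task on CPU 0. */
static void reset_model(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		lockdep_recursion_model[cpu] = 0;
	task_cpu = 0;
}

/* Step 1 of a non-atomic per-CPU read: compute the slot address for a CPU. */
static unsigned int *percpu_slot(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return &lockdep_recursion_model[cpu];
}

/*
 * Two-step read with a migration between address computation and load,
 * which is the window left open when nothing pins the task to its CPU.
 */
static unsigned int racy_read(void)
{
	unsigned int *slot = percpu_slot(task_cpu);	/* address for CPU 0 */

	task_cpu = 1;			/* task gets migrated to CPU 1 */
	lockdep_recursion_model[0]++;	/* another task on CPU 0 sets recursion */

	return *slot;			/* load still targets CPU 0's counter */
}

/* Address computation and load happen for the same CPU, no window. */
static unsigned int pinned_read(void)
{
	return *percpu_slot(task_cpu);
}
```

Calling reset_model() and then racy_read() yields the old CPU's non-zero
count even though the task itself never set recursion, while a following
pinned_read() sees the 0 of the CPU the task actually runs on.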

I already fixed a bunch of that in:

  baffd723e44d ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"")

but clearly this one got crossed.

Still, that leaves me puzzled over you seeing this on x86 :/

Anatoly, could you try linus+tip/locking/urgent and the below on your
Sparc, please?

---
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 3e99dfef8408..a3041463e42d 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -84,7 +84,7 @@ static inline bool lockdep_enabled(void)
 	if (!debug_locks)
 		return false;
 
-	if (raw_cpu_read(lockdep_recursion))
+	if (this_cpu_read(lockdep_recursion))
 		return false;
 
 	if (current->lockdep_recursion)