Re: [RFC] Are you good with Lockdep?

From: Byungchul Park
Date: Thu Nov 12 2020 - 01:17:07 EST


On Wed, Nov 11, 2020 at 11:54:41AM +0100, Ingo Molnar wrote:
> > We cannot get anything reported other than the first one.
>
> Correct. Experience has shown that the overwhelming majority of
> lockdep reports are single-cause and single-report.
>
> This is an optimal approach, because after a decade of exorcising
> locking bugs from the kernel, lockdep is currently, most of the time,

I also think Lockdep has been doing a great job of exorcising almost
all locking bugs so far. I respect that.

> in 'steady-state', with there being no reports for the overwhelming
> majority of testcases, so the statistical probability of there being
> just one new report is by far the highest.

This is true if Lockdep is only meant to check whether maintainers'
trees are OK, and if we totally ignore how the tool could help folks
in the middle of development, especially when developing something
complicated with respect to synchronization.

But I don't agree, considering that a tool could also help while
developing something that might introduce many dependency issues.

> If on the other hand there's some bug in lockdep itself that causes
> excessive false positives, it's better to limit the number of reports
> to one per bootup, so that it's not seen as a nuisance debugging
> facility.
>
> Or if lockdep gets extended that causes multiple previously unreported
> (but very much real) bugs to be reported, it's *still* better to
> handle them one by one: because lockdep doesn't know whether it's real

Why do you think we cannot handle them one by one with
multi-reporting? We can start from the first report, just as we do
with single-reporting. That is also how we already work when, for
example, the compiler reports many errors while building the kernel:
we fix the first one and move on to the rest.
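
Just to illustrate the idea - this is only a sketch with made-up
names and sizes, not code from my patchset: instead of turning the
whole checker off at the first report, remember which distinct
problems have already been printed and keep checking.

/*
 * Sketch only: hypothetical names, not from the actual patchset.
 * Report each distinct problem once instead of disabling the
 * checker entirely on the first report.
 */
static DECLARE_BITMAP(reported_keys, MAX_REPORT_KEYS);

static bool report_once(unsigned int key)
{
	/* This key was already reported: stay quiet, keep checking. */
	if (test_and_set_bit(key, reported_keys))
		return false;

	/* First occurrence: the caller prints the full report. */
	return true;
}

Here 'key' would be whatever uniquely identifies the dependency
problem, e.g. derived from the lock classes involved.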

> > So the one who introduced the first problem should fix it as soon
> > as possible so that the other problems can be reported and fixed.
> > It is even worse when the first report is a false positive, since
> > it is worth nothing and only prevents real ones from being
> > reported.
>
> Since kernel development is highly distributed, and 90%+ of new
> commits get created in dozens of bigger and hundreds of smaller
> maintainer topic trees, the chance of getting two independent locking
> bugs in the same tree without the first bug being found & fixed is
> actually pretty low.

Again, this is true only if Lockdep is meant for checking
maintainers' trees.

> linux-next offers several weeks/months advance integration testing to
> see whether the combination of maintainer trees causes
> problems/warnings.

Good for us.

> > That's why kernel developers are so sensitive to Lockdep's false
> > positive reports - I would be, too. But precisely speaking, it's a
> > problem of how Lockdep was designed and implemented, not of false
> > positives themselves. Annoying false positives - as WARN()'s
> > messages are annoying - should be fixed, but we wouldn't have to
> > be as sensitive as we are now if the tool kept working normally
> > even after reporting.
>
> I disagree, and even for WARN()s we are seeing a steady movement
> towards WARN_ON_ONCE(): exactly because developers are usually
> interested in the first warning primarily.
>
> Followup warnings are even marked 'tainted' by the kernel - if a bug
> happened we cannot trust the state of the kernel anymore, even if it
> seems otherwise functional. This is doubly true for lockdep, where

I definitely think so. An already-tainted kernel is not a kernel we
can trust anymore. Again, IMO, a tool should help us not only with
checking almost-final trees but also while developing something. No?
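
For what it's worth, though, WARN_ON_ONCE() only silences that one
warning site; every other WARN site in the kernel keeps firing. It
is roughly this (simplified from memory; the real one lives in
include/asm-generic/bug.h):

#define WARN_ON_ONCE(condition) ({				\
	static bool __warned;					\
	int __ret = !!(condition);				\
								\
	/* Warn only the first time this site fires. */		\
	if (unlikely(__ret && !__warned)) {			\
		__warned = true;				\
		WARN_ON(1);					\
	}							\
	unlikely(__ret);					\
})

That per-site behavior is closer to what I'm suggesting than
Lockdep's global shutdown after the first report.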

> But for lockdep there's another concern: we do occasionally report
> bugs in locking facilities themselves. In that case it's imperative
> for all lockdep activity to cease & desist, so that we are able to get
> a log entry out before the kernel goes down potentially.

Sure. Makes sense.

> I.e. there's a "race to log the bug as quickly as possible", which is
> the other reason we shut down lockdep immediately. But once shut down,

Not sure I understand this part.
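
If you mean the debug_locks mechanism, as far as I can tell it is
roughly this (simplified from lib/debug_locks.c and
include/linux/debug_locks.h, from memory):

int debug_locks = 1;

static inline int __debug_locks_off(void)
{
	/*
	 * Atomically turn checking off. xchg() returns the old
	 * value, so exactly one CPU wins the race and gets to
	 * print its report before the kernel possibly dies.
	 */
	return xchg(&debug_locks, 0);
}

Every Lockdep entry point bails out once debug_locks is 0, which is
why nothing gets reported after the first problem.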

> all the lockdep data structures are hopelessly out of sync and it
> cannot be restarted reasonably.

Is it about tracking IRQ and IRQ-enabled state? That's exactly what I'd
like to point out. Or is there something else?

> Not sure I understand the "problem 2)" outlined here, but I'm looking
> forward to your patchset!

Thank you for the response.

Thanks,
Byungchul