Re: [PATCH] firmware: dmi_scan: Drop dmi_initialized
From: Jean Delvare
Date: Mon Sep 25 2017 - 05:00:19 EST
On Sun, 24 Sep 2017 11:16:25 +0200, Ingo Molnar wrote:
> * Jean Delvare <jdelvare@xxxxxxx> wrote:
> > On Sat, 23 Sep 2017 12:50:31 +0200, Ingo Molnar wrote:
> > > * Jean Delvare <jdelvare@xxxxxxx> wrote:
> > > > I don't think it makes sense to check for a possible bad
> > > > initialization order at run time on every system when it is all
> > > > decided at build time.
> > > >
> > > > A more efficient way to make sure developers do not introduce new
> > > > calls to dmi_check_system() too early in the initialization sequence
> > > > is to simply document the expected call order. That way, developers
> > > > have a chance to get it right immediately, without having to
> > > > test-boot their kernel, wonder why it does not work, and parse the
> > > > kernel logs for a warning message. And we get rid of the run-time
> > > > performance penalty as a nice side effect.
> > >
> > > Huh? Initialization ordering requirements are very opaque,
> >
> > They were. Now they are very documented.
> >
> > > and by removing the debug check any such bugs are actively hidden. How
> > > is documentation supposed to uncover such bugs once they happen?
> >
> > You are looking at it the wrong way around. Documentation is how they
> > do not happen in the first place.
>
> That expectation, as a general statement, is very naive and contrary to
> experience: documentation is fine for one layer of defense to prevent bugs, but
> _when_ they happen and a bug slips through, documentation does not help anymore,
> because the dependencies in the _code_ are opaque and non-obvious ...
Seriously... dmi_scan_machine must be called before dmi_check_system,
how is that "opaque"? Non-obvious, maybe, hence the need to document it.
I can find 169 occurrences of "(must|has to|should) be called
(before|after)" in the kernel source tree, plus 19 occurrences of "call
this function (before|after)" so apparently I'm not the only fool who
thinks documenting such ordering requirements is worthwhile.
> For example during the early SMP efforts of Linux we used to document lock
> dependencies as well, but once the kernel had more than a dozen spinlocks we
> periodically ran into deadlocks and the whole design became unmaintainable
> quickly. So we have lockdep in addition to documentation.
And lockdep is good, because it helps solve a very complex problem, and
is able to catch potential issues before they happen. I'm very happy
that bright people designed it, and I am well aware that it caught a
huge number of bugs.
Also note that lockdep can be disabled, so you don't suffer the
overhead at run-time if you don't want to. There is no way to disable
the code you added back then. I wish I could make your check depend on
some general kernel debugging option, unfortunately CONFIG_DEBUG_KERNEL
covers so much that it is almost impossible to do without it and it
ends up being enabled even on production kernels. Maybe we'd need a
separate option for developer mode (something like
CFG80211_DEVELOPER_WARNINGS but generalized to the whole kernel.)
And yes, I am well aware that the performance penalty of your check was
nowhere close to that of lockdep. But add a bit here and a bit there...
More and more drivers are calling dmi_check_system.
> > You hit this problem once, 9 years ago. You thought it would have been easier to
> > debug if there was a warning, and you added it.
> > [...] It was one way to solve the problem but I claim it was not the best.
>
> I did not just 'think' it would have been easier to debug, I wasted time on that
> bug and a warning would have helped so I added it. That was and remains
> objectively true.
>
> While I expect most such warnings to never see any public email lists (because
> once a developer triggers it it gets fixed without the bug ever getting triggered
> by others), yet searching for "dmi check: not initialized yet" still finds a
> couple of incidents where real or potential bugs were found by this init
> dependency check, such as:
>
> http://lists.infradead.org/pipermail/linux-arm-kernel/2014-September/289347.html
Here it indeed helped spot a bug quickly. But I have little doubt the
issue would have been found fast even without it, as disabling DMI
support completely on half of the arm64 systems would hardly go
unnoticed for long.
> or this:
>
> https://www.spinics.net/lists/linux-acpi/msg28698.html
>
> ... so this warning actually helped a number of kernel developers to not
> waste time on the opaque dependency.
But here it was a case where it did not matter, and developers ended up
silencing the warning instead of actually fixing the ordering. Without
your warning, they would actually have saved time.
A hack which will break as soon as my patch named "firmware: dmi:
Optimize dmi_matches" [1] hits mainline, BTW. Thanks for the pointer,
I'll get in touch with the involved developers to find a proper fix.
[1] https://marc.info/?l=linux-kernel&m=150113709717948&w=2
> This is a warning that was added due
> to an _actual category of bugs_, which has been triggered subsequently as
> well, so it's not a frivolous warning by any meaning.
I don't see any category here (unlike lockdep.) That's a single bug,
with 2 known occurrences in 10 years.
> > (...)
> > This was FYI. I maintain this subsystem, and you did not convince me. I also
> > can't see a general trend of implementing what you suggest in the rest of the
> > kernel. Thankfully.
>
> I find the arrogance displayed here breathtaking as well - the last thing we need
> is for firmware interfacing kernel code to become _more_ fragile.
We don't need it to become slower, more bloated and less documented
either. And there's a trade-off in everything.
> This was and continues to be a useful warning - but what worries me even more is
> not just the removal of the warning, but the false and technically invalid
> justifications under which it is removed...
Then we have that in common. While reading the code and its history, I
was worried that the justification to add this warning in the first
place was technically weak. Not every coding error must automatically
translate to a patch to make the code robust against said error.
Sometimes you just have to admit that you did not pay attention as you
should have, fix your mistake, possibly document it for others, and
move on. Otherwise we end up with slow bloated code.
--
Jean Delvare
SUSE L3 Support