Re: Planned changes for bugzilla.kernel.org to reduce the "Bugzilla blues"

From: Takashi Iwai
Date: Sun Oct 02 2022 - 05:14:36 EST

On Sun, 02 Oct 2022 10:23:07 +0200,
Artem S. Tashkinov wrote:
> On 10/2/22 07:37, Takashi Iwai wrote:
> > On Sat, 01 Oct 2022 12:30:22 +0200,
> > Artem S. Tashkinov wrote:
> >> - 2 -
> >>
> >> Here's another one which is outright puzzling:
> >>
> >> You run: dmesg -t --level=emerg,crit,err
> >>
> >> And you see some non-descript errors of some kernel subsystems seemingly
> >> failing or being unhappy about your hardware. Errors are as cryptic as
> >> humanly possible, you don't even know what part of kernel has produced them.
> >>
> >> OK, as a "power" user I download the kernel source, run `grep -R message
> >> /tmp/linux-5.19` and there are _multiple_ different modules and places
> >> which contain this message.
> >>
> >> I'm lost. Send this to LKML? Did that in the long past, no one cared, I
> >> stopped.
> >>
> >> Here's what I'm getting with Linux 5.19.12:
> >>
> >> platform wdat_wdt: failed to claim resource 5: [mem
> >> 0x00000000-0xffffffff7fffffff]
> >> ACPI: watchdog: Device creation failed: -16
> >> ACPI BIOS Error (bug): Could not resolve symbol
> >> [\_SB.PCI0.XHC.RHUB.TPLD], AE_NOT_FOUND (20220331/psargs-330)
> >> ACPI Error: Aborting method \_SB.UBTC.CR01._PLD due to previous error
> >> (AE_NOT_FOUND) (20220331/psparse-529)
> >> platform MSFT0101:00: failed to claim resource 1: [mem
> >> 0xfed40000-0xfed40fff]
> >> acpi MSFT0101:00: platform device creation failed: -16
> >> lis3lv02d: unknown sensor type 0x0
> >>
> >> Are they serious? Should they be reported or not? Is my laptop properly
> >> working? I have no clue at all.
> >
> > That's a dilemma. The kernel can't know whether it's "properly"
> > working, either -- that is, whether the lack of some functions matters
> > for you or not. In your case above, it's about a watchdog, something
> > related with USB, TPM, and acceleration sensor, all of which likely
> > come from a buggy BIOS. Would you mind if those features are missing?
> > Or even whether your device has a correct hardware implementation?
> > Kernel doesn't know, hence it complains as an error.
> >
> > In many drivers, there are mechanisms to shut off superfluous error
> > messages for known devices. So it's case-by-case solutions.
> >
> > Or you can completely hide those errors at boot by a boot option
> > (e.g. loglevel=2).
> The problem is some of such messages are indeed indicative of certain
> real issues which result in HW not working properly, including:
> 1) missing/incorrect firmware
> 2) most importantly: not enabled power saving modes
> 3) not enabled high performance modes
> 4) not enabled devices
> 5) not enabled devices' functions
> 6) drivers conflicts (i.e. the wrong module gets loaded for the device)
> 7) physically failing hardware
> I'm quite sure you don't really know what half of those messages
> actually mean.

Of course: not because those messages are hard to understand, but
because those messages indicate only the cause, and the exact end
result can't always be known by the kernel at that point. Physically
failing hardware? Devices not enabled? Who knows. There
might be some alternative, even a user-space driver.

> Speaking of 7. Various kernel subsystems/drivers deal with e.g. mass
> storage which is known to fail quite often. There's not a single driver
> in the kernel which is actually brave enough to spew something like this:
> "/dev/xxxx might be failing, please RMA or seek help online"
> instead you get a dmesg choke full of "unable to read sector XXX" or
> something like that.

Oh, you suggest that we should add "please RMA or seek help online" to
each printk of KERN_ERR level, if it saves the world? ;)

IMO, what matters for users is whether the system works or not, not
how the kernel message appears. A kernel message may help with
diagnosis, but the message itself is no solution; that is, the most
important thing about a kernel message is that it indicates a real
error that can be diagnosed by developers.

If the end effect is pretty certain, a message may be more chatty. OTOH,
people are annoyed by too much verbosity, too. So it needs a sensible

> To return to the previous errors: it's impossible for the user to assess
> their severity and that sucks.

Right, that's why I wrote it's a dilemma.

> What is "platform device creation
> failed"? What is "unknown sensor type"? What am I missing? Who's
> responsible? The kernel? My HW vendor? Are those errors actionable?

All those depend on the driver implementation and the hardware
implementation. There is no general answer at all, unfortunately.

> In
> my understanding a properly working computer must not produce
> "emerg,crit,err" errors. I'm not even talking about "warn,info" and such.

Yes, some errors can be downgraded to warn or even to info.
I myself find ACPI is way too chatty, too.

So I believe something we can improve is to define clearer
guidelines for KERN_ERR level errors.
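
For reference, here is a rough sketch of how the levels mentioned in this
thread relate to each other. It is not kernel code, only a user-space
illustration under a few assumptions: the level names follow the ones
dmesg accepts for --level, a message reaches the console only when its
level number is smaller than the console loglevel, and the helper names
and sample records are made up for the example.

```python
# Kernel log levels in order of severity (0 = most severe).
# Names follow the ones accepted by "dmesg --level=...".
LEVELS = ["emerg", "alert", "crit", "err", "warn", "notice", "info", "debug"]


def decode_priority(prio):
    """Split a syslog priority value into (facility, level name).

    The low three bits carry the level, the remaining bits the facility.
    """
    return prio >> 3, LEVELS[prio & 7]


def shown_on_console(level, console_loglevel):
    """True if a message of this level would be printed to the console.

    Assumption: only levels numerically smaller than the console
    loglevel are printed, so loglevel=2 leaves just emerg and alert.
    """
    return LEVELS.index(level) < console_loglevel


def filter_levels(records, wanted):
    """Keep (level, text) records whose level is in `wanted`.

    Similar in spirit to "dmesg --level=emerg,crit,err", but operating
    on already-parsed records (a hypothetical input format).
    """
    wanted = set(wanted)
    return [(lvl, msg) for lvl, msg in records if lvl in wanted]


if __name__ == "__main__":
    # Hypothetical sample records, not real log output.
    sample = [
        ("err", "example: device creation failed"),
        ("warn", "example: falling back to defaults"),
        ("info", "example: driver loaded"),
    ]
    for lvl, msg in filter_levels(sample, ["emerg", "crit", "err"]):
        print(lvl, msg, "console:", shown_on_console(lvl, 2))
```

With these assumptions, booting with loglevel=2 hides KERN_ERR messages
from the console, while they remain in the log buffer for dmesg to show.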