Re: [PATCH v1 1/3] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

From: David Hildenbrand
Date: Thu Sep 22 2022 - 10:12:44 EST


On 21.09.22 06:40, Kalle Valo wrote:
David Hildenbrand <david@xxxxxxxxxx> writes:

Linus notes [1] that the introduction of new code that uses VM_BUG_ON()
is just as bad as BUG_ON(), because it will crash the kernel on
distributions that enable CONFIG_DEBUG_VM (like Fedora):

VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally
no different, the only difference is "we can make the code smaller
because these are less important". [2]

This resulted in a more generic discussion about usage of BUG() and
friends. While there might be corner cases that still deserve a BUG_ON(),
most BUG_ON() cases should simply use WARN_ON_ONCE() and implement a
recovery path if reasonable:

The only possible case where BUG_ON can validly be used is "I have
some fundamental data corruption and cannot possibly return an
error". [2]

As a very good approximation is the general rule:

"absolutely no new BUG_ON() calls _ever_" [2]

... not even if something really shouldn't ever happen and is merely for
documenting that an invariant always has to hold. However, there are sill
exceptions where BUG_ON() may be used:

If you have a "this is major internal corruption, there's no way we can
continue", then BUG_ON() is appropriate. [3]

There is only one good BUG_ON():

Now, that said, there is one very valid sub-form of BUG_ON():
BUILD_BUG_ON() is absolutely 100% fine. [2]

While WARN will also crash the machine with panic_on_warn set, that's
exactly to be expected:

So we have two very different cases: the "virtual machine with good
logging where a dead machine is fine" - use 'panic_on_warn'. And
the actual real hardware with real drivers, running real loads by
users. [4]

The basic idea is that warnings will similarly get reported by users
and be found during testing. However, in contrast to a BUG(), there is a
way to actually influence the expected behavior (e.g., panic_on_warn)
and to eventually keep the machine alive to extract some debug info.

Ingo notes that not all WARN_ON_ONCE cases need recovery. If we don't ever
expect this code to trigger in any case, recovery code is not really
helpful.

I'd prefer to keep all these warnings 'simple' - i.e. no attempted
recovery & control flow, unless we ever expect these to trigger.
[5]

There have been different rules floating around that were never properly
documented. Let's try to clarify.

[1] https://lkml.kernel.org/r/CAHk-=wiEAH+ojSpAgx_Ep=NKPWHU8AdO3V56BXcCsU97oYJ1EA@xxxxxxxxxxxxxx
[2] https://lore.kernel.org/r/CAHk-=wg40EAZofO16Eviaj7mfqDhZ2gVEbvfsMf6gYzspRjYvw@xxxxxxxxxxxxxx
[2] https://lkml.kernel.org/r/CAHk-=wit-DmhMfQErY29JSPjFgebx_Ld+pnerc4J2Ag990WwAA@xxxxxxxxxxxxxx
[4] https://lore.kernel.org/r/CAHk-=wgF7K2gSSpy=m_=K3Nov4zaceUX9puQf1TjkTJLA2XC_g@xxxxxxxxxxxxxx
[5] https://lore.kernel.org/r/YwIW+mVeZoTOxn%2F4@xxxxxxxxx

Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>

[...]

+Use WARN_ON_ONCE() rather than WARN() or WARN_ON()
+**************************************************
+
+WARN_ON_ONCE() is generally preferred over WARN() or WARN_ON(), because it
+is common for a given warning condition, if it occurs at all, to occur
+multiple times. This can fill up and wrap the kernel log, and can even slow
+the system enough that the excessive logging turns into its own, additional
+problem.

FWIW I have had cases where WARN() messages caused a reboot, maybe
mention that here? In my case the logging was so excessive that the
watchdog wasn't updated and in the end the device was forcefully
rebooted.


That should be covered by the last part, no? What would be your suggestion?

--
Thanks,

David / dhildenb