Re: BUG_ON() in workingset_node_shadows_dec() triggers

From: Willy Tarreau
Date: Wed Oct 05 2016 - 15:06:37 EST


On Wed, Oct 05, 2016 at 08:52:54AM -0700, Linus Torvalds wrote:
> On Tue, Oct 4, 2016 at 10:44 PM, Willy Tarreau <w@xxxxxx> wrote:
> >
> > I think instead we should completely remove any simple way to halt the
> > system and document how to do it.
>
> Having slept on it, I suspect you're right. I worry about some
> BUG_ON() that really relies on the killing behavior, but if it takes a
> "real" fault later, that is when it gets killed. And on the whole,
> we've had lots of problems with the killing behavior over the years,
> so we should just try switching BUG_ON() over to non-fatal. It's
> unlikely to be worse than what we have now, as exemplified by this
> event.

I have the same doubts, so I would not want to run the "sed"
immediately, at least to keep the initial intent. But I think that while
everyone is right in his own yard when he puts a BUG_ON() where he doesn't
know how to handle an unsafe situation, he's wrong from a global perspective.

For example, it could be seen as safe to crash the system in a filesystem
driver to protect against the risk of data corruption resulting from an
impossible condition, but when this happens due to a dirty FS on a USB
stick that a person inserts into the PC to save her work, actually the
BUG_ON() is the one responsible for the data loss. Even something as
painful as leaving a process in D state in this situation would have
been cleaner as it would let the admin reboot when he wants and not
have to experience it at the worst moment.

I've already met 100% reproducible panics that I never had the time to
investigate (one involving running an mmap-based hex editor on /dev/mem,
and the other one doing stupid things with mount --move), and I'm sure
once I find the cause I'll see a BUG_ON() that should have been a warning.

I'm pretty sure there are historically valid BUG_ON() that are probably
not needed anymore just like I'm also convinced that some of them are
hard to get rid of. Maybe at least having the same as WARN_ON() but
prepending the dump with a message saying "you encountered a critical
bug which should have crashed the kernel, you must absolutely report it"
would help at the beginning.
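
To make the idea a bit more concrete, here is a rough userspace sketch of
what I mean, not kernel code and not a proposed patch. The names
CRITICAL_WARN_ON(), critical_warn() and counter_dec() are made up for
illustration, and a real version would presumably build on the existing
WARN_ON() machinery so that the usual dump follows the message. The only
point is that the check reports loudly and lets the caller bail out
instead of killing everything:

```c
#include <stdio.h>

/* Hypothetical userspace mock: counts how often a check fired so that
 * the behaviour can be observed from the outside. */
static int critical_warn_count;

/* Prints the "please report" banner followed by the location of the
 * failed check, and returns non-zero if the condition was true. */
static int critical_warn(int cond, const char *expr,
			 const char *file, int line)
{
	if (!cond)
		return 0;

	critical_warn_count++;
	fprintf(stderr,
		"you encountered a critical bug which should have crashed "
		"the kernel, you must absolutely report it\n");
	fprintf(stderr, "WARNING: at %s:%d: %s\n", file, line, expr);
	return 1;
}

/* Same calling convention as WARN_ON(): evaluates to the truth value of
 * the condition so it can be used inside an if (). */
#define CRITICAL_WARN_ON(cond) \
	critical_warn(!!(cond), #cond, __FILE__, __LINE__)

/* Example caller: refuses to underflow a counter instead of halting. */
static int counter_dec(int *count)
{
	if (CRITICAL_WARN_ON(*count <= 0))
		return -1;	/* report and back out, the system stays up */

	(*count)--;
	return 0;
}
```

With this, a caller that hits the "impossible" condition gets an error
back and the user gets a chance to save his work and report the problem.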

Cheers,
Willy