Hmm, yes. What it originally boiled down to, which is why Linus first
objected to BUG(), was that we don't know what other parts of the system
have been disrupted. The best example is just that of locking: if we
BUG() or do_exit() in the middle of holding a lock, we'll wreck whatever
subsystem that lock belongs to. Without a deterministic system state
unwinder, there really isn't a "safe" way to just stop a kernel thread.

With this pkill_on_warn, we avoid the BUG problem (since the thread of
execution continues and stops at an 'expected' place: the signal
handler).
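
To make the locking hazard concrete, here is a userspace analogy only
(pthreads, not kernel code). The names worker() and
lock_state_after_worker_death() are made up for illustration, and the
sketch assumes a default, non-robust mutex on Linux/glibc. A thread that
exits while holding the lock leaves it held forever:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Worker that "dies" mid-critical-section, loosely like BUG()/do_exit(). */
static void *worker(void *arg)
{
	pthread_mutex_t *lock = arg;

	pthread_mutex_lock(lock);
	pthread_exit(NULL);	/* exits without ever unlocking */
	return NULL;		/* not reached */
}

/*
 * Start a worker that exits while holding a lock, wait for it to die,
 * then report what pthread_mutex_trylock() sees on that lock.
 */
static int lock_state_after_worker_death(void)
{
	pthread_mutex_t subsys_lock;	/* stand-in for a subsystem lock */
	pthread_t t;
	int rc;

	pthread_mutex_init(&subsys_lock, NULL);	/* default: not robust */
	rc = pthread_create(&t, NULL, worker, &subsys_lock);
	assert(rc == 0);
	pthread_join(t, NULL);

	/* The owner is gone, but nobody ever released the lock. */
	return pthread_mutex_trylock(&subsys_lock);
}
```

Under those assumptions the trylock fails with EBUSY, and a plain
pthread_mutex_lock() would simply hang. Nothing unwinds the dead
thread's state, which is the same shape of problem as stopping a kernel
thread in the middle of a critical section.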
However, now we have the newer objection from Linus, which is one of
attribution: the WARN might be hit during an "unrelated" thread of
execution and "current" gets blamed, etc. And beyond that, if we take
down a portion of userspace, what else in userspace may be destabilized?
In theory, any required daemons would be restarted by init, but nothing
guarantees that.

The safest version of this I can think of is for processes to opt into
this mitigation. That would also cover the "special cases" we've seen
exposed, i.e. init and kthreads would not opt in.
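
As a rough model of just the policy decision (not the kernel plumbing,
which is where the real work would be), something like the following.
Everything here is hypothetical: struct task_info, its opted_in flag,
and should_kill_on_warn() are names invented for this sketch and do not
correspond to any existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a task; not the kernel's task_struct. */
struct task_info {
	int pid;
	bool is_kthread;
	bool opted_in;	/* hypothetical per-process opt-in flag */
};

/* Decide whether a WARN hit in this task's context should kill it. */
static bool should_kill_on_warn(const struct task_info *t)
{
	assert(t != NULL);

	/* init and kthreads are never killed, whatever the flag says. */
	if (t->pid == 1 || t->is_kthread)
		return false;

	/* Everyone else is only killed if they explicitly asked for it. */
	return t->opted_in;
}
```

The point of the opt-in default is that anything which hasn't asked for
the mitigation keeps today's behavior.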
However, that's a lot to implement when Marco's tracing suggestion might
be sufficient and policy could be entirely implemented in userspace. It
could be as simple as this (totally untested):