Re: [PATCH 00/15] net: taint when the device driver firmware crashes

From: Jakub Kicinski
Date: Sat May 09 2020 - 14:35:55 EST


On Sat, 9 May 2020 04:35:37 +0000 Luis Chamberlain wrote:
> Device driver firmware can crash, and sometimes, this can leave your
> system in a state which makes the device or subsystem completely
> useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> of scraping some magical words from the kernel log, which is driver
> specific, is much easier. So instead this series provides a helper which
> lets drivers annotate this and shows how to use this on networking
> drivers.
>
> My methodology for finding when firmware crashes is to git grep for
> "crash" and then study the code to see if this is indeed a place
> where the firmware crashes. In some places this is quite obvious.
>
> I'm starting off with networking first, if this gets merged later on I
> can focus on the other drivers, but I already have some work done on
> other subsystems.
>
> Review, flames, etc are greatly appreciated.

Tainting itself may be useful, but that's just the first step. I'd much
rather see folks start using the devlink health infrastructure. Devlink
is netlink based, but it's _not_ networking specific (many of its
optional features obviously are, but don't let that mislead you).

With devlink health we get (a) a standard notification on the failure;
(b) information/state dump in a (somewhat) structured form, which can be
collected & shared with vendors; (c) automatic remediation (usually
device reset of some scope).

Now regarding the tainting - as I said it may be useful, but don't we
have to define what constitutes a "firmware crash"? There are many
failure modes, some perfectly recoverable (e.g. processing queue hang),
some mere bugs (e.g. device fails to initialize some functions). All of
them may impact the functioning of the system. How do we choose those
that taint?
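
For reference, the detection approach from the quoted cover letter (read
/proc/sys/kernel/tainted instead of scraping the log) could look roughly
like the userspace sketch below. The bit position for a firmware-crash
taint is an assumption made up for illustration, as are the helper names;
nothing in this thread fixes either.

```python
# Hypothetical bit position for a "firmware crash" taint flag. This is an
# assumption for illustration only, not a value defined in this thread.
FIRMWARE_CRASH_TAINT_BIT = 18


def read_taint_mask(path="/proc/sys/kernel/tainted"):
    """Return the kernel taint bitmask, which the file holds as a decimal integer."""
    with open(path) as f:
        return int(f.read().strip())


def is_bit_set(mask, bit):
    """Return True if the given bit is set in the taint mask."""
    return bool(mask & (1 << bit))


def set_bits(mask):
    """List every bit position that is set in the mask, lowest first."""
    return [b for b in range(mask.bit_length()) if mask & (1 << b)]


if __name__ == "__main__":
    try:
        mask = read_taint_mask()
    except OSError as err:
        # The file is absent on non-Linux systems.
        print(f"cannot read taint mask: {err}")
    else:
        print(f"taint mask: {mask}, bits set: {set_bits(mask)}")
        if is_bit_set(mask, FIRMWARE_CRASH_TAINT_BIT):
            print("firmware-crash taint (hypothetical bit) is set")
```

A set bit only says that some driver flagged a crash at some point since
boot. It does not say which device crashed or whether it recovered, which
is the kind of detail a devlink health reporter would carry.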