Re: [PATCH 00/15] net: taint when the device driver firmware crashes

From: Luis Chamberlain
Date: Mon May 11 2020 - 10:11:26 EST


On Sat, May 09, 2020 at 11:35:46AM -0700, Jakub Kicinski wrote:
> On Sat, 9 May 2020 04:35:37 +0000 Luis Chamberlain wrote:
> > Device driver firmware can crash, and sometimes, this can leave your
> > system in a state which makes the device or subsystem completely
> > useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> > of scraping some magical words from the kernel log, which is driver
> > specific, is much easier. So instead this series provides a helper which
> > lets drivers annotate this and shows how to use this on networking
> > drivers.
> >
> > My methodology for finding when firmware crashes is to git grep for
> > "crash" and then study the code to see if this is indeed
> > a place where the firmware crashes. In some places this is quite
> > obvious.
> >
> > I'm starting off with networking first, if this gets merged later on I
> > can focus on the other drivers, but I already have some work done on
> > other subsystems.
> >
> > Review, flames, etc are greatly appreciated.
>
> Tainting itself may be useful, but that's just the first step. I'd much
> rather see folks start using the devlink health infrastructure. Devlink
> is netlink based, but it's _not_ networking specific (many of its
> optional features obviously are, but don't let that mislead you).
>
> With devlink health we get (a) a standard notification on the failure;
> (b) information/state dump in a (somewhat) structured form, which can be
> collected & shared with vendors; (c) automatic remediation (usually
> device reset of some scope).

It indeed sounds very useful!

> Now regarding the tainting - as I said it may be useful, but don't we
> have to define what constitutes a "firmware crash"?

Yes indeed, I missed clarifying this in the documentation. I'll do so
in my next respin.

> There are many
> failure modes, some perfectly recoverable (e.g. processing queue hang),
> some mere bugs (e.g. device fails to initialize some functions). All of
> them may impact the functioning of the system. How do we choose those
> that taint?

It's up to the maintainers of the device driver. What I was aiming for
were those firmware crashes which indeed *can* have an impact on user
experience, and can *even* potentially require a driver removal / addition
to get things back in order again.

Luis