Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()

From: Brian Norris
Date: Tue May 19 2020 - 20:47:41 EST


Hi Luis,

On Tue, May 19, 2020 at 7:02 AM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
> On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote:
> > On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@xxxxxxxxxxxxxxxx> wrote:
> > > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> > > detect that the device is really wedged enough that the only way we can
> > > still try to recover is by completely unbinding the driver from it, then
> > > we give userspace a uevent for that. I don't remember exactly how and
> > > where that gets used (ChromeOS) though, but it'd be nice to have that
> > > sort of thing as part of the infrastructure, in a sort of two-level
> > > notification?
> >
> > <slight side track>
> > We use this on certain devices where we know the underlying hardware
> > has design issues that may lead to device failure
>
> Ah, after reading below I see you meant for iwlwifi.

Sorry, I was replying to Johannes, who I believe had his "we"="Intel"
hat (as iwlwifi maintainer) on, and was pointing at
iwl_trans_pcie_removal_wk().

> If userspace can indeed grow to support this, that would be fantastic.

Well, Chrome OS tailors its user space a bit more to the hardware (and
kernel/drivers in use) than the average distro might. We already do
this (for some values of "this") today. Is that "fantastic" to you? :D

> > -- then when we see
> > this sort of unrecoverable "firmware-death", we remove the
> > device[*]+driver, force-reset the PCI device (SBR), and try to
> > reload/reattach the driver. This all happens by way of a udev rule.
>
> So you've sprikled your own udev event here as part of your kernel delta?

No kernel delta -- the event is there already:
iwl_trans_pcie_removal_wk()
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireless/intel/iwlwifi/pcie/trans.c?h=v5.6#n2027

And you can see our udev rules and scripts, in all their ugly details
here, if you really care:
https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/master/net-wireless/iwlwifi_rescan/files/

> > We
> > also log this sort of stuff (and metrics around it) for bug reports
> > and health statistics, since we really hope to not see this happen
> > often.
>
> Assuming perfection is ideal but silly. So, what infrastructure do you
> use for this sort of issue?

We don't yet log firmware crashes generally, but for all our current
crash reports (including WARN()), they go through this:
https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/README.md

For example, look for "cut here" in:
https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/anomaly_detector.cc

For other specific metrics (like counting "EVENT=INACCESSIBLE"), we
use the Chrome UMA system:
https://chromium.googlesource.com/chromiumos/platform2/+/master/metrics/README.md

I don't imagine the "infrastructure" side of any of that would be
useful to you, but maybe the client-side gathering can at least show
you what we do.

> > [*] "We" (user space) don't actually do this...it happens via the
> > 'remove_when_gone' module parameter abomination found in iwlwifi.
>
> BTW is this likely a place on iwlwifi where the firmware likely crashed?

iwl_trans_pcie_removal_wk() is triggered because HW accesses timed out
in a way that is likely due to a dead PCIe endpoint. It's not directly
a firmware crash, although there may be firmware crashes reported
around the same time.

Intel folks can probably give a more nuanced answer, but their
firmware crashes usually land in something like iwl_mvm_nic_error() ->
iwl_mvm_dump_nic_error_log() + iwl_mvm_nic_restart(). If you can make
your proposal less punishing (e.g., removing the "taint" as Johannes
requested), maybe iwlwifi would be another good candidate for
$subject. iwlwifi generally expects to recover seamlessly, but it's
also good to know if you've seen a hundred of these in a row.

> > A uevent
> > would suit us very well I think, although it would be nice if drivers
> > could also supply some small amount of informative text along with it
>
> A follow up to this series was to add a uevent to add_taint(), however
> since a *count* is not considered I think it is correct to seek
> alternatives at this point. The leaner the solution the better though.
>
> Do you have a pointer to what guys use so I can read?

Hopefully the above pointers are useful to you. We don't yet have
firmware-crash parsing implemented, so I don't have pointers for that
piece yet.

> > (e.g., a sort of "reason code", in case we can possibly aggregate
> > certain failure types). We already do this sort of thing for WARN()
> > and friends (not via uevent, but via log parsing; at least it has nice
> > "cut here" markers!).
>
> Indeed, similar things can indeed be argued about WARN*()... this
> however can be non-device specific. With panic-on-warn becoming a
> "thing", the more important it becomes to really tally exactly *why*
> these WARN*()s may trigger.

panic-on-warn? Yikes. I guess such folks don't run mac80211... I'm
probably not supposed to publicize information related to how many
Chromebooks are out there, but we regularly see a scary number of
non-fatal warnings (WARN(), WARN_ON(), etc.) logged by Chrome OS users
every day, and a scary number of those are in WiFi drivers...

> > Perhaps
>
> Note below.

(My use of "perhaps" is only because I'm not familiar with devlink at all.)

> > devlink (as proposed down-thread) would also fit the bill. I
> > don't think sysfs alone would fit our needs, as we'd like to process
> > these things as they happen, not only when a user submits a bug
> > report.
>
> I think we've reached a point where using "*Perhaps*" does not suffice,
> and if there is already a *user* of similar desired infrastructure I
> think we should jump on the opportunity to replace what you have with
> something which could be used by other devices / subsystems which
> require firmware. And indeed, also even consider in the abstract sense,
> the possibility to leverage something like this for WARN*()s later too.

To precisely lay out what Chrome OS does today:

* WARN() and similar: implemented, see anomaly_detector.cc above. It's
just log parsing, and that handy "cut here" stuff -- I'm nearly
certain there are other distros using this already? A uevent would
probably be nice, but log parsing is good enough for us today.
* EVENT=INACCESSIBLE: iwlwifi-specific, but reference code is linked
above. It's a uevent, and we handle it via udev. Code is linked above.
* Other firmware crashes: not yet implemented here, but we're looking
at it. Currently, we plan to do something similar to WARN(), but it
will be ugly and non-generic. A uevent would be a lovely replacement,
if it has some basic metadata with it. Stats in sysfs might be icing
on the cake, but mostly redundant for us, if we can grab it on the fly
via uevent.

HTH,
Brian