Re: [PATCH 2/4] swait: add the missing killable swaits

From: Luis R. Rodriguez
Date: Wed Jul 12 2017 - 17:33:24 EST

On Fri, Jun 30, 2017 at 12:50:03AM +0200, Luis R. Rodriguez wrote:
> ie, I expect the combination of both to fix your issues, not just the last
> series I just posted [0]. If you want this in git form you can find all of
> the patches bundled on the 20170629-fw-fixes-wait-v4 branch [1]. I just
> wrote this patch but it seems to have not broken the tests
> From cb7fee12c6d539405793e883dfd79e0b21c2baad Mon Sep 17 00:00:00 2001
> From: "Luis R. Rodriguez" <mcgrof@xxxxxxxxxx>
> Date: Thu, 29 Jun 2017 15:19:04 -0700
> Subject: [RFT] firmware: send wake up on failure for batched requests
> Fix batched requests from waiting forever on failure.
> The firmware API supports "batched requests" which means requests with
> the same name share the same lookup effort. They wait for the first
> request to complete, however they are set to always wait for what seems
> to be forever (MAX_SCHEDULE_TIMEOUT).
> We currently handle informing waited batched requests on success but we
> never seem to have sent smoke signals of any kind on failure! This
> means secondary requests batched in just wait forever when the first
> request fails.
> For device drivers with optional firmware schemes (Intel, or Netronome),
> this could mean that when you boot a system with multiple cards the
> firmware will seem to never load on the system, or that the card is just
> not responsive even though the driver initialized. Due to differences in
> scheduling this may not always trigger; requests actually need to be
> batched for this to be an issue.
> It's reported that at least with the Intel WiFi cards on one system this
> issue was creeping up on 50% of the boots [0].
> [0]
> Reported-by: Nicolas <nbroeking@xxxxxx>
> Reported-by: John Ewalt <jewalt@xxxxxxxxxxxxxxxxxx>
> Reported-by: Jakub Kicinski <jakub.kicinski@xxxxxxxxxxxxx>
> Signed-off-by: Luis R. Rodriguez <mcgrof@xxxxxxxxxx>
> ---

FWIW I wrote a test case for this and, as I expected, the patch fixes the
last remaining issue I was aware of with using multiple cards and the
firmware API.
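
To illustrate the failure mode described in the commit log above, here is
a minimal userspace sketch, assuming pthreads as a stand-in for the
kernel's wait machinery. It is not the firmware_class code and not the
test case mentioned above; fw_req, fw_req_finish(), fw_req_wait() and
fw_waiter() are hypothetical names made up for this model. The point it
shows is that the first request has to wake batched waiters on failure as
well as on success, since the waiters block with no timeout.

```c
/* Userspace model only: pthreads stand in for the kernel wait primitives. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

enum fw_state { FW_PENDING, FW_DONE, FW_ABORTED };

/* Hypothetical shared state for all requests batched on one firmware name. */
struct fw_req {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	enum fw_state state;
};

static void fw_req_init(struct fw_req *r)
{
	pthread_mutex_init(&r->lock, NULL);
	pthread_cond_init(&r->cond, NULL);
	r->state = FW_PENDING;
}

/* Called by the first request once its lookup ends, whether it worked or not. */
static void fw_req_finish(struct fw_req *r, bool ok)
{
	pthread_mutex_lock(&r->lock);
	assert(r->state == FW_PENDING);	/* a request completes only once */
	r->state = ok ? FW_DONE : FW_ABORTED;
	/* Wake every batched waiter on failure too, not only on success. */
	pthread_cond_broadcast(&r->cond);
	pthread_mutex_unlock(&r->lock);
}

/* Batched (secondary) request: blocks without a timeout until finished. */
static int fw_req_wait(struct fw_req *r)
{
	int ret;

	pthread_mutex_lock(&r->lock);
	while (r->state == FW_PENDING)
		pthread_cond_wait(&r->cond, &r->lock);
	ret = (r->state == FW_DONE) ? 0 : -1;
	pthread_mutex_unlock(&r->lock);
	return ret;
}

/* Thread entry point for a batched waiter; returns the wait result. */
static void *fw_waiter(void *arg)
{
	return (void *)(intptr_t)fw_req_wait(arg);
}
```

If fw_req_finish() only broadcast on the ok path, a thread running
fw_waiter() would block forever when ok is false, which is the hang this
model is meant to mirror.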

Determining the first affected kernel was rather hard, but it seems this
became an issue once we started supporting making the fallback mechanism
optional via commit bba3a87e982ad5 ("firmware: Introduce
request_firmware_direct()"), merged in v3.14.

Will follow up with patches.