Re: [PATCH] wifi: mwifiex: decrease timeout waiting for host sleep from 10s to 5s
From: Brian Norris
Date: Tue Dec 03 2024 - 18:05:23 EST
Hi,
On Wed, Nov 27, 2024 at 7:44 AM Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
> On Wed, Nov 27, 2024 at 2:58 AM Pin-yen Lin <treapking@xxxxxxxxxxxx> wrote:
> > In commit 52250cbee7f6 ("mwifiex: use timeout variant for
> > wait_event_interruptible") it was noted that sometimes we seemed
> > to miss the signal that our host sleep settings took effect. A
> > 10 second timeout was added to the code to make sure we didn't
> > hang forever waiting. It appears that this problem still exists
> > and we hit the timeout sometimes for Chromebooks in the field.
> >
> > Recently on ChromeOS we've started setting the DPM watchdog
> > to trip if full system suspend takes over 10 seconds. Given
> > the timeout in the original patch, obviously we're hitting
> > the DPM watchdog before mwifiex gets a chance to timeout.
The Kconfig default is 120 seconds, and it's hidden under
CONFIG_EXPERT. What makes you think 10 is a good value? (It sounds
pretty small for triggering panics.) The smallest I can see outside of
ChromeOS is 12 seconds, although 60 seconds is much more common. I
also happen to see other WiFi drivers have hit similar problems, but
they settled on 20 seconds (assuming a 60s timeout on other distros):
https://lore.kernel.org/linux-wireless/20230329162038.8637-1-kvalo@xxxxxxxxxx/
https://git.kernel.org/linus/cf5fa3ca0552f1b7ba8490de40700bbfb6979b17
Technically, this Kconfig lets you set a value as small as 1 second. I
don't think we should work at reducing all timeouts to fit that
target.
I could maybe approve lowering this one, but I'd also recommend
reconsidering your timeout value. And more questions below.
> > While we could increase the DPM watchdog in ChromeOS to avoid
> > this problem, it's probably better to simply decrease the
> > timeout. Any time we're waiting several seconds for the
> > firmware to respond it's likely that the firmware won't ever
> > respond. With that in mind, decrease the timeout in mwifiex
> > from 10 seconds to 5 seconds.
> >
> > Suggested-by: Doug Anderson <dianders@xxxxxxxxxxxx>
> > Signed-off-by: Pin-yen Lin <treapking@xxxxxxxxxxxx>
> > ---
> >
> > drivers/net/wireless/marvell/mwifiex/sta_ioctl.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
>
> I believe Brian Norris is still anointed as the personally nominally
> in charge of mwifiex upstream (get_maintainer labels him as "odd"
> fixer, which seems awfully judgemental), so he should be CCed on
> fixes. Added him.
I tend to see mwifiex patches even if I don't get CC'd, but thanks. I
wonder why get_maintainer(?) picked up Francesco properly but not me.
*shrug*
> > diff --git a/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c b/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c
> > index e06a0622973e..f79589cafe57 100644
> > --- a/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c
> > +++ b/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c
> > @@ -545,7 +545,7 @@ int mwifiex_enable_hs(struct mwifiex_adapter *adapter)
> >
> > if (wait_event_interruptible_timeout(adapter->hs_activate_wait_q,
> > adapter->hs_activate_wait_q_woken,
> > - (10 * HZ)) <= 0) {
> > + (5 * HZ)) <= 0) {
>
> Given that I suggested this fix, it should be no surprise:
>
> Reviewed-by: Douglas Anderson <dianders@xxxxxxxxxxxx>
>
> Upon sleeping on the idea, the only slight concern I have here is
> whether we'll trigger this timeout if we try to suspend while WiFi
> scanning is in progress. I don't have any actual evidence supporting
> this concern, but I remember many years ago when I used to deal with
> the WiFi drivers more often there were cases where suspend could be
> delayed if it happened while a scan was in progress. Maybe/hopefully
> that's fixed now, but I figured I'd at least bring it up in case it
> tickled anything in someone's mind...
Scans should essentially have been canceled by now, but IIUC, the
driver doesn't really force the card to stop if it's in the middle of
a scan (I'm not 100% sure if it's possible). So it's possible that
pending scans could delay this step.
I wonder what the distribution of (successful) timing is today. I'd
assume this typically take on the order of milliseconds, but do we
commonly see outliers? What if a system is fully loaded while
suspending? Can you try testing (and gather timing numbers) when
suspending soon after initiating scans? It's hard to judge what the
lower limit of this timeout should really be without any numbers, just
like it's hard to judge whether your 10 second watchdog is reasonable.
Also, for the record, since we might have to field regression reports
for other systems: what hardware (chip variant, FW version) are you
seeing problems on?
Brian