Re: [PATCH] wifi: mwifiex: decrease timeout waiting for host sleep from 10s to 5s

From: Doug Anderson
Date: Tue Dec 03 2024 - 20:39:19 EST


Hi,

On Tue, Dec 3, 2024 at 3:05 PM Brian Norris <briannorris@xxxxxxxxxxxx> wrote:
>
> Hi,
>
> On Wed, Nov 27, 2024 at 7:44 AM Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
> > On Wed, Nov 27, 2024 at 2:58 AM Pin-yen Lin <treapking@xxxxxxxxxxxx> wrote:
> > > In commit 52250cbee7f6 ("mwifiex: use timeout variant for
> > > wait_event_interruptible") it was noted that sometimes we seemed
> > > to miss the signal that our host sleep settings took effect. A
> > > 10 second timeout was added to the code to make sure we didn't
> > > hang forever waiting. It appears that this problem still exists
> > > and we hit the timeout sometimes for Chromebooks in the field.
> > >
> > > Recently on ChromeOS we've started setting the DPM watchdog
> > > to trip if full system suspend takes over 10 seconds. Given
> > > the timeout in the original patch, obviously we're hitting
> > > the DPM watchdog before mwifiex gets a chance to timeout.
>
> The Kconfig default is 120 seconds, and it's hidden under
> CONFIG_EXPERT. What makes you think 10 is a good value? (It sounds
> pretty small for triggering panics.) The smallest I can see outside of
> ChromeOS is 12 seconds, although 60 seconds is much more common. I
> also happen to see other WiFi drivers have hit similar problems, but
> they settled on 20 seconds (assuming a 60s timeout on other distros):
> https://lore.kernel.org/linux-wireless/20230329162038.8637-1-kvalo@xxxxxxxxxx/
> https://git.kernel.org/linus/cf5fa3ca0552f1b7ba8490de40700bbfb6979b17
>
> Technically, this Kconfig lets you set a value as small as 1 second. I
> don't think we should work at reducing all timeouts to fit that
> target.
>
> I could maybe approve lowering this one, but I'd also recommend
> reconsidering your timeout value. And more questions below.

That's fair. I guess having a 10 second timeout for full system
suspend didn't seem totally crazy to me. If a system is taking more
than 10 seconds to do a full system suspend then that seems like
something is pretty broken. I guess it's somewhat like the same
argument that the WiFi driver had for picking 10 seconds but applied
to the whole system level, and I guess that's where we get into
trouble. That made me think that even 5 seconds seems a bit long for
any given driver to suspend. ...but yeah, it's squishy.

Maybe the ChromeOS should change to 15 seconds for the DPM Watchdog
timer and that's a better solution and leave the WiFi driver how it
is?

Another thought: I wonder if it's possible to be dynamic and somehow
set the timeout as "max(10, dpm_watchdog_timeout / 2)". Not that I've
checked to see if the mwifiex can actually query the DPM watchdog
timeout... ;-)

...also, it sure seems like if we're going to choose a value so low
that we shouldn't panic. All of our other watchdogs that panic aren't
so short, so probably this one shouldn't be either. Maybe we could
submit a patch to make the DPM watchdog just abort the suspend if
that's not too hard (and if the power people accept it).


> > > While we could increase the DPM watchdog in ChromeOS to avoid
> > > this problem, it's probably better to simply decrease the
> > > timeout. Any time we're waiting several seconds for the
> > > firmware to respond it's likely that the firmware won't ever
> > > respond. With that in mind, decrease the timeout in mwifiex
> > > from 10 seconds to 5 seconds.
> > >
> > > Suggested-by: Doug Anderson <dianders@xxxxxxxxxxxx>
> > > Signed-off-by: Pin-yen Lin <treapking@xxxxxxxxxxxx>
> > > ---
> > >
> > > drivers/net/wireless/marvell/mwifiex/sta_ioctl.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > I believe Brian Norris is still anointed as the personally nominally
> > in charge of mwifiex upstream (get_maintainer labels him as "odd"
> > fixer, which seems awfully judgemental), so he should be CCed on
> > fixes. Added him.
>
> I tend to see mwifiex patches even if I don't get CC'd, but thanks. I
> wonder why get_maintainer(?) picked up Francesco properly but not me.
> *shrug*
>
> > > diff --git a/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c b/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c
> > > index e06a0622973e..f79589cafe57 100644
> > > --- a/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c
> > > +++ b/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c
> > > @@ -545,7 +545,7 @@ int mwifiex_enable_hs(struct mwifiex_adapter *adapter)
> > >
> > > if (wait_event_interruptible_timeout(adapter->hs_activate_wait_q,
> > > adapter->hs_activate_wait_q_woken,
> > > - (10 * HZ)) <= 0) {
> > > + (5 * HZ)) <= 0) {
> >
> > Given that I suggested this fix, it should be no surprise:
> >
> > Reviewed-by: Douglas Anderson <dianders@xxxxxxxxxxxx>
> >
> > Upon sleeping on the idea, the only slight concern I have here is
> > whether we'll trigger this timeout if we try to suspend while WiFi
> > scanning is in progress. I don't have any actual evidence supporting
> > this concern, but I remember many years ago when I used to deal with
> > the WiFi drivers more often there were cases where suspend could be
> > delayed if it happened while a scan was in progress. Maybe/hopefully
> > that's fixed now, but I figured I'd at least bring it up in case it
> > tickled anything in someone's mind...
>
> Scans should essentially have been canceled by now, but IIUC, the
> driver doesn't really force the card to stop if it's in the middle of
> a scan (I'm not 100% sure if it's possible). So it's possible that
> pending scans could delay this step.
>
> I wonder what the distribution of (successful) timing is today. I'd
> assume this typically take on the order of milliseconds, but do we
> commonly see outliers? What if a system is fully loaded while
> suspending?

I would hope this doesn't affect things from the DPM watchdog, but I
haven't researched. Hopefully the DPM watchdog starts after userspace
is frozen so the system being fully loaded shouldn't matter?

Things I could believe mattering are things like:

* If memory is full and something in the suspend patch allocates a big
chunk of memory then maybe (??) that could slow things down?

* If lots of USB devices are connected that could slow things down.

* If there are a large number of WiFi networks or APs or Bluetooth
devices I could believe that could cause suspend problems.


> Can you try testing (and gather timing numbers) when
> suspending soon after initiating scans? It's hard to judge what the
> lower limit of this timeout should really be without any numbers, just
> like it's hard to judge whether your 10 second watchdog is reasonable.

Pin-yen: is this something you could gather?


> Also, for the record, since we might have to field regression reports
> for other systems: what hardware (chip variant, FW version) are you
> seeing problems on?

Pin-yen: I'm assuming you'll provide this.

-Doug