Re: [PATCH] wifi: mwifiex: decrease timeout waiting for host sleep from 10s to 5s

From: Brian Norris
Date: Tue Dec 03 2024 - 21:04:52 EST


Hi Doug,

On Tue, Dec 03, 2024 at 05:38:55PM -0800, Doug Anderson wrote:
> On Tue, Dec 3, 2024 at 3:05 PM Brian Norris <briannorris@xxxxxxxxxxxx> wrote:
> > The Kconfig default is 120 seconds, and it's hidden under
> > CONFIG_EXPERT. What makes you think 10 is a good value? (It sounds
> > pretty small for triggering panics.) The smallest I can see outside of
> > ChromeOS is 12 seconds, although 60 seconds is much more common. I
> > also happen to see other WiFi drivers have hit similar problems, but
> > they settled on 20 seconds (assuming a 60s timeout on other distros):
> > https://lore.kernel.org/linux-wireless/20230329162038.8637-1-kvalo@xxxxxxxxxx/
> > https://git.kernel.org/linus/cf5fa3ca0552f1b7ba8490de40700bbfb6979b17
> >
> > Technically, this Kconfig lets you set a value as small as 1 second. I
> > don't think we should work at reducing all timeouts to fit that
> > target.
> >
> > I could maybe approve lowering this one, but I'd also recommend
> > reconsidering your timeout value. And more questions below.
>
> That's fair. I guess having a 10 second timeout for full system
> suspend didn't seem totally crazy to me. If a system is taking more
> than 10 seconds to do a full system suspend then that seems like
> something is pretty broken. I guess it's somewhat like the same
> argument that the WiFi driver had for picking 10 seconds but applied
> to the whole system level, and I guess that's where we get into
> trouble. That made me think that even 5 seconds seems a bit long for
> any given driver to suspend. ...but yeah, it's squishy.

10 seconds is likely that *something* is wrong (or at least suboptimal),
but IMO, it's not quite at unreasonable levels. But yes, my point was
mainly that it's squishy, and I personally wouldn't want to be the one
running with the lowest CONFIG_DPM_WATCHDOG_TIMEOUT out there, given the
known behavior of multiple drivers and the timeout-means-panic behavior.

> Maybe the ChromeOS should change to 15 seconds for the DPM Watchdog
> timer and that's a better solution and leave the WiFi driver how it
> is?

That seems reasonable.

To be clear, I'm OK with this patch, if we can get a little more
confidence in it (like the timing data and HW info). I *think* 5 vs 10
isn't a big deal here, but I don't have much other than my guess at the
moment.

> Another thought: I wonder if it's possible to be dynamic and somehow
> set the timeout as "max(10, dpm_watchdog_timeout / 2)". Not that I've

You probably meant min()?

> checked to see if the mwifiex can actually query the DPM watchdog
> timeout... ;-)

Yeah, I wondered similarly -- or in reverse, if we could somehow "pat"
the watchdog or prime it with an expected timeout. But AFAICT, neither
such feature exists today.

> ...also, it sure seems like if we're going to choose a value so low
> that we shouldn't panic. All of our other watchdogs that panic aren't
> so short, so probably this one shouldn't be either. Maybe we could
> submit a patch to make the DPM watchdog just abort the suspend if
> that's not too hard (and if the power people accept it).

Yeah, if you made the watchdog just interrupt suspend and dump some
warnings, then the effect would be pretty similar to this patch.

> > I wonder what the distribution of (successful) timing is today. I'd
> > assume this typically take on the order of milliseconds, but do we
> > commonly see outliers? What if a system is fully loaded while
> > suspending?
>
> I would hope this doesn't affect things from the DPM watchdog, but I
> haven't researched. Hopefully the DPM watchdog starts after userspace
> is frozen so the system being fully loaded shouldn't matter?

I was just throwing out ideas, but I didn't specifically mean user
space. You provided a few more ideas. Anyway, I was just fishing for
*some* attempt at real-world (and, as-bad-as-you-can-simulate world)
numbers, since that's the point of a timeout like this.

> > Can you try testing (and gather timing numbers) when
> > suspending soon after initiating scans? It's hard to judge what the
> > lower limit of this timeout should really be without any numbers, just
> > like it's hard to judge whether your 10 second watchdog is reasonable.
>
> Pin-yen: is this something you could gather?
>
>
> > Also, for the record, since we might have to field regression reports
> > for other systems: what hardware (chip variant, FW version) are you
> > seeing problems on?
>
> Pin-yen: I'm assuming you'll provide this.

I'll leave it up to y'all (Doug and Pin-Yen) whether you want to provide
the above to provide a little more confidence, or if you want to
reconsider your use of CONFIG_DPM_WATCHDOG_TIMEOUT.

Brian