Re: [RFC/RFT PATCH] PM: sleep: Ignore device driver suspend() callback return values

From: Saravana Kannan
Date: Thu Dec 05 2024 - 15:58:42 EST


On Thu, Dec 5, 2024 at 9:57 AM Florian Fainelli
<florian.fainelli@xxxxxxxxxxxx> wrote:
>
> On 12/5/24 09:36, Len Brown wrote:
> > On Thu, Dec 5, 2024 at 10:33 AM Ulf Hansson <ulf.hansson@xxxxxxxxxx> wrote:
> >
> >> ...I also think this looks a bit risky as the current behaviour
> >> has really been there for a long time. Who knows what depends on this.
> >
> > If everything were working 100% of the time, no risk would be justified
> > because no improvement is possible.
> > > But we run over 1,000,000 suspend resume cycles per release in our lab,
> > and this issue as a category, is the single most common failure.
>
> But you are starting to enter the big number category here, eventually
> something is going to fail with that many iterations.
>
> How was this 1 million iterations determined to be a good pass/fail
> criteria and just not an arbitrarily high number intended to shake off
> issues? Surely with such a big number you start getting an idea of which
> specific drivers within your test devices tend to fail to suspend?
>
> FWIW, with the products I work with, which are mainly set-top-box
> devices, we just set a pass/fail criteria at 100k which is essentially
> assuming there will be 27 suspend/resume cycles per day for the next 10
> years, given the lifespan of the products, that seemed way overboard,
> realistically there is going to be more like 2-3 suspend/resume cycles
> per day.
>
> >
> > Worse, there is a huge population of drivers, and we can't possibly test
> > them all into correctness. Every release this issue crops when another
> > driver hiccups in response to some device specific transient issue.
> >
> > The current implementation is not a viable design.
>
> Neither is this approach because it assumes that drivers that need to
> abort the system suspend call pm_system_wakeup(), which most do not,

Agree completely with Ulf and Florian. This patch series is way too
optimistic in thinking that all drivers are/should be calling
pm_system_wakeup().

I'd argue it makes more sense and is intuitive to return an error if a
device suspend fails and have the framework act on it immediately than
setting some global flag and then looking at a later point in the
framework.

Based on the rest of the emails in this thread, this seems more like a
driver problem than a framework issue. If you see a driver that's
returning an error when it shouldn't, fix it.

If you are testing this only on some specific set of hardware to claim
it works without problems, then you can also just go fix the drivers
for that specific set of hardware and call it a day. If that's "too
many drivers to fix", then the amount of untested drivers surely is
orders of magnitude more than that.

At best this should be a driver specific flag (why not just fix the
driver in that case) or a commandline arg that's default disabled.
Then whoever wants to use this sledgehammer for their hardware can do
so without affecting others.

-Saravana

> they return -EBUSY or something like that. There is a total of 12 or so
> drivers calling pm_system_wakeup(), that's not the majority.
>
> How about you flipped the logic around, introduce an option that lets
> you ignore the suspend callback return value gated by a Kconfig option?
> --
> Florian
>