Re: [PATCH v1] PM: sleep: Restore asynchronous device resume optimization

From: Rafael J. Wysocki
Date: Wed Feb 07 2024 - 06:26:12 EST


On Wed, Feb 7, 2024 at 12:16 PM Marek Szyprowski
<m.szyprowski@xxxxxxxxxxx> wrote:
>
> On 07.02.2024 11:38, Rafael J. Wysocki wrote:
> > On Wed, Feb 7, 2024 at 11:31 AM Marek Szyprowski
> > <m.szyprowski@xxxxxxxxxxx> wrote:
> >> On 09.01.2024 17:59, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki<rafael.j.wysocki@xxxxxxxxx>
> >>>
> >>> Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core
> >>> system-wide PM code"), the resume of devices that were allowed to resume
> >>> asynchronously was scheduled before starting the resume of the other
> >>> devices, so the former did not have to wait for the latter unless
> >>> functional dependencies were present.
> >>>
> >>> Commit 7839d0078e0d removed that optimization in order to address a
> >>> correctness issue, but it can be restored with the help of a new device
> >>> power management flag, so do that now.
> >>>
> >>> Signed-off-by: Rafael J. Wysocki<rafael.j.wysocki@xxxxxxxxx>
> >>> ---
> >> This patch finally landed in linux-next some time ago as 3e999770ac1c
> >> ("PM: sleep: Restore asynchronous device resume optimization"). Recently
> >> I found that it causes a non-trivial interaction with commit
> >> 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement
> >> for unbound workqueues"). Since merge commit 954350a5f8db in linux-next
> >> system suspend/resume fails (board doesn't wake up) on my old Samsung
> >> Exynos4412-based Odroid-U3 board (ARM 32bit based), which was rock
> >> stable for last years.
> >>
> >> My further investigations confirmed that the mentioned commits are
> >> responsible for this issue. Each of them separately (3e999770ac1c and
> >> 5797b1c18919) doesn't trigger any problems. Reverting any of them on top
> >> of linux-next (with some additional commit due to code dependencies)
> >> also fixes/hides the problem.
> >>
> >> Let me know if You need more information or tests on the hardware. I'm
> >> open to help debugging this issue.
> > If you echo 0 to /sys/power/pm_async before suspending the system,
> > does it still fail?
>
> In such case it works fine.

Thanks for the confirmation.

With pm_async disabled, device resume is fully synchronous and doesn't
rely on unbound workqueues, so that's expected.

Now, I think that there are two possibilities.

One is that commit 3e999770ac1c is generally overoptimistic for your
board and there is a dependency between devices which is not
represented by a device link, and it causes things to go south when
they are not done in a specific order. If that is the case and commit
5797b1c18919 changes that order, breakage ensues.

The other one is that what happens during async resume does not meet
the assumptions of commit 5797b1c18919 (for example, it can easily
produce a chain of interdependent work items longer than 8) and so it
breaks things.

I would still try using a non-unbound workqueue for the async resume
work items, because if that works reliably, the second possibility
becomes the more likely one.