Re: [PATCH v1] PM: sleep: Restore asynchronous device resume optimization

From: Rafael J. Wysocki
Date: Wed Feb 07 2024 - 06:00:12 EST


On Wed, Feb 7, 2024 at 11:31 AM Marek Szyprowski
<m.szyprowski@xxxxxxxxxxx> wrote:
>
> Dear All,
>
> On 09.01.2024 17:59, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> >
> > Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core
> > system-wide PM code"), the resume of devices that were allowed to resume
> > asynchronously was scheduled before starting the resume of the other
> > devices, so the former did not have to wait for the latter unless
> > functional dependencies were present.
> >
> > Commit 7839d0078e0d removed that optimization in order to address a
> > correctness issue, but it can be restored with the help of a new device
> > power management flag, so do that now.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> > ---
>
> This patch finally landed in linux-next some time ago as 3e999770ac1c
> ("PM: sleep: Restore asynchronous device resume optimization"). Recently
> I found that it causes a non-trivial interaction with commit
> 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement
> for unbound workqueues"). Since merge commit 954350a5f8db in linux-next
> system suspend/resume fails (board doesn't wake up) on my old Samsung
> Exynos4412-based Odroid-U3 board (ARM 32bit based), which was rock
> stable for last years.
>
> My further investigations confirmed that the mentioned commits are
> responsible for this issue. Each of them separately (3e999770ac1c and
> 5797b1c18919) doesn't trigger any problems. Reverting any of them on top
> of linux-next (with some additional commit due to code dependencies)
> also fixes/hides the problem.
>
> Let me know if You need more information or tests on the hardware. I'm
> open to help debugging this issue.

So I found this report:

https://lore.kernel.org/lkml/b3d08cd8-d77f-45dd-a2c3-4a4db5a98dfa@xxxxxxxxxx/

which appears to be about the same issue.

Strictly speaking, the regression is introduced by 5797b1c18919,
because it is not a mainline commit yet, but the information regarding
the interaction of it with commit 3e999770ac1c is valuable.

Essentially, what commit 3e999770ac1c does is to schedule the
execution of all of the async resume callbacks in a given phase
upfront, so they can run without waiting for the sync ones to complete
(except when there is a parent-child or supplier-consumer dependency -
the latter represented by a device link).

What seems to be happening after commit 5797b1c18919 is that one (or
more) of the async callbacks get blocked in the workqueue for some
reason.

I would try to replace system_unbound_wq in
__async_schedule_node_domain() with a dedicated workqueue that is not
unbound and see what happens.