Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable

From: YangYang
Date: Mon Dec 01 2025 - 04:47:23 EST

Next message: Geert Uytterhoeven: "Re: m68k-linux-ld: drivers/net/phy/air_en8811h.o:(.debug_addr+0x38): undefined reference to `clk_save_context'"
Previous message: Thomas Bogendoerfer: "Re: [PATCH] mips: kvm: simplify kvm_mips_deliver_interrupts()"
Next in thread: YangYang: "Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2025/11/27 20:34, Rafael J. Wysocki wrote:

On Wed, Nov 26, 2025 at 11:47 PM Bart Van Assche <bvanassche@xxxxxxx> wrote:

On 11/26/25 1:30 PM, Rafael J. Wysocki wrote:

On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@xxxxxxx> wrote:

On 11/26/25 12:17 PM, Rafael J. Wysocki wrote:

--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue
 		if (flags & BLK_MQ_REQ_NOWAIT)
 			return -EAGAIN;
 
+		/* if necessary, resume .dev (assume success). */
+		blk_pm_resume_queue(pm, q);
 		/*
 		 * read pair of barrier in blk_freeze_queue_start(), we need to
 		 * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and

blk_queue_enter() may be called from the suspend path so I don't think
that the above change will work.

Why would the existing code work then?

The existing code works reliably on a very large number of devices.

Well, except that it doesn't work during system suspend and
hibernation when the PM workqueue is frozen. I think that we agree
here.

This needs to be addressed because it may very well cause system
suspend to deadlock.

There are two possible ways to address it I can think of:

1. Changing blk_pm_resume_queue() and its users to carry out a
synchronous resume of q->dev instead of calling pm_request_resume()
and (effectively) waiting for the queued-up runtime resume of q->dev
to take effect.

This would be my preferred option, but at this point I'm not sure if
it's viable.

After __pm_runtime_disable() is called from device_suspend_late(),
dev->power.disable_depth is nonzero, which prevents rpm_resume() from
making progress until system resume completes, regardless of whether
rpm_resume() is invoked synchronously or asynchronously.

Performing a synchronous resume of q->dev seems to have much the same
effect as removing the following code block from __pm_runtime_barrier(),
which is invoked by __pm_runtime_disable():

	if (dev->power.request_pending) {
		dev->power.request = RPM_REQ_NONE;
		spin_unlock_irq(&dev->power.lock);

		cancel_work_sync(&dev->power.work);

		spin_lock_irq(&dev->power.lock);
		dev->power.request_pending = false;
	}
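
To illustrate the point, here is a minimal user-space sketch in C. It is
a toy model and not kernel code: struct toy_dev and the toy_*() helpers
are hypothetical names that only mimic the disable_depth check and the
cancelling of a pending request. Locking, usage counts and the rest of
the runtime PM state machine are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-ins for a few dev->power fields; not the kernel's struct. */
enum toy_rpm_status { TOY_RPM_ACTIVE, TOY_RPM_SUSPENDED };

struct toy_dev {
	unsigned int disable_depth;	/* > 0: runtime PM is disabled */
	bool request_pending;		/* an async resume is queued */
	enum toy_rpm_status status;
};

/* Models the resume path: it refuses to run while runtime PM is disabled. */
static inline int toy_rpm_resume(struct toy_dev *dev)
{
	assert(dev);
	if (dev->disable_depth > 0)
		return -EACCES;
	dev->status = TOY_RPM_ACTIVE;
	return 0;
}

/* Models an async request: it only queues work, nothing is resumed yet. */
static inline int toy_request_resume(struct toy_dev *dev)
{
	assert(dev);
	if (dev->disable_depth > 0)
		return -EACCES;
	dev->request_pending = true;
	return 0;
}

/* Models the queued work item that would carry out the resume later. */
static inline int toy_pm_work(struct toy_dev *dev)
{
	assert(dev);
	if (!dev->request_pending)
		return 0;	/* nothing queued, nothing to do */
	dev->request_pending = false;
	return toy_rpm_resume(dev);
}

/* Models disabling runtime PM: bump the depth, then cancel queued work. */
static inline void toy_runtime_disable(struct toy_dev *dev)
{
	assert(dev);
	dev->disable_depth++;
	dev->request_pending = false;
}
```

In this model, a resume that was queued before toy_runtime_disable() is
silently dropped, and any later attempt, whether through toy_rpm_resume()
or toy_request_resume(), fails with -EACCES. Either way the device stays
suspended, so a waiter that depends on the resume never makes progress.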

2. Stop freezing the PM workqueue before system suspend/hibernation
and adapt device_suspend_late() to that.

This should be doable, even though it is a bit risky because it may
uncover some latent bugs (the freezing of the PM workqueue has been
there forever), but it wouldn't address the problem entirely because
device_suspend_late() would still need to disable runtime PM for the
device (and for some devices it is disabled earlier), so
pm_request_resume() would just start to fail at that point and if
blk_queue_enter() were called after that point for a device supporting
runtime PM, it might deadlock.

Maybe there is a misunderstanding? RQF_PM / BLK_MQ_REQ_PM are set for
requests that should be processed even if the power status is changing
(RPM_SUSPENDING or RPM_RESUMING). The meaning of the 'pm' variable is
as follows: process this request even if a power state change is
ongoing.

I see.

The behavior depends on whether or not q->pm_only is set. If it is
not set, both blk_queue_enter() and __bio_queue_enter() will allow the
request to be processed.

If q->pm_only is set, __bio_queue_enter() will wait until it gets
cleared and in that case pm_request_resume(q->dev) is called to make
that happen (did I get it right?). This is a bit fragile because what
if the async resume of q->dev fails for some reason? You deadlock
instead of failing the request.

Unlike __bio_queue_enter(), blk_queue_enter() additionally checks the
runtime PM status of the queue if q->pm_only is set and it will allow
the request to be processed in that case so long as q->rpm_status is
not RPM_SUSPENDED. However, if the queue status is RPM_SUSPENDED,
pm_request_resume(q->dev) will be called like in the
__bio_queue_enter() case.
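
The gating described above can be sketched as a small user-space model
in C. This is only an illustration of the description in this thread:
struct toy_queue and toy_pm_resume_queue() are hypothetical names, and
the counter stands in for calls to pm_request_resume(q->dev).

```c
#include <assert.h>
#include <stdbool.h>

enum toy_rpm_status {
	TOY_RPM_ACTIVE,
	TOY_RPM_RESUMING,
	TOY_RPM_SUSPENDED,
	TOY_RPM_SUSPENDING,
};

/* Toy stand-in for the request_queue fields discussed here. */
struct toy_queue {
	bool has_dev;			/* models q->dev being set */
	bool pm_only;			/* models q->pm_only being set */
	enum toy_rpm_status rpm_status;	/* models q->rpm_status */
	unsigned int resume_requests;	/* modelled pm_request_resume() calls */
};

/*
 * Returns true if the request may proceed. Otherwise an async resume is
 * requested (counted here) and the caller is expected to keep waiting.
 */
static inline bool toy_pm_resume_queue(bool pm, struct toy_queue *q)
{
	assert(q);
	if (!q->has_dev || !q->pm_only)
		return true;	/* no PM gating in effect */
	if (pm && q->rpm_status != TOY_RPM_SUSPENDED)
		return true;	/* PM request during a state change */
	q->resume_requests++;	/* ask for a resume, do not wait for it */
	return false;
}
```

Here pm == false corresponds to the __bio_queue_enter() case and to
ordinary requests in blk_queue_enter(), while pm == true corresponds to
BLK_MQ_REQ_PM requests, which are let through unless the queue status is
RPM_SUSPENDED.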

I'm not sure why pm_request_resume(q->dev) needs to be called from
within blk_pm_resume_queue(). Arguably, it should be sufficient to
call it once before using the wait_event() macro, if the conditions
checked by blk_pm_resume_queue() are not met.
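
A rough user-space sketch in C of that idea: the check is kept free of
side effects, the resume is requested once, and the wait is bounded so
that a resume that never completes yields an error instead of a hang.
All names here are hypothetical, and the bounded polling loop and the
-ETIMEDOUT return value are assumptions made for the model, not a
proposal for what wait_event() should do in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_queue {
	bool pm_only;			/* models q->pm_only being set */
	bool suspended;			/* models RPM_SUSPENDED */
	unsigned int resume_requests;	/* modelled pm_request_resume() calls */
	unsigned int polls_until_resumed; /* 0: the resume never completes */
};

/* Side-effect-free check, suitable as a wait condition. */
static inline bool toy_queue_ready(bool pm, const struct toy_queue *q)
{
	if (!q->pm_only)
		return true;
	return pm && !q->suspended;
}

/* Stand-in for time passing while the waiter sleeps. */
static inline void toy_tick(struct toy_queue *q)
{
	if (q->polls_until_resumed && --q->polls_until_resumed == 0) {
		q->suspended = false;
		q->pm_only = false;
	}
}

/* Request the resume once, then wait for the condition a bounded time. */
static inline int toy_queue_enter(bool pm, struct toy_queue *q,
				  unsigned int max_polls)
{
	assert(q);
	if (toy_queue_ready(pm, q))
		return 0;

	q->resume_requests++;	/* single resume request before waiting */

	for (unsigned int i = 0; i < max_polls; i++) {
		toy_tick(q);
		if (toy_queue_ready(pm, q))
			return 0;
	}
	return -ETIMEDOUT;	/* fail the request instead of hanging */
}
```

With polls_until_resumed set to 0 the model never leaves the suspended
state, which corresponds to the case where the async resume fails or is
cancelled; the caller then gets an error after max_polls checks and only
one resume request has been issued.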

Are you suggesting that q->rpm_status should still be checked before
calling pm_runtime_resume() or do you mean something else?

The purpose of the code changes from a previous email is not entirely
clear to me so I'm not sure what the code should look like. But to
answer your question, calling blk_pm_resume_queue() if the runtime
status is RPM_SUSPENDED should be safe.

As an example, the UFS driver submits a
SCSI START STOP UNIT command from its runtime suspend callback. The call
chain is as follows:

ufshcd_wl_runtime_suspend()
  __ufshcd_wl_suspend()
    ufshcd_set_dev_pwr_mode()
      ufshcd_execute_start_stop()
        scsi_execute_cmd()
          scsi_alloc_request()
            blk_queue_enter()
          blk_execute_rq()
          blk_mq_free_request()
            blk_queue_exit()

In any case, calling pm_request_resume() from blk_pm_resume_queue() in
the !pm case is a mistake.

Hmm ... we may disagree about this. Does what I wrote above make clear
why blk_pm_resume_queue() is called if pm == false?

Yes, it does, thanks!
