Re:[PATCH] drm/ast: Add timeouts to AHB/SCU polling loops to prevent soft lockups
From: w15303746062
Date: Thu Jun 04 2026 - 23:23:50 EST
Hi Thomas and all,
A gentle ping on this patch.
It has been about three weeks since submission.
This patch fixes a critical soft lockup issue we observed when the ASPEED hardware becomes unresponsive.
Since it uses a safe timeout mechanism without altering the existing API signatures, it should be a straightforward and low-risk mitigation.
Could anyone please help review this when you have a moment, or let me know if any modifications are needed?
Best regards,
Mingyu
At 2026-05-13 19:39:49, w15303746062@xxxxxxx wrote:
>From: Mingyu Wang <25181214217@xxxxxxxxxxxxxxxxx>
>
>While validating the driver using DevGen (a framework that synthesizes
>virtual device models directly from driver source code via LLM guidance),
>a severe soft lockup was observed.
>
>The hardware polling loops in `__ast_mindwm`, `__ast_moutdwm`, and
>`ast_2500_patch_ahb` lack a timeout mechanism. On bare-metal systems,
>if the ASPEED chip becomes unresponsive or a PCIe bus fault occurs,
>the CPU will spin indefinitely in these loops. This results in a system
>hang, triggering the watchdog soft lockup and causing subsequent I/O
>starvation (e.g., blocking jbd2).
>
>Fix this by introducing a bounded loop with a safe timeout of
>approximately 100ms using `udelay(10)`. Using `udelay()` ensures
>that the fix remains safe even if these accessors are called from
>an atomic context or while holding spinlocks. If the hardware fails
>to respond, the loop breaks and emits a `WARN_ONCE`, allowing the
>kernel to degrade gracefully and preventing complete system paralysis.
>
>Signed-off-by: Mingyu Wang <25181214217@xxxxxxxxxxxxxxxxx>
>---
>Hi Thomas,
>
>Thanks for the prompt response and confirmation!
>
>Instead of just waiting for threshold suggestions, I have drafted this
>patch to address the soft lockup. Since changing the return types of
>`__ast_mindwm` and `__ast_moutdwm` to propagate error codes (e.g.,
>`-ETIMEDOUT`) would require an intrusive refactoring across the entire
>AST driver, I took a more defensive, minimal-invasive approach.
>
>To avoid the risk of sleeping in an atomic context (in case these
>low-level I/O accessors are ever called under a spinlock), I used
>a bounded loop with `udelay(10)` and a maximum of 10000 iterations
>(approx. 100ms total timeout). If the ASPEED hardware completely
>fails to respond, it breaks the infinite loop, emits a `WARN_ONCE`,
>and prevents the CPU from halting the entire system.
>
>Please let me know if you think the 100ms threshold and the `udelay`
>approach are appropriate for this specific AHB/SCU hardware sequence.
>
> drivers/gpu/drm/ast/ast_2500.c | 8 +++++++-
> drivers/gpu/drm/ast/ast_post.c | 16 ++++++++++++++--
> 2 files changed, 21 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/gpu/drm/ast/ast_2500.c b/drivers/gpu/drm/ast/ast_2500.c
>index 2a52af0ded56..08d18f90201a 100644
>--- a/drivers/gpu/drm/ast/ast_2500.c
>+++ b/drivers/gpu/drm/ast/ast_2500.c
>@@ -107,6 +107,7 @@ static const u32 ast2500_ddr4_1600_timing_table[REGTBL_NUM] = {
> void ast_2500_patch_ahb(void __iomem *regs)
> {
> u32 data;
>+ int retries = 10000; /* ~100ms timeout */
>
> /* Clear bus lock condition */
> __ast_moutdwm(regs, 0x1e600000, 0xAEED1A03);
>@@ -136,7 +137,12 @@ void ast_2500_patch_ahb(void __iomem *regs)
> do {
> __ast_moutdwm(regs, 0x1e6e2000, 0x1688A8A8);
> data = __ast_mindwm(regs, 0x1e6e2000);
>- } while (data != 1);
>+ if (data == 1)
>+ break;
>+ udelay(10);
>+ } while (--retries);
>+
>+ WARN_ONCE(!retries, "ast: timeout waiting for AHB patch\n");
>
> __ast_moutdwm(regs, 0x1e6e207c, 0x08000000); /* clear fast reset */
> }
>diff --git a/drivers/gpu/drm/ast/ast_post.c b/drivers/gpu/drm/ast/ast_post.c
>index b72914dbed38..66eb80925e27 100644
>--- a/drivers/gpu/drm/ast/ast_post.c
>+++ b/drivers/gpu/drm/ast/ast_post.c
>@@ -37,13 +37,19 @@
> u32 __ast_mindwm(void __iomem *regs, u32 r)
> {
> u32 data;
>+ int retries = 10000; /* ~100ms timeout */
>
> __ast_write32(regs, 0xf004, r & 0xffff0000);
> __ast_write32(regs, 0xf000, 0x1);
>
> do {
> data = __ast_read32(regs, 0xf004) & 0xffff0000;
>- } while (data != (r & 0xffff0000));
>+ if (data == (r & 0xffff0000))
>+ break;
>+ udelay(10);
>+ } while (--retries);
>+
>+ WARN_ONCE(!retries, "ast: timeout reading from AHB/SCU\n");
>
> return __ast_read32(regs, 0x10000 + (r & 0x0000ffff));
> }
>@@ -51,13 +57,19 @@ u32 __ast_mindwm(void __iomem *regs, u32 r)
> void __ast_moutdwm(void __iomem *regs, u32 r, u32 v)
> {
> u32 data;
>+ int retries = 10000; /* ~100ms timeout */
>
> __ast_write32(regs, 0xf004, r & 0xffff0000);
> __ast_write32(regs, 0xf000, 0x1);
>
> do {
> data = __ast_read32(regs, 0xf004) & 0xffff0000;
>- } while (data != (r & 0xffff0000));
>+ if (data == (r & 0xffff0000))
>+ break;
>+ udelay(10);
>+ } while (--retries);
>+
>+ WARN_ONCE(!retries, "ast: timeout writing to AHB/SCU\n");
>
> __ast_write32(regs, 0x10000 + (r & 0x0000ffff), v);
> }
>--
>2.34.1