Re: [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id()

From: Tony Camuso

Date: Wed Apr 15 2026 - 11:49:28 EST


On Wed, Apr 15, 2026 at 12:59:30PM +0100, Matt Fleming wrote:
> From: Matt Fleming <mfl...@cl...>
>
> When the BMC does not respond to a "Get Device ID" command, the
> wait_event() in __get_device_id() blocks forever in
> TASK_UNINTERRUPTIBLE while holding bmc->dyn_mutex. Every subsequent
> sysfs reader then piles up in D state. Replace with
> wait_event_timeout() to return -EIO after 1 second.

On Wed, Apr 15, 2026 at 12:17:04PM, Corey Minyard wrote:
> This is the second report I have of something like this. So
> something is up. I'm adding Tony, who reported something like this
> dealing with the watchdog.
>
> The lower level driver should never not return an answer, it is
> supposed to guarantee that it returns an error if the BMC doesn't
> respond. So the bug is not here, the bug is elsewhere.

I've been tracking a related issue (RHEL customer case) where a BMC
reset while the IPMI watchdog is active causes D-state hangs. This
appears to be the same root cause Matt is hitting.

I backported the recent upstream KCS/SI fixes to a RHEL 9 test kernel
(54 patches bringing it to mainline parity) and tested today on a
Dell R640.

Test: Trigger `ipmitool mc reset cold` while the watchdog daemon is
running.

Results with backported fixes:

[ 245.376402] IPMI Watchdog: heartbeat completion received
[ 246.376392] IPMI Watchdog: heartbeat send failure: -16
[ 247.377560] IPMI Watchdog: heartbeat send failure: -16
...
[ 252.413240] IPMI Watchdog: set timeout error: -16

The watchdog daemon received error 16 (EBUSY) and eventually
initiated orderly shutdown:

write watchdog device gave error 16 = 'Device or resource busy'
shutting down the system because of error 16

Key finding: With the upstream fixes, the driver returns -EBUSY
instead of blocking forever. No D-state hang. The watchdog daemon
handles the error and initiates orderly reboot.

Note: There was still a delay of several minutes before the daemon
timed out and triggered shutdown. The driver returned errors
promptly, but the watchdog daemon's retry logic (error retry
time-out = 120 seconds) extended the overall recovery time. This
may warrant a separate look at whether the daemon's retry behavior
is appropriate when the BMC is completely unresponsive.

This confirms Corey's assessment - the bug is in the lower-level
driver not returning errors, not in __get_device_id(). Matt's
timeout patch would be a defensive fallback, but the real fix is
ensuring KCS/SI properly returns errors when the BMC is
unresponsive.

Tony