Re: [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id()

From: Frederick Lawler

Date: Wed Apr 15 2026 - 17:22:59 EST


Hi Corey & Tony,

On Wed, Apr 15, 2026 at 11:46:27AM -0400, 'Tony Camuso' via kernel-team wrote:
> On Wed, Apr 15, 2026 at 12:59:30PM +0100, Matt Fleming wrote:
> > From: Matt Fleming <mfl...@cl...>
> >
> > When the BMC does not respond to a "Get Device ID" command, the
> > wait_event() in __get_device_id() blocks forever in
> > TASK_UNINTERRUPTIBLE while holding bmc->dyn_mutex. Every subsequent
> > sysfs reader then piles up in D state. Replace with
> > wait_event_timeout() to return -EIO after 1 second.
>
> On Wed, Apr 15, 2026 at 12:17:04PM, Corey Minyard wrote:
> > This is the second report I have of something like this. So
> > something is up. I'm adding Tony, who reported something like this
> > dealing with the watchdog.
> >
> > The lower level driver should never not return an answer, it is
> > supposed to guarantee that it returns an error if the BMC doesn't
> > respond. So the bug is not here, the bug is elsewhere.
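
For anyone following along, the change Matt describes is bounding an
otherwise unbounded wait. Below is a minimal userspace sketch of that
pattern, assuming pthreads in place of the kernel's wait_event_timeout().
Every name in it (fake_bmc, fake_bmc_wait_response, and so on) is
hypothetical and only illustrates the idea; it is not the code from the
patch, and it does not address Corey's point that the lower-level driver
should always hand back an error.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for per-BMC state; NOT the kernel's struct. */
struct fake_bmc {
	pthread_mutex_t lock;
	pthread_cond_t  response_cv;
	bool            response_ready;
};

static void fake_bmc_init(struct fake_bmc *bmc)
{
	pthread_mutex_init(&bmc->lock, NULL);
	pthread_cond_init(&bmc->response_cv, NULL);
	bmc->response_ready = false;
}

/* Called by whoever delivers the reply (the lower-level driver in
 * this analogy). */
static void fake_bmc_deliver_response(struct fake_bmc *bmc)
{
	pthread_mutex_lock(&bmc->lock);
	bmc->response_ready = true;
	pthread_cond_broadcast(&bmc->response_cv);
	pthread_mutex_unlock(&bmc->lock);
}

/* Wait at most timeout_ms for a reply.
 * Returns 0 if the reply arrived, -EIO if it did not. */
static int fake_bmc_wait_response(struct fake_bmc *bmc, long timeout_ms)
{
	struct timespec deadline;
	bool ready;
	int rc = 0;

	assert(timeout_ms >= 0);

	/* pthread_cond_timedwait() takes an absolute CLOCK_REALTIME
	 * deadline by default, so convert the relative timeout. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec  += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&bmc->lock);
	/* Loop to tolerate spurious wakeups; stop on timeout or error. */
	while (!bmc->response_ready && rc == 0)
		rc = pthread_cond_timedwait(&bmc->response_cv, &bmc->lock,
					    &deadline);
	ready = bmc->response_ready;
	pthread_mutex_unlock(&bmc->lock);

	return ready ? 0 : -EIO;
}
```

With no reply delivered, fake_bmc_wait_response(&bmc, 1000) gives up
after roughly a second and returns -EIO, so the caller can release any
lock it holds instead of blocking everyone queued behind it. In the
kernel, wait_event_timeout() plays the same role: it returns 0 when the
timeout elapses with the condition still false, which the caller would
then have to map to an error such as -EIO.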

This is a bit of a throwback to our previous discussions around [1].

I did end up applying [2] based on that discussion, and had limited
success, but we still have external resets that cause us to enter
this undesirable state :(

[1]: https://lore.kernel.org/all/aJUMlAG17c6lCgFq@xxxxxxxxxxxxxxxx/
[2]: https://lore.kernel.org/all/20250807230648.1112569-2-corey@xxxxxxxxxxx/
>
> I've been tracking a related issue (RHEL customer case) where BMC
> reset while the IPMI watchdog is active causes D-state hangs. This
> appears to be the same root cause Matt is hitting.
>
> I backported the recent upstream KCS/SI fixes to a RHEL 9 test kernel
> (54 patches bringing it to mainline parity) and tested today on a
> Dell R640.

I assume you mean this patch series: "ipmi:watchdog: Fix panic, D-state
hang, and lost protection on BMC reset" [3]?

[3]: https://lore.kernel.org/all/20260407175134.3367345-1-tcamuso@xxxxxxxxxx/