Re: [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id()
From: Matt Fleming
Date: Fri Apr 17 2026 - 18:23:14 EST

On Wed, Apr 15, 2026 at 07:16:53AM -0500, Corey Minyard wrote:
>
> The lower level driver should never not return an answer, it is supposed
> to guarantee that it returns an error if the BMC doesn't respond.
>
> So the bug is not here, the bug is elsewhere. My guess is that there
> is some new failure mode where a BMC is not working but it responds well
> enough that it sort of works and fools the driver. But that's only a
> guess.

I can now reproduce this pretty reliably by running concurrent
ipmitool commands (sensor/sel/mc info) + sysfs readers + periodic
ipmitool mc reset cold. The interface wedges within a few minutes.

My working theory is that handle_flags() in ipmi_si_intf.c can loop on
flag-driven commands (e.g. READ_EVENT_MSG_BUFFER) without ever calling
start_next_msg(), starving waiting_msg indefinitely.

Captured state at wedge:
  si_state=SI_GETTING_EVENTS msg_flags=0x02
  si_curr cycling cmd=0x35 (READ_EVENT_MSG_BUFFER)
  si_wait frozen cmd=0x08 (GET_DEVICE_GUID, never promoted)

The cold reset makes the BMC report EVENT_MSG_BUFFER_FULL during
re-init, which drives the flag loop.
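
To make the theory concrete, here is a toy userspace model of the
suspected starvation. It is only a sketch and not the ipmi_si code: the
toy_* names, the bmc_clears_flag knob and the simplified step ordering
are all assumptions made for illustration. The only values taken from
the capture are the flag bit (0x02) and the two command numbers (0x35
and 0x08).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; loosely inspired by ipmi_si_intf.c, not copied from it. */
#define EVENT_MSG_BUFFER_FULL 0x02
#define CMD_READ_EVENT_MSG_BUFFER 0x35
#define CMD_GET_DEVICE_GUID 0x08

enum toy_state { TOY_NORMAL, TOY_GETTING_EVENTS };

struct toy_smi {
	enum toy_state state;
	unsigned char msg_flags;  /* flags last reported by the BMC */
	int curr_cmd;             /* command in flight, -1 if none */
	int waiting_cmd;          /* queued command, -1 if none */
	bool bmc_clears_flag;     /* hypothetical: does a read drain the buffer? */
	unsigned long promoted;   /* times the queued command was started */
};

/* Hypothetical stand-in for start_next_msg(): promote the queued command. */
static void toy_start_next_msg(struct toy_smi *s)
{
	if (s->waiting_cmd >= 0) {
		s->curr_cmd = s->waiting_cmd;
		s->waiting_cmd = -1;
		s->promoted++;
	}
}

/*
 * Hypothetical stand-in for handle_flags(): a pending flag always wins.
 * Returns true if a flag-driven command was issued.
 */
static bool toy_handle_flags(struct toy_smi *s)
{
	if (s->msg_flags & EVENT_MSG_BUFFER_FULL) {
		s->curr_cmd = CMD_READ_EVENT_MSG_BUFFER;
		s->state = TOY_GETTING_EVENTS;
		return true;
	}
	s->state = TOY_NORMAL;
	return false;
}

/* One completion: the in-flight command finishes, then flags are rechecked. */
static void toy_step(struct toy_smi *s)
{
	if (s->curr_cmd == CMD_READ_EVENT_MSG_BUFFER && s->bmc_clears_flag)
		s->msg_flags &= (unsigned char)~EVENT_MSG_BUFFER_FULL;
	s->curr_cmd = -1;

	/* The queued command is only considered when no flag is pending. */
	if (!toy_handle_flags(s))
		toy_start_next_msg(s);
}

/* Run the model for `steps` completions; true if the queued cmd ever ran. */
static bool toy_run(bool bmc_clears_flag, int steps)
{
	struct toy_smi s = {
		.state = TOY_NORMAL,
		.msg_flags = EVENT_MSG_BUFFER_FULL,
		.curr_cmd = -1,
		.waiting_cmd = CMD_GET_DEVICE_GUID,
		.bmc_clears_flag = bmc_clears_flag,
	};

	assert(steps >= 0);
	for (int i = 0; i < steps; i++)
		toy_step(&s);
	return s.promoted > 0;
}
```

In this model toy_run(true, 2) returns true: the first read drains the
buffer and the queued command is promoted on the next completion. With
toy_run(false, n) the flag never clears, so the model keeps reissuing
0x35 and returns false for any n, which is the shape of the captured
state above.
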
Thanks,
Matt