Re: [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id()
From: Corey Minyard
Date: Fri Apr 17 2026 - 19:56:07 EST
On Fri, Apr 17, 2026 at 11:23:03PM +0100, Matt Fleming wrote:
> On Wed, Apr 15, 2026 at 07:16:53AM -0500, Corey Minyard wrote:
> >
> > The lower level driver should never not return an answer, it is supposed
> > to guarantee that it returns an error if the BMC doesn't respond.
> >
> > So the bug is not here, the bug is elsewhere. My guess is that there
> > is some new failure mode where a BMC is not working but it responds well
> > enough that it sort of works and fools the driver. But that's only a
> > guess.
>
> I can now reproduce this pretty reliably by running concurrent
> ipmitool commands (sensor/sel/mc info) + sysfs readers + periodic
> ipmitool mc reset cold. It wedges in a few minutes.

Hmm. If you are sending cold resets, then the driver is going into
reset maintenance mode and it should be rejecting messages for 30
seconds after you send that command.

You can disable that by changing is_maintenance_mode_cmd() in
ipmi_msghandler.c to always return false.
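
To make the effect of that change concrete, here is a rough, untested
toy model of the behaviour described above. It is not the code in
ipmi_msghandler.c: struct toy_intf, toy_submit(), the
toy_ignore_maintenance_cmds knob and the TOY_* names are all made up
for illustration, and only the cold reset command is modelled as a
trigger. The 30 second window is the one mentioned above.

```c
#include <stdbool.h>

#define TOY_NETFN_APP   0x06  /* IPMI App network function */
#define TOY_COLD_RESET  0x02  /* IPMI Cold Reset command */
#define TOY_MAINT_SECS  30    /* rejection window described above */

/* Toy stand-in for the interface state; not the kernel structure. */
struct toy_intf {
	bool in_maintenance;
	long maint_until;	/* seconds, on the caller's clock */
};

/* Debug knob: true mimics making the check always return false. */
static bool toy_ignore_maintenance_cmds;

static bool toy_is_maintenance_mode_cmd(unsigned char netfn,
					unsigned char cmd)
{
	if (toy_ignore_maintenance_cmds)
		return false;
	return netfn == TOY_NETFN_APP && cmd == TOY_COLD_RESET;
}

/* Returns true if the message is accepted, false if it is rejected. */
static bool toy_submit(struct toy_intf *intf, unsigned char netfn,
		       unsigned char cmd, long now)
{
	/* Inside the window every message is turned away. */
	if (intf->in_maintenance && now < intf->maint_until)
		return false;
	intf->in_maintenance = false;

	/* A reset command opens a new window for later messages. */
	if (toy_is_maintenance_mode_cmd(netfn, cmd)) {
		intf->in_maintenance = true;
		intf->maint_until = now + TOY_MAINT_SECS;
	}
	return true;
}
```

In this model a message submitted 10 seconds after a cold reset is
rejected, and the same message is accepted once the knob is set, which
is the effect you want for the reproducer.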
>
> My working theory is handle_flags() in ipmi_si_intf.c can loop on
> flag-driven commands (e.g. READ_EVENT_MSG_BUFFER) without ever calling
> start_next_msg(), starving waiting_msg indefinitely.
>
> Captured state at wedge:
>
> si_state=SI_GETTING_EVENTS msg_flags=0x02
> si_curr cycling cmd=0x35 (READ_EVENT_MSG_BUFFER)
> si_wait frozen cmd=0x08 (GET_DEVICE_GUID, never promoted)
>
> The cold reset makes the BMC report EVENT_MSG_BUFFER_FULL during
> re-init, which drives the flag loop.

The EVENT_MSG_BUFFER_FULL flag only gets cleared when an unsuccessful
READ_EVENT_MSG_BUFFER command completes. Getting data from the
BMC has higher priority than sending data to the BMC.

If the BMC continually reports success from READ_EVENT_MSG_BUFFER, then
that would certainly wedge the driver. But it would have to continually
report success for that command, which would be strange as it's supposed
to error out when the queue is empty.

If it's really something like that, I could also look at adding limits
for those operations.
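
Roughly what I have in mind, as an untested toy model rather than
anything from ipmi_si_intf.c. struct toy_smi, toy_step(), toy_run()
and MAX_CONSECUTIVE_EVENT_READS are all invented for this sketch, and
the value 16 is arbitrary. The model assumes a BMC that always reports
success for READ_EVENT_MSG_BUFFER, so the flag never clears.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cap on back-to-back flag-driven reads. */
#define MAX_CONSECUTIVE_EVENT_READS 16

/* Toy stand-in for the per-interface state; not struct smi_info. */
struct toy_smi {
	bool event_buffer_full;		/* models EVENT_MSG_BUFFER_FULL */
	bool waiting_msg;		/* a queued request from a user */
	unsigned int consecutive_reads;
};

/* Models a BMC that never fails READ_EVENT_MSG_BUFFER. */
static bool toy_read_event_succeeds(void)
{
	return true;
}

/* One scheduling step. Returns true if the queued message started. */
static bool toy_step(struct toy_smi *smi, bool limit_enabled)
{
	bool over_limit = limit_enabled &&
		smi->consecutive_reads >= MAX_CONSECUTIVE_EVENT_READS;

	/* Reading from the BMC wins unless the cap says to yield. */
	if (smi->event_buffer_full && !(over_limit && smi->waiting_msg)) {
		/* The flag only clears when the read fails. */
		if (!toy_read_event_succeeds())
			smi->event_buffer_full = false;
		smi->consecutive_reads++;
		return false;
	}

	if (smi->waiting_msg) {
		smi->waiting_msg = false;
		smi->consecutive_reads = 0;
		return true;
	}
	return false;
}

/* Returns the step at which the queued message started, or -1. */
static int toy_run(bool limit_enabled, int max_steps)
{
	struct toy_smi smi = {
		.event_buffer_full = true,
		.waiting_msg = true,
	};

	assert(max_steps >= 0);
	for (int i = 0; i < max_steps; i++)
		if (toy_step(&smi, limit_enabled))
			return i;
	return -1;
}
```

Without the cap, toy_run(false, 1000) never starts the queued message,
which matches the wedge you captured. With the cap, the queued message
gets a turn after 16 reads and the event reads resume afterwards.
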

To debug things like this I often add module_params that let me see what
is going on. But you can also look at the "invalid_events" counter to
see if the data is bogus. There should also be an "Event queue full,
discarding incoming events" message logged once when this first starts
happening.

-corey
>
> Thanks,
> Matt