Re: [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id()

From: Matt Fleming

Date: Fri Apr 17 2026 - 11:41:24 EST


On Wed, Apr 15, 2026 at 07:16:53AM -0500, Corey Minyard wrote:
>
> I've seen this before in several scenarios, including a system that put
> IPMI in the ACPI tree and it sort of worked but there was no BMC
> present. I had to disable that particular device.
>
> What hardware is involved here?

I'm fairly sure we've seen this across a bunch of different BMCs, so I
don't think it's specific to any one BMC's hardware. It's almost
certainly a driver issue.

> Can you give a more detailed example of what's happening in the
> low-level hardware? If it's KCS there's a debug flag in the
> drivers/char/ipmi/ipmi_kcs_sm.c file that should help.

Yep, it's KCS. Unfortunately I haven't found a way to reproduce this
reliably yet.

Looking at a wedged machine (cat /sys/class/ipmi/.../firmware_revision)
with drgn, I can see that there are 99,846 messages sitting on
intf->xmit_msgs while the KCS SM is idle (it's still responding to
internal traffic like Get Global Enables and Get Msg Flags). So it looks
like completions are getting dropped.

We're running a 6.18.18 kernel which includes c08ec55617cb ("ipmi: Fix
use-after-free and list corruption on sender error"), so it's not that.

Here's a dump of some of the data structures.

intf = 0xffff9d2f4a5a0000
intf->curr_msg = 0xffff9d34f21a9c00
intf->xmit_msgs.next = 0xffff9d30c28e3c80
intf->waiting_rcv_msgs = empty
intf->maintenance_mode = 0
intf->maintenance_mode_state = 0
intf->in_shutdown = false
intf->seq_table = 0/64 slots used
intf->smi_work.pending = 0

The stuck message itself, intf->curr_msg:

msg @ 0xffff9d34f21a9c00
.data = { 0x18, 0x01 } # NetFn 0x06 (App), cmd 0x01 = Get Device ID
.data_size = 2
.rsp_size = 38
.rsp[0..7] = 2c 01 00 00 ...
.done = free_smi_msg
.user_data = NULL
.msgid = (internal GDI poll)
.type = IPMI_SMI_MSG_TYPE_NORMAL
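
In case the NetFn annotation on .data above isn't obvious: the first byte of an IPMI message packs the NetFn into the upper six bits and the LUN into the lower two. A rough sketch of the decode follows. The helper name is made up for illustration and is not something from the driver.

```python
def decode_netfn_lun(first_byte):
    """Split the first IPMI message byte into (NetFn, LUN).

    Hypothetical helper for illustration only; not part of the kernel
    driver or of drgn.
    """
    netfn = (first_byte >> 2) & 0x3F  # upper six bits
    lun = first_byte & 0x03           # lower two bits
    return netfn, lun


# First byte of .data from the dump above
netfn, lun = decode_netfn_lun(0x18)
print(f"NetFn {netfn:#04x}, LUN {lun}")  # prints "NetFn 0x06, LUN 0"
```

So 0x18 is NetFn 0x06 (App) on LUN 0, and the second byte, 0x01, is the command (Get Device ID).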


smi_info = 0xffff9d2f4a010000
smi_info->si_state = SI_NORMAL (0)
smi_info->curr_msg = 0xffff9d2f48c7b800
smi_info->waiting_msg = NULL
smi_info->interrupt_disabled = false
smi_info->supports_event_msg_buff = true
smi_info->io.irq = 0
smi_info->run_to_completion = false
smi_info->in_maintenance_mode = 0

Let me know if you want any other info. I'll try to trace this and
catch it reproducing.