Re: [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id()
From: Corey Minyard
Date: Mon Apr 20 2026 - 14:11:37 EST
On Mon, Apr 20, 2026 at 11:33:01AM -0500, Corey Minyard wrote:
> On Sun, Apr 19, 2026 at 09:50:38PM +0100, Matt Fleming wrote:
> > On Fri, Apr 17, 2026 at 06:53:55PM -0500, Corey Minyard wrote:
> > >
> > > The EVENT_MSG_BUFFER_FULL flag only gets cleared when a unsuccessful
> > > READ_EVENT_MSG_BUFFER command completes. Getting data from the
> > > BMC has higher priority than sending data to the BMC.
> > >
> > > If the BMC continually reports success from READ_EVENT_MSG_BUFFER, then
> > > that would certainly wedge the driver. But it would have to continually
> > > report success for that command, which would be strange as its supposed
> > > to error out when the queue is empty.
> >
> > That does indeed appear to be what's happening.
> >
> > The implementation of intel-ipmi-oem's OpenBMC READ_EVENT_MSG_BUFFER
> > handler does not fail when there is nothing to read,
> >
> > https://github.com/openbmc/intel-ipmi-oem/blob/master/src/bridgingcommands.cpp#L704
>
> Actually, that is so clearly wrong that it's hard to imagine they made
> this mistake. Anyway, a defect needs to be filed against it. It will
> certainly break other software on other OSes.
>
> I think I have an easy workaround, though I'm guessing. I'm guessing
> they are returning zero data bytes. There's no check on the size at that
> point in the code (it's later).
>
> Can you try the following patch?
Actually ignore that, I was thinking too much about the other stuff and
didn't actually read my patch.
Here's a working patch:
diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
index 4a9e9de4d684..08c208cc64c5 100644
--- a/drivers/char/ipmi/ipmi_si_intf.c
+++ b/drivers/char/ipmi/ipmi_si_intf.c
@@ -630,7 +630,13 @@ static void handle_transaction_done(struct smi_info *smi_info)
*/
msg = smi_info->curr_msg;
smi_info->curr_msg = NULL;
- if (msg->rsp[2] != 0) {
+ /*
+ * It appears some BMCs, with no event data, return no
+ * data in the message and not a 0x80 error as the
+ * spec says they should. Shut down processing if
+ * the data is not the right length.
+ */
+ if (msg->rsp[2] != 0 || msg->rsp_size != 19) {
/* Error getting event, probably done. */
msg->done(msg);
>
> diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
> index 4a9e9de4d684..cf8674a93af1 100644
> --- a/drivers/char/ipmi/ipmi_si_intf.c
> +++ b/drivers/char/ipmi/ipmi_si_intf.c
> @@ -630,7 +630,13 @@ static void handle_transaction_done(struct smi_info *smi_info)
> */
> msg = smi_info->curr_msg;
> smi_info->curr_msg = NULL;
> - if (msg->rsp[2] != 0) {
> + /*
> + * It appears some BMCs, with no event data, return no
> + * data in the message and not a 0x80 error as the
> + * spec says they should. Shut down processing if
> + * the data is not the right length.
> + */
> + if (msg->rsp[2] != 0 || smi_info->curr_msg->rsp_size != 19) {
> /* Error getting event, probably done. */
> msg->done(msg);
>
>
> With your approval on that, I'll send it to Linus after letting it sit
> in the next tree for a bit. Actually, I'll probably add it in any case,
> as it's a good check to do.
>
> >
> > > If it's really something like that, I could also look at adding limits
> > > for those operations.
> >
> > That would be great. Me and Fred would be happy to test out any patch.
> >
> > I still think the original patch I sent is a worthwhile defense.
> >
> > Our periodic monitoring scripts cause TASK_UNINTERRUPTIBLE tasks to
> > block behind one another when we hit these kinds of issues in the IPMI
> > code. Untangling that across thousands of machines can be time
> > consuming and a more explicit EIO or ETIMEDOUT would help with triage.
>
> Unfortunately, that might have other issues, similar to the ones the
> people with the watchdog issue found. I'll look at it a bit, but those
> sorts of things would have to be scattered all over the code, not just
> in that one place. As you say, it would make debugging easier.
>
> I think adding a counter to the number of operations occuring from a
> single flag fetch would be a way to avoid this issue. That's going to
> take a little more time, but I'll definitely work on that.
>
> Thanks,
>
> -corey