Re: [RFC] IPMI state machine regression

From: Andrew Banman
Date: Thu Aug 23 2018 - 12:52:38 EST


On Thu, Aug 23, 2018 at 11:22:58AM -0500, Corey Minyard wrote:
> On 08/22/2018 11:23 AM, Andrew Banman wrote:
> > On Wed, Aug 22, 2018 at 11:14:52AM -0500, Corey Minyard wrote:
> > > On 08/21/2018 05:14 PM, Andrew Banman wrote:
> > > > Dear IPMI supporters,
> > > >
> > > > We observe a window in IPMI BT's opportunistic get capabilities request,
> > > > wherein GET_DEVICE_GUID and GET_DEVICE_ID requests may start while the BT state
> > > > machine is in WR_CONSUME. Following this, the 0xD5 error code is forced in
> > > > bt_start_transaction, IPMI fails to initialize, and the interface is torn down.
> > > > There is no mechanism to retry bringing up the interface in open() /dev/ipmi.
> > > > This leaves IPMI hosed until you reload modules. Looks to happen after we call
> > > > schedule().
> > > When was the latest kernel where this worked properly?  Also, what hardware
> > > is this?
> > This is UV4.
> >
> > First known bad commit, but I am not sure if the timing issue predates
> > it:
> >
> > commit aa9c9ab2443e3b9562c6c7cfc245a9e43b557d14
> > Author: Jeremy Kerr <jk@xxxxxxxxxx>
> > Date: Fri Aug 25 15:47:24 2017 +0800
> >
> > ipmi: allow dynamic BMC version information
> >
> > Hits less frequently with older kernels so I didn't see it until
> > recently when it became more frequent.
>
> Ok, that's for the crash, which makes sense.  But that's an easy problem to
> fix.
> I would like a "Tested-by" on that, if you get to test it, though I was able
> to
> simulate various failures there to test it out.

Testing it today. We'll respond to the patch you sent out. Thanks for
fixing that!

>
> So reading between the lines ("more frequent") I'm guessing that this still
> happened with older kernels, but is becoming annoying with newer kernels.
> I would guess recent changes causes it to happen more often due to changes
> in the way the upper layer interacts with the lower layers, you will have
> more
> messages at startup, and the timing is somewhat different.
>
> The BT code itself hasn't changed much in over 10 years.  Nothing that
> looks like it would cause an issue like this.  So I would guess this is an
> issue that has been around for a while.

That's what I suspect as well.

>
> I don't have any real hardware with a BT interface, just the one in qemu,
> but I've never seen it there.
>
> It actually looks like the state machine is working ok.  But the BMC is
> responding to a "Get Device ID" command with:
>
> Recv:: 1c 08 d5

That completion code is spoofed in after bt_start_transaction when bt->state
is in WR_CONSUME. This is pretty visible from the message trace, but here
is another with an explicit debug printk (below).

Maybe I'm totally reading this wrong, but I think that err gets passed
on to the upper layer:

start_transaction: return 0xD5 -> return_hosed_msg -> deliver_recv_msg
-> ipmi_smi_msg_received


[ 120.367379] BT: IDLE [ ] TO=5000000 - 0
[ 120.367384] BT: CAP_BEGIN [ ] TO=5000000 - 0
[ 120.376634] **Timer: 1534434704.376633453
[ 120.376646] BT: XACTION [ ] TO=5000000 - 92000
[ 120.376683] BT: WR_BYTES [ ] TO=4908000 - 0
[ 120.376995] BT: WR_CONSUME [ H2B ] TO=4908000 - 0
[ 120.377025] Send: 18 08
[ 120.377027] **Enqueue: 1534434704.377028726
[ 120.377028] **Start2: 1534434704.377029859
# Added DBG # [ 120.377028] bt_start_transaction returned IPMI_NOT_IN_MY_STATE_ERR 0xd5
# Added DBG # [ 120.377030] 18
# Added DBG # [ 120.377030] 08
# Added DBG # [ 120.377030]
[ 120.377061] BT: WR_CONSUME [ ] TO=4908000 - 0
[ 120.377137] BT: RD_WAIT [ ] TO=4908000 - 0
[ 120.377140] Recv:: 1c 08 d5
[ 120.377156] Send: 18 01
[ 120.377161] **Enqueue: 1534434704.377161047
[ 120.377164] **Start2: 1534434704.377164751
# Added DBG # [ 120.377165] bt_start_transaction returned IPMI_NOT_IN_MY_STATE_ERR 0xd5
# Added DBG # [ 120.377166] 18
# Added DBG # [ 120.377167] 01
# Added DBG # [ 120.377168]
[ 120.377226] BT: RD_WAIT [ ] TO=4908000 - 0
[ 120.377237] Recv:: 1c 01 d5
[ 120.377242] ipmi_si IPI0001:00: IPMI message handler: device id demangle failed: -22
[ 120.377252] ipmi_si IPI0001:00: Unable to get the device id: -5
[ 120.377252] ipmi_si IPI0001:00: Unable to register device: error -5


diff --git a/drivers/char/ipmi/ipmi_bt_sm.c b/drivers/char/ipmi/ipmi_bt_sm.c
index d99d744..1be4de9 100644
--- a/drivers/char/ipmi/ipmi_bt_sm.c
+++ b/drivers/char/ipmi/ipmi_bt_sm.c
@@ -24,7 +24,7 @@
* value
*/

-static int bt_debug = 4; /* 0 == BT_DEBUG_OFF */
+static int bt_debug = 2; /* 0 == BT_DEBUG_OFF */

module_param(bt_debug, int, 0644);
MODULE_PARM_DESC(bt_debug, "debug bitmask, 1=enable, 2=messages, 4=states");
@@ -217,8 +217,14 @@ static int bt_start_transaction(struct si_sm_data *bt,
if (bt->state == BT_STATE_LONG_BUSY)
return IPMI_NODE_BUSY_ERR;

- if (bt->state != BT_STATE_IDLE)
+ if (bt->state != BT_STATE_IDLE) {
+ pr_crit("bt->state = %d : bt_start_transaction returned "
+ "IPMI_NOT_IN_MY_STATE_ERR 0xd5\n", bt->state);
+ for (i = 0; i < size; i ++)
+ printk(" %02x", data[i]);
+ printk("\n");
return IPMI_NOT_IN_MY_STATE_ERR;
+ }

if (bt_debug & BT_DEBUG_MSG) {
printk(KERN_WARNING "BT: +++++++++++++++++ New command\n");

>
>
> That's an error response with D5, which is "Cannot execute command.
> Command, or request parameter(s), not supported in present state."
> That's an error response from your BMC.  That particular command
> shouldn't ever respond with that error, so I think the bug here is
> with your BMC.

My BMC engineers and I cannot reproduce the problem running ipmi commands
direct from the BMC.

>
> -corey
>
>

Thank you,

Andrew