Re: [PATCH] mctp i2c: check packet length before marking flow active
From: William A. Kennington III
Date: Wed Apr 29 2026 - 05:09:21 EST
On 4/23/26 21:16, Jeremy Kerr wrote:
> Hi William,
>
> > > Out of curiosity though, how did you hit the hdr_byte_count mismatch in
> > > the first place?
> >
> > Our current theory is that we have known buggy firmware on our NVME MCTP
> > devices and we are seeing some kind of corruption on the bus that we are
> > going to fix in on the firmware side.
>
> OK, sounds good for the overall fix, but I don't think that would be
> causing the path that you're addressing here. The fix is definitely
> valid, but can't be hit through any RX data corruption (we're in the
> TX path).

Yeah I think you might be right. The hard part is that this reproduces so infrequently for us that it takes a long time to iterate on testing these changes.

> The header byte count is populated during header construction, so a
> mismatch here would indicate modification of the skb between that point
> at the actual xmit. Do you see the "Bad TX len" warning in these cases?

I double checked and so far I can’t find evidence of it. Probably we still want to keep this change, but it’s not the root of our problems.
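
For reference, the ordering the patch is after can be modelled in userspace roughly as below. This is only a sketch, not the driver source: the struct and function names are hypothetical, and the 3-byte header overhead is an assumption made for the example. The point is that a packet dropped by the length check must not leave the flow marked active, otherwise a later release would drop a lock that was never taken.

```c
/* Userspace model only; all names here are hypothetical, not driver code. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum flow_state { FLOW_NEW, FLOW_ACTIVE, FLOW_INVALID };

struct flow {
	enum flow_state state;
};

struct bus {
	int lock_depth;	/* number of active flows currently holding the bus */
};

/* Assumed for this example: header bytes not covered by byte_count. */
#define HDR_OVERHEAD 3

/* Returns 0 if the packet would be sent, -1 if it is dropped. */
static int tx_one(struct bus *bus, struct flow *flow,
		  uint8_t hdr_byte_count, size_t pkt_len)
{
	/* Validate first: a dropped packet must not touch flow or lock state. */
	if ((size_t)hdr_byte_count + HDR_OVERHEAD != pkt_len)
		return -1;

	/* Only a packet that will really go out moves the flow to ACTIVE. */
	if (flow->state == FLOW_NEW) {
		flow->state = FLOW_ACTIVE;
		bus->lock_depth++;
	}
	return 0;
}

/* Drop the bus reference only if this flow actually took one. */
static void flow_release(struct bus *bus, struct flow *flow)
{
	if (flow->state == FLOW_ACTIVE) {
		assert(bus->lock_depth > 0);	/* would catch an underflow */
		bus->lock_depth--;
	}
	flow->state = FLOW_INVALID;
}
```

With the check placed after the state change instead, the bad-length case would return with the flow ACTIVE and lock_depth untouched, and the release would then decrement a count that was never incremented.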

> > We started also seeing kernel
> > crashes along with the bad firmware symptoms, walked through ~110 kdumps
> > and found i2c locks that were held by 2 owners (eeprom reading and the
> > MCTP TX queue).
>
> Just to clarify my understanding of the state: "being held by two
> owners" would indicate a violation of the lock itself. Or is it that
> there are two threads blocked waiting to acquire the mutex?

I think it’s actually this: 2 threads are waiting to acquire the lock. There was a theory that a lock underflow allowed 2 threads to acquire the lock, which led to this patch.

> For NVMe-MI, you're likely using manual tag allocation, where the tag
> allocation (and hence flow state) is entirely controlled by userspace.
> It may be that the NVMe protocol-level errors are causing that tags to
> be held for long durations, perhaps?

Yeah, this is very plausible given the device(s) stop responding correctly. I imagine we are getting stuck with manual allocations and not releasing locks. Can we reset the state machine back to NEW instead of holding the lock?
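
To make the question concrete, here is a rough userspace sketch of what I mean by resetting to NEW. It is purely illustrative: the names, the tick-based idle limit, and the idea of dropping the bus lock on idle are all assumptions for discussion, not existing driver behaviour. An active flow that sees no traffic for a while gives up the bus and goes back to NEW, so the next TX on that flow has to take the lock again.

```c
/* Illustrative model only; every name and the idle limit are hypothetical. */
#include <assert.h>
#include <stdbool.h>

enum mflow_state { MFLOW_NEW, MFLOW_ACTIVE };

struct mflow {
	enum mflow_state state;
	unsigned int idle_ticks;	/* ticks since the last TX on this flow */
};

struct mbus {
	int lock_depth;			/* active flows holding the bus */
};

#define MFLOW_IDLE_LIMIT 5		/* hypothetical idle limit, in ticks */

/* Called on each TX: take the bus on first use, then mark the flow busy. */
static void mflow_activate(struct mbus *bus, struct mflow *flow)
{
	if (flow->state == MFLOW_NEW) {
		flow->state = MFLOW_ACTIVE;
		bus->lock_depth++;
	}
	flow->idle_ticks = 0;
}

/* Periodic tick. Returns true if the flow was reset to NEW on this tick. */
static bool mflow_tick(struct mbus *bus, struct mflow *flow)
{
	if (flow->state != MFLOW_ACTIVE)
		return false;
	if (++flow->idle_ticks < MFLOW_IDLE_LIMIT)
		return false;

	/* Idle for too long: give the bus back instead of holding it. */
	assert(bus->lock_depth > 0);
	bus->lock_depth--;
	flow->state = MFLOW_NEW;
	flow->idle_ticks = 0;
	return true;
}
```

Whether dropping the lock like this is acceptable for devices that depend on the bus staying locked until the response arrives is exactly the part I am unsure about.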

> Cheers,
>
> Jeremy