Re: [PATCH] mctp i2c: check packet length before marking flow active
From: William A. Kennington III
Date: Thu May 07 2026 - 03:50:53 EST
On 5/6/26 01:01, Jeremy Kerr wrote:
> Hi William,
>
> OK, that's good news!
>
> Just to clarify my understanding of the state: "being held by two
> owners" would indicate a violation of the lock itself. Or is it that
> there are two threads blocked waiting to acquire the mutex?

I think it's actually this: 2 threads are waiting to acquire the lock.

The theory that led to this patch was that a lock underflow allowed
2 threads to acquire the lock.

> For NVMe-MI, you're likely using manual tag allocation, where the tag
> allocation (and hence flow state) is entirely controlled by userspace.
> It may be that the NVMe protocol-level errors are causing tags to
> be held for long durations, perhaps?

Yeah, this is very plausible given the device(s) stop responding
correctly. I imagine we are getting stuck with manual allocations and
not releasing locks. Can we reset the state machine back to NEW instead
of holding the lock?

> Not sure what you're referring to here; if the userspace application is
> not releasing the tag, we have to keep the i2c bus locked, otherwise we
> may not receive a response from the device.

Isn't this inherently an approach asking for trouble, where a
potentially buggy userspace can starve out other applications which
need to access the bus? For us, we have FRU devices on the bus that are
periodically rescanned or accessed for various reasons alongside the
MCTP endpoint on the NVMe device.

> The one case I can think of (in upstream infrastructure, at least) is
> that this might be triggered by the device reporting a long MPRT value,
> and then a response gets lost. libnvme is respecting the MPRT, and not
> releasing the tag for that (excessive) duration.

Yeah, I'll have to look at the specific firmware bug more, but I don't
think it's been fully root-caused yet.

> However, the tag -> i2c lock associations are only useful if you have
> muxes in the i2c topology. Is that the case on your platform? If not,
> perhaps we could elide all the bus locking when we can detect that...

We have at least 1 layer of mux before each NVMe and FRU device.

> Cheers,
>
>
> Jeremy