Re: [PATCH] dirvers: scmi: poll again when transfer reach timeout
From: Sudeep Holla
Date: Tue Feb 11 2025 - 10:37:42 EST
On Thu, Jan 23, 2025 at 07:48:53PM +0000, Cristian Marussi wrote:
> On Thu, Jan 23, 2025 at 01:38:30PM +0300, Dan Carpenter wrote:
> > s/dirvers/drivers/
> >
> > On Thu, Jan 23, 2025 at 04:33:24PM +0800, jack21 wrote:
> > > From: Huangjie <huangjie1663@xxxxxxxxxxxxxx>
> > >
> > > spin_until_cond() not really hold a spin lock, possible timeout may occur
> > > in preemption kernel when preempted after spin_until_cond().
> > >
> > > We check status again when reach timeout is reached to prevent incorrect
> > > jugement of timeout.
> > >
>
> Hi,
>
> probably another not so short email of mine :P ...
>
> >
> > I suspect the real issue is that we exit the spin loop when
> > try_wait_for_completion(&xfer->done) is true. Probably we should add
> > that as a Fixes tag?:
> >
>
> The Kernel SCMI stack, acting as an SCMI agent have to cope with the
> possible (even though rare) scenario of receiving Out of Order messages
> when dealing with async commands...
>
> ...IOW, what to do if, after having issued an AsycnCmd, the Delayed response
> is received BEFORE the corresponding immediate reply to the initial request:
> such a wicked (and rare) situation could be the result of a misbehaving
> platform server OR simply due to parallellization of activies on the A2P
> and P2A on some transports, i.e. platform sent reply and delayed_response
> in the correct order but the transport delivered those OoO, being
> transmitted on 2 discinct physical channels...hard but not impossible.
>
> Kernel side we address this OoO scenario by assuming that, if a valid
> Delayed Response to a in-flight Async-cmd is received on teh P2A chan,
> before the immediate A2P reply, the transaction itself is good and we
> can progress by just logging and ignoring/swallowing the missing
> immediate-reply, that will probably arrive later, and just carry on
> processing the Delayed Response.
>
> In order to do that, we maintain, in fact, a per-message state-machine,
> and inside scmi_msg_response_validate(), when we detect the OoO condition,
> we cause the wait-for-immediate-reply to terminate by issuing forcibly a
> complete(xfer->done)....
>
> ...this works straight away for the non-polled IRQ transactions since
> causes the wait_for_completion() to terminate cleanly, BUT in order to
> cut-short also the busy-wait in the polling case we need that additional
> try_wait_for_completion()....so as not to spin forever for a message
> that we dont care anymore...
> [ note that any late arriving immediate-reply will be in teh future
> discarded at this point]
>
> For this reason, I think that this patch, while correctly checking the
> poll_done() condition when the spinloop terminates for the reason
> explained in the commit-message, it should also check if the loop was not
> instead forcibly terminated by the OoO scenario...if not, we will end up
> considering the polling to have timeout, while instead it was forcibly
> terminated by the OoO state machine complete() and just want to ignore
> such missing message...
>
> So what about instead of checking (untested):
>
> + if (!completion_done(&xfer->done) &&
> + !info->desc->ops->poll_done(cinfo, xfer))
Were you able to check this ? I am waiting to hear back from you on this.
--
Regards,
Sudeep