Re: [PATCH v3] firmware: arm_scmi: Free mailbox channels if probe fails

From: Cristian Marussi
Date: Sun Nov 07 2021 - 13:22:21 EST


On Sun, Nov 07, 2021 at 10:34:07AM +0000, Cristian Marussi wrote:
> On Fri, Nov 05, 2021 at 10:40:59AM -0700, rishabhb@xxxxxxxxxxxxxx wrote:
> > On 2021-11-05 02:43, Cristian Marussi wrote:
> > > On Thu, Nov 04, 2021 at 04:40:03PM -0700, rishabhb@xxxxxxxxxxxxxx wrote:
> > > > On 2021-11-02 04:32, Sudeep Holla wrote:
> > > > > On Mon, Nov 01, 2021 at 09:35:42AM -0700, rishabhb@xxxxxxxxxxxxxx wrote:
> > > > > > On 2021-09-01 02:35, Cristian Marussi wrote:
> > > > > > > On Tue, Aug 31, 2021 at 06:48:35AM +0100, Cristian Marussi wrote:
> > > > > > > > On Mon, Aug 30, 2021 at 02:09:37PM -0700, rishabhb@xxxxxxxxxxxxxx
> > > > > > > > wrote:
> > > > > > > > > Hi Christian
> > > > > > > >
> > > > > > > > Hi Rishabh,
> > > > > > > >
> > >
> > > Hi Rishabh,
> > >
>
> Hi Rishabh,
>

Hi Rishabhb,

> > > apologies for the delay in coming back to you.
> > > A few comments below.
> > >
> > > > > > > > thanks for looking into this kind of bad interactions.
> > > > > > > >
> > > > > > > > > There seems to be another issue here. The response from agent can be delayed
> > > > > > > > > causing a timeout during base protocol acquire,
> > > > > > > > > which leads to the probe failure. What I have observed is sometimes the
> > > > > > > > > failure of probe and rx_callback (due to a delayed message)
> > > > > > > > > happens at the same time on different cpus.
> > > > > > > > > Because of this race, the device memory may be cleared while the
> > > > > > > > > interrupt(rx_callback) is executing on another cpu.
> > > > > > > >
> > > > > > > > You are right that concurrency was not handled properly in this kind
> > > > > > > > of
> > > > > > > > context and moreover, if you think about it, even the case of out of
> > > > > > > > order reception of responses and delayed_responses (type2 SCMI
> > > > > > > > messages)
> > > > > > > > for asynchronous SCMI commands was not handled properly.
> > > > > > > >
> > > > > > > > > How do you propose we solve this? Do you think it is better to take the
> > > > > > > > > setting up of base and other protocols out of probe and
> > > > > > > > > in some delayed work? That would imply the device memory is not released
> > > > > > > > > until remove is called. Or should we add locking to
> > > > > > > > > the interrupt handler(scmi_rx_callback) and the cleanup in probe to avoid
> > > > > > > > > the race?
> > > > > > > > >
> > > > > > > >
> > > > > > > > These issues were more easily exposed by SCMI Virtio transport, so in
> > > > > > > > the series where I introduced scmi-virtio:
> > > > > > > >
> > > > > > > > https://lore.kernel.org/linux-arm-kernel/162848483974.232214.9506203742448269364.b4-ty@xxxxxxx/
> > > > > > > >
> > > > > > > > (which is now queued for v5.15 ... now on -next I think...finger
> > > > > > > > crossed)
> > > > > > > >
> > > > > > > > I took the chance to rectify a couple of other things in the SCMI core
> > > > > > > > in the initial commits.
> > > > > > > > As an example, in the above series
> > > > > > > >
> > > > > > > > [PATCH v7 05/15] firmware: arm_scmi: Handle concurrent and
> > > > > > > > out-of-order messages
> > > > > > > >
> > > > > > > > cares to add a refcount to xfers and some locking on xfers between TX
> > > > > > > > and RX path to avoid that a timed out xfer can vanish while the rx
> > > > > > > > path
> > > > > > > > is concurrently working on it (as you said); moreover I handle the
> > > > > > > > condition (rare if not unplausible anyway) in which a transport
> > > > > > > > delivers
> > > > > > > > out of order responses and delayed responses.
> > > > > > > >
> > > > > > > > I tested this scenarios on some fake emulated SCMI Virtio transport
> > > > > > > > where I could play any sort of mess and tricks to stress this limit
> > > > > > > > conditions, but you're more than welcome to verify if the race you are
> > > > > > > > seeing on Base protocol time out is solved (as I would hope :D) by
> > > > > > > > this
> > > > > > > > series of mine.
> > > > > > > >
> > > > > > > > Let me know, any feedback is welcome.
> > > > > > > >
> > > > > > > > Btw, in the series above there are also other minor changes, but there
> > > > > > > > is also another more radical change needed to ensure correctness and
> > > > > > > > protection against stale old messages which maybe could interest you
> > > > > > > > in general if you are looking into SCMI:
> > > > > > > >
> > > > > > > > [PATCH v7 04/15] firmware: arm_scmi: Introduce monotonically
> > > > > > > > increasing tokens
> > > > > > > >
> > > > > > > > Let me know if yo have other concerns.
> > > > > > > >
> > > > > > >
> > > > > > > Hi Rishabhb,
> > > > > > >
> > > > > > > just a quick remark, thinking again about your fail @probe scenario
> > > > > > > above
> > > > > > > I realized that while the concurrency patch I mentioned above could help
> > > > > > > on
> > > > > > > races against vanishing xfers when late timed-out responses are
> > > > > > > delivered,
> > > > > > > here we really are then also shutting down everything on failure, so
> > > > > > > there
> > > > > > > could be further issues between a very late invokation of
> > > > > > > scmi_rx_callback
> > > > > > > and the core devm_ helpers freeing the underlying xfer/cinfo/etc..
> > > > > > > structs
> > > > > > > used by scmi-rx-callback itself (maybe this was already what you meant
> > > > > > > and
> > > > > > > I didn't get it,...sorry)
> > > > > > >
> > > > > > > On the other side, I don't feel that delaying Base init to a deferred
> > > > > > > worker is a viable solution since we need Base protocol init to be
> > > > > > > initialized and we need to just give up if we cannot communicate with
> > > > > > > the SCMI platform fw in such early stages. (Base protocol is really the
> > > > > > > only mandatory proto is I remember correctly the spec)
> > > > > > >
> > > > > > > Currenly I'm off and only glancing at mails but I'll have a thought
> > > > > > > about
> > > > > > > these issues once back in a few weeks time.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Cristian
> > > > > > >
> > > > > > Hi Cristian
> > > > > > I hope you enjoyed your vacation. Did you get a chance to look at
> > > > > > the issue
> > > > > > stated above and have some idea as to how to solve this?
> > > > >
> > > > > Do you still see the issue with v5.15 ? Can you please check if haven't
> > > > > already done that ?
> > > > >
> > > > > Also 30ms delay we have is huge IMO and we typically expect the
> > > > > communication
> > > > > with remote processor or any entity that implements SCMI to happen in
> > > > > terms
> > > > > of one or few ms tops.
> > > > >
> > > > > If there is a race, we need to fix that but I am interested in knowing
> > > > > why the default time of 30ms not sufficient ? Did increasing that helps
> > > > > and is this timeout happening only for the initial commands(guessing the
> > > > > SCMI firmware is not yet ready) or does it happen even during run-time ?
> > > >
> > > > Hi Sudeep
> > > > I haven't checked on 5.15 but after glancing at the code I believe
> > > > we should
> > > > see the same issue.
> > > > I agree 30ms is a big enough value and should be something that remote
> > > > firmware should resolve. But
> > > > if remote firmware goes into a bad state and not functioning
> > > > properly at
> > > > least kernel should not panic.
> > > >
> > > > The issue we see here happens during scmi probe. The response from the
> > > > remote agent can be delayed
> > > > causing a timeout during base protocol acquire, which leads to the
> > > > probe
> > > > failure.
> > > > What I have observed is sometimes the failure of probe and
> > > > rx_callback (due
> > > > to a delayed message)
> > > > happens around the same time on different cpus. Because of this
> > > > race, the
> > > > device memory may be cleared
> > > > while the interrupt(rx_callback) is executing on another cpu.
> > >
> > > So I was looking at the failure path you mentioned: a late concurrent
> > > reply on Base protocol from the fw, during the probe, leads to an
> > > invocation
> > > of scmi_rx_callback() on a different CPU while core data structs like
> > > cinfo are being freed by the SCMI core on the probe failure path.
> > > (v5.15-added SCMI concurrrency handling stuff I mentiond shuld help for
> > > races regarding xfer but not for the cinfo stuff in this case ...)
> > >
> > > We cannot defer Base proto init since we just wanna fail early while
> > > probing if not even the Base protocol can work fine, and also because
> > > Base protocol information are indeed needed for initial setup, so we
> > > cannot juts proceed if we did not even got a Base reply on the number of
> > > protos. (already said)
> > >
> > > In my opinion, the proper way to address this kind of races at probe
> > > failure should be to ensure that the transport you are using is properly
> > > shut down completely before cleanup starts (same applies for a clean
> > > remove), i.e. scmi_rx_callback should not even be possibly registered to
> > > be called when the the final cleanup by the core is started (devm_ frees
> > > I mean after scmi_probe exit failing...)
> > >
> > > BUT indeed looking back at transport layers like mailbox and virtio,
> > > this
> > > should be happening already, because the flow is like
> > >
> > > scmi_probe()
> > > {
> > > ...
> > >
> > > clean_tx_rx_setup:
> > > scmi_cleanup_txrx_channels()
> > > ....
> > > --->>> ret = idr_for_each(idr, info->desc->ops->chan_free, idr);
> > > -
> > > return ret;
> > > }
> > >
> > > .... only after this scmi_probe returns the core devm layer starts
> > > freeing devm_
> > > allocated stuff like cinfo, AND the above per-transport specific
> > > .chan_free seems
> > > to take care to 'deactivate/dregister' the scmi_rx_callback at the
> > > transport layer:
> > >
> > >
> > > e.g. MBOX transport
> > > -------------------------
> > > static int mailbox_chan_free(int id, void *p, void *data)
> > > {
> > > struct scmi_chan_info *cinfo = p;
> > > struct scmi_mailbox *smbox = cinfo->transport_info;
> > >
> > > if (smbox && !IS_ERR(smbox->chan)) {
> > > mbox_free_channel(smbox->chan); <<< THIS MBOX CORE CALL DEACTIVATE
> > > cinfo->transport_info = NULL;
> > >
> > >
> > > e.g. VIRTIO Transport
> > > -----------------------------
> > > static int virtio_chan_free(int id, void *p, void *data)
> > > {
> > > unsigned long flags;
> > > struct scmi_chan_info *cinfo = p;
> > > struct scmi_vio_channel *vioch = cinfo->transport_info;
> > >
> > > spin_lock_irqsave(&vioch->ready_lock, flags);
> > > vioch->ready = false; <<<< THIS VIRTIO FLAG
> > > DEACTIVATE VIRTIO CBS INVOKCATION
> > > spin_unlock_irqrestore(&vioch->ready_lock, flags);
> > >
> > >
> > > ... AND both of the above call are indeed also spinlocked heavily, so
> > > that
> > > the 'deactivation' of the scmi_rx_callback should be visible properly;
> > > in
> > > other words I would expect that after the above .chan_free() have
> > > completed the scmi_rx_callback() cannot be called anymore, because the
> > > transport itself will properly drop any so-late fw reply.
> > >
> > > So I am now wondering, which transport are you using in your tests ?
> > > since at least for the above 2 example it seems to me that your
> > > race-on-probe failure condition should be already addressed by the
> > > transport layer itself....or am I getting wrong the nature of the race ?
> > >
> > > Thanks
> > > Cristian
> >
> > Hi Cristian
> > You caught the scenario perfectly. But there is still a possibility of a
> > race. To be clear we use
> > the mbox transport. Let me explain in more detail.
> > Lets assume that the last command (base protocol acquire) kernel sent to
> > remote agent timed out.
> > This will lead to final cleanup before exiting probe like you mentioned.
> > Once cleanup is done(mailbox_chan_free)
> > no more responses from remote agent will acknowledged but if the response
> > comes in between the cleanup in probe
> > and the last command timing out we will see a race since the response can
> > come asynchronously. In this scenario cleanup
> > and scmi_rx_callback race with each other.
> > I believe to solve this we need to synchronize cleanup with
> > scmi_rx_callback. we can serialize these two paths
> > and exit early in rx_callback if cleanup has been completed.
> >
>
> Yes indeed, but my concern is also not to introduce to much contention
> on the RX path (with irqsave spinlocking & friends), given that this racy
> scenario has surely to be handled but it is also highly unlikely, so I don't
> want to slow down all the rx path all the time.
>
> So I tried something along this lines:
>
> ----8<------
> diff --git a/drivers/firmware/arm_scmi/common.h b/drivers/firmware/arm_scmi/common.h
> index dea1bfbe1052..036f8ccff450 100644
> --- a/drivers/firmware/arm_scmi/common.h
> +++ b/drivers/firmware/arm_scmi/common.h
> @@ -340,11 +340,13 @@ void scmi_protocol_release(const struct scmi_handle *handle, u8 protocol_id);
> * channel
> * @handle: Pointer to SCMI entity handle
> * @transport_info: Transport layer related information
> + * @users: A refcount to track active users of this channel
> */
> struct scmi_chan_info {
> struct device *dev;
> struct scmi_handle *handle;
> void *transport_info;
> + refcount_t users;
> };
>
> /**
> diff --git a/drivers/firmware/arm_scmi/driver.c b/drivers/firmware/arm_scmi/driver.c
> index b406b3f78f46..5814ed3f444e 100644
> --- a/drivers/firmware/arm_scmi/driver.c
> +++ b/drivers/firmware/arm_scmi/driver.c
> @@ -678,6 +678,16 @@ static void scmi_handle_response(struct scmi_chan_info *cinfo,
> scmi_xfer_command_release(info, xfer);
> }
>
> +static inline bool scmi_acquire_channel(struct scmi_chan_info *cinfo)
> +{
> + return refcount_inc_not_zero(&cinfo->users);
> +}
> +
> +static inline void scmi_release_channel(struct scmi_chan_info *cinfo)
> +{
> + return refcount_dec(&cinfo->users);
> +}
> +
> /**
> * scmi_rx_callback() - callback for receiving messages
> *
> @@ -695,6 +705,10 @@ void scmi_rx_callback(struct scmi_chan_info *cinfo, u32 msg_hdr, void *priv)
> {
> u8 msg_type = MSG_XTRACT_TYPE(msg_hdr);
>
> + /* Bail out if channel freed already */
> + if (!scmi_acquire_channel(cinfo))
> + return;
> +
> switch (msg_type) {
> case MSG_TYPE_NOTIFICATION:
> scmi_handle_notification(cinfo, msg_hdr, priv);
> @@ -707,6 +721,8 @@ void scmi_rx_callback(struct scmi_chan_info *cinfo, u32 msg_hdr, void *priv)
> WARN_ONCE(1, "received unknown msg_type:%d\n", msg_type);
> break;
> }
> +
> + scmi_release_channel(cinfo);
> }
>
> /**
> @@ -1506,10 +1522,27 @@ static int scmi_chan_setup(struct scmi_info *info, struct device *dev,
> return ret;
> }
>
> + refcount_set(&cinfo->users, 1);
> cinfo->handle = &info->handle;
> return 0;
> }
>
> +static int scmi_chan_free(int id, void *p, void *data)
> +{
> + struct scmi_chan_info *cinfo = p;
> + struct scmi_info *info = handle_to_scmi_info(cinfo->handle);
> +
> + if (refcount_dec_and_test(&cinfo->users)) {
> + info->desc->ops->chan_free(id, cinfo, data);
> + } else {
> + /* Stall till the ongoing rx_callback completes */
> + spin_until_cond(refcount_read(&cinfo->users) == 0);
> + info->desc->ops->chan_free(id, cinfo, data);
> + }
> +
> + return 0;
> +}
> +
> static inline int
> scmi_txrx_setup(struct scmi_info *info, struct device *dev, int prot_id)
> {
> @@ -1792,11 +1825,11 @@ static int scmi_cleanup_txrx_channels(struct scmi_info *info)
> int ret;
> struct idr *idr = &info->tx_idr;
>
> - ret = idr_for_each(idr, info->desc->ops->chan_free, idr);
> + ret = idr_for_each(idr, scmi_chan_free, idr);
> idr_destroy(&info->tx_idr);
>
> idr = &info->rx_idr;
> - ret = idr_for_each(idr, info->desc->ops->chan_free, idr);
> + ret = idr_for_each(idr, scmi_chan_free, idr);
> idr_destroy(&info->rx_idr);
>
> return ret;
>
> ------8<-----
>
> Can you give it a go on your setup ?
>
> Beware it is not really tested on the racy error path (:P) and I could have
> still missed something regarding synchro (and I expect an undesired refcount
> warn on the scmi_release_channel too when the race is hit....but just to
> experiment a bit for now and see if something like this could be enough while
> avoiding further locking)
> o

Looking back at this patch of mine, even though it could work for the racy issue at
hand, it is currently clearly completely broken on the regular unload/free
flow since cinfo structs can be re-used multiple times.
Sorry, please ignore this attempt, I'll rework in a more sensible way.

Thanks,
Cristian