Re: [PATCH 2/2] mailbox: Error out early if the mbox driver has failed to submit the message

From: Evgeny Shatokhin
Date: Sun Sep 18 2022 - 12:33:22 EST


Thank you for a quick reply!

On 16.09.2022 20:04, Jassi Brar wrote:
On Thu, Sep 15, 2022 at 11:50 AM Evgenii Shatokhin
<e.shatokhin@xxxxxxxxx> wrote:

mbox_send_message() places the pointer to the message to the queue
(add_to_rbuf) then calls msg_submit(chan) to submit the first of the
queued messaged to the mailbox. Some of mailbox drivers can return
errors from their .send_data() callbacks, e.g., if the message is
invalid or there is something wrong with the mailbox device.

The message can't be invalid because the client code is written for a
particular provider.

As of mainline kernel v6.0-rc5, there are mailbox controller drivers which check if the messages are valid in their .send_data() callbacks. Example:

drivers/mailbox/rockchip-mailbox.c, rockchip_mbox_send_data():
if (msg->rx_size > mb->buf_size) {
dev_err(mb->mbox.dev, "Transmit size over buf size(%d)\n",
mb->buf_size);
return -EINVAL;
}

Other examples are zynqmp_ipi_send_data() from drivers/mailbox/zynqmp-ipi-mailbox.c, ti_msgmgr_send_data() from drivers/mailbox/ti-msgmgr.c, etc.

If this is incorrect and the controller drivers should not do such things, I'd suggest to clearly state it in the docs, because it is far from obvious from Documentation/driver-api/mailbox.rst at the moment.

That is, one could state that checking if the messages to be transmitted are valid is a responsibility of the callers of mailbox API rather than of the controller driver.

I could prepare such patch for the docs. Objections?


Though it is possible for the mailbox controller to break down for
some reason. In that case, the blocking api will keep retrying until
successful.

As far as I can see from the code, the behaviour seems to be different.

mbox_send_message() calls msg_submit() to send the message the first time. If that fails, hrtimer is not armed, so there will be no attempts to send the message again till tx_out ms pass:

err = chan->mbox->ops->send_data(chan, data);
if (!err) {
chan->active_req = data;
chan->msg_count--;
}
exit:
spin_unlock_irqrestore(&chan->lock, flags);

if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
/* kick start the timer immediately to avoid delays */
spin_lock_irqsave(&chan->mbox->poll_hrt_lock, flags);
hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
spin_unlock_irqrestore(&chan->mbox->poll_hrt_lock, flags);
}

This is from msg_submit(). Thus, the hrtimer will not fire, tx_tick() will not be called until tx_out ms have passed, and no attempts to send the message again will be made here.

In addition, complete(&chan->tx_complete) will not be called, so, mbox_send_message() will have to needlessly wait whole tx_out ms.
Only after that, it will call tx_tick(chan, -ETIME), which will, in turn, call msg_submit() to try to send the message again. If the mbox has not recovered, sending will fail again.

Then, mbox_send_message() will exit with -ETIME. The pointer to the message will remain in chan->msg_data[], but the framework will not attempt to send it again until the client calls mbox_send_message() for another, possibly unrelated message.

In this case, to sum up, mbox_send_message():
* needlessly waits for tx_out ms;
* only tries to send the message twice rather than makes retries until successful;
* does not inform the client about the actual error happened, just returns -ETIME;
* keeps the pointer to the message in chan->msg_data[], which is too easy to overlook on the client side. Too easy for the client to, say, reuse the structure and cause trouble.

What I suggest is to leave it to the client (or some other provider-specific code using the client) what to do with the failures.

If the error is reported by the controller driver, don't wait in mbox_send_message(), just pass the error to the client and exit. If the client decides to ignore the error - OK, its problem. Or - it may kick the mbox device somehow in a provider-specific way to make it work, or - reset the channel, or - do anything else to make things work again.

The behaviour of mbox_send_message() would then become more consistent: either it has sent the message successfully or it failed and returned an error, without side-effects (like the pointer to that message kept in the internal buffer).

I do not think this change would break the existing controller drivers and client drivers.

What do you think?

But ideally the client, upon getting -ETIME, should free()
and request() the channel reset it (because controller drivers usually
don't contain the logic to automatically reset upon some error).

thanks.

Regards,
Evgenii