Re: [PATCH 2/2] mailbox: Error out early if the mbox driver has failed to submit the message
From: Evgeny Shatokhin
Date: Sun Sep 18 2022 - 12:33:22 EST
Thank you for a quick reply!
On 16.09.2022 20:04, Jassi Brar wrote:
On Thu, Sep 15, 2022 at 11:50 AM Evgenii Shatokhin
<e.shatokhin@xxxxxxxxx> wrote:
mbox_send_message() places the pointer to the message to the queue
(add_to_rbuf) then calls msg_submit(chan) to submit the first of the
queued messaged to the mailbox. Some of mailbox drivers can return
errors from their .send_data() callbacks, e.g., if the message is
invalid or there is something wrong with the mailbox device.
The message can't be invalid because the client code is written for a
particular provider.
As of mainline kernel v6.0-rc5, there are mailbox controller drivers
which check if the messages are valid in their .send_data() callbacks.
Example:
drivers/mailbox/rockchip-mailbox.c, rockchip_mbox_send_data():
if (msg->rx_size > mb->buf_size) {
dev_err(mb->mbox.dev, "Transmit size over buf size(%d)\n",
mb->buf_size);
return -EINVAL;
}
Other examples are zynqmp_ipi_send_data() from
drivers/mailbox/zynqmp-ipi-mailbox.c, ti_msgmgr_send_data() from
drivers/mailbox/ti-msgmgr.c, etc.
If this is incorrect and the controller drivers should not do such
things, I'd suggest to clearly state it in the docs, because it is far
from obvious from Documentation/driver-api/mailbox.rst at the moment.
That is, one could state that checking if the messages to be transmitted
are valid is a responsibility of the callers of mailbox API rather than
of the controller driver.
I could prepare such patch for the docs. Objections?
Though it is possible for the mailbox controller to break down for
some reason. In that case, the blocking api will keep retrying until
successful.
As far as I can see from the code, the behaviour seems to be different.
mbox_send_message() calls msg_submit() to send the message the first
time. If that fails, hrtimer is not armed, so there will be no attempts
to send the message again till tx_out ms pass:
err = chan->mbox->ops->send_data(chan, data);
if (!err) {
chan->active_req = data;
chan->msg_count--;
}
exit:
spin_unlock_irqrestore(&chan->lock, flags);
if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
/* kick start the timer immediately to avoid delays */
spin_lock_irqsave(&chan->mbox->poll_hrt_lock, flags);
hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
spin_unlock_irqrestore(&chan->mbox->poll_hrt_lock, flags);
}
This is from msg_submit(). Thus, the hrtimer will not fire, tx_tick()
will not be called until tx_out ms have passed, and no attempts to send
the message again will be made here.
In addition, complete(&chan->tx_complete) will not be called, so,
mbox_send_message() will have to needlessly wait whole tx_out ms.
Only after that, it will call tx_tick(chan, -ETIME), which will, in
turn, call msg_submit() to try to send the message again. If the mbox
has not recovered, sending will fail again.
Then, mbox_send_message() will exit with -ETIME. The pointer to the
message will remain in chan->msg_data[], but the framework will not
attempt to send it again until the client calls mbox_send_message() for
another, possibly unrelated message.
In this case, to sum up, mbox_send_message():
* needlessly waits for tx_out ms;
* only tries to send the message twice rather than makes retries until
successful;
* does not inform the client about the actual error happened, just
returns -ETIME;
* keeps the pointer to the message in chan->msg_data[], which is too
easy to overlook on the client side. Too easy for the client to, say,
reuse the structure and cause trouble.
What I suggest is to leave it to the client (or some other
provider-specific code using the client) what to do with the failures.
If the error is reported by the controller driver, don't wait in
mbox_send_message(), just pass the error to the client and exit. If the
client decides to ignore the error - OK, its problem. Or - it may kick
the mbox device somehow in a provider-specific way to make it work, or -
reset the channel, or - do anything else to make things work again.
The behaviour of mbox_send_message() would then become more consistent:
either it has sent the message successfully or it failed and returned an
error, without side-effects (like the pointer to that message kept in
the internal buffer).
I do not think this change would break the existing controller drivers
and client drivers.
What do you think?
But ideally the client, upon getting -ETIME, should free()
and request() the channel reset it (because controller drivers usually
don't contain the logic to automatically reset upon some error).
thanks.
Regards,
Evgenii