Re: [PATCH] RFC: mailbox: Fix NULL message support in

From: Joonwon Kang

Date: Thu Mar 12 2026 - 06:19:28 EST


> Hi,
>
> On Tue, Mar 10, 2026 at 5:46 PM Jassi Brar <jassisinghbrar@xxxxxxxxx> wrote:
> >
> > On Tue, Mar 10, 2026 at 7:15 PM Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, Mar 10, 2026 at 4:59 PM Jassi Brar <jassisinghbrar@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Mar 10, 2026 at 6:52 PM Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Tue, Mar 10, 2026 at 4:46 PM <jassisinghbrar@xxxxxxxxx> wrote:
> > > > > >
> > > > > > From: Jassi Brar <jassisinghbrar@xxxxxxxxx>
> > > > > >
> > > > > > The active_req field serves double duty as both the "is a TX in
> > > > > > flight" flag (NULL means idle) and the storage for the in-flight
> > > > > > message pointer. When a client sends NULL via mbox_send_message(),
> > > > > > active_req is set to NULL, which the framework misinterprets as
> > > > > > "no active request." This breaks the TX state machine by:
> > > > > >
> > > > > > - tx_tick() short-circuits on (!mssg), skipping the tx_done
> > > > > > callback and the tx_complete completion
> > > > > > - txdone_hrtimer() skips the channel entirely since active_req
> > > > > > is NULL, so poll-based TX-done detection never fires.
> > > > > >
> > > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means
> > > > > > "no active request," freeing NULL to be valid message data. The
> > > > > > sentinel is internal to the mailbox core and is never exposed to
> > > > > > controller drivers or clients.
> > > > > >
> > > > > > The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message
> > > > > > by clients.
> > > > > >
> > > > > > Signed-off-by: Jassi Brar <jassisinghbrar@xxxxxxxxx>
> > > > > > ---
> > > > > > drivers/mailbox/mailbox.c | 15 +++++++++------
> > > > > > 1 file changed, 9 insertions(+), 6 deletions(-)
> > > > >
> > > > > While I can certainly be corrected, I suspect this patch will break a
> > > > > lot of users.
> > > > >
> > > > > My analysis of people that were passing NULL mbox messages is that
> > > > > they _wanted_ the current behavior. They didn't want NULL messages to
> > > > > be queued up as would happen with this patch. Instead, they just
> > > > > wanted to immediately ring the doorbell again to signal an interrupt
> > > > > to the other side.
> > > > >
> > > > They won't be queued up because they call mbox_client_txdone()
> > > > immediately after mbox_send_message()
> > > > From my analysis of the 14 clients, there was just one
> > > > (irq-qcom-mpm.c) that was not doing it by oversight, and I submitted a
> > > > patch.
> > > >
> > > > Can you please point me to the driver you are concerned about? Perhaps
> > > > I am overlooking something.
> > >
> > > Ah, interesting! I hadn't thought about it that way...
> > >
> > > I think there are at least a few others that would need to change,
> > > maybe? It looks like these ones:
> > >
> > > drivers/soc/xilinx/zynqmp_power.c
> > > drivers/remoteproc/xlnx_r5_remoteproc.c
> > The underlying driver for both is drivers/mailbox/zynqmp-ipi-mailbox.c
> > which sets
> > txdone_poll = true and implements the .last_tx_done() callback.
> > The client (zynqmp_power.c) should not call mbox_client_txdone()
> >
> > > drivers/firmware/imx/imx-dsp.c
> > The underlying driver is drivers/mailbox/imx-mailbox.c which sets
> > 'txdone_irq = true'
> > and ticks the state machine with mbox_chan_txdone()
> > Again the client need not call mbox_client_txdone()
> >
> > All three clients above are currently failed by the core which
> > misinterprets NULL messages.
> > This patch will fix such clients too, besides avoiding a new api.
>
> ...but doesn't that mean that the behavior of these clients will
> change? Previously all calls to mbox_send_message() immediately called
> through to the mailbox controller. Now, if the previous message hasn't
> finished, the "NULL" messages will queue up.
>
> For instance, let's say that our client calls mbox_send_message() 2
> times in quick succession (10 us apart). Let's say that the remote
> processor takes 1 ms to react.
>
> Previously, the 2 calls to mbox_send_message() would probably be
> coalesced. Each would call through to the mailbox controller, which
> would make sure the doorbell was asserted. After 1 ms, the remote
> processor would confirm the single doorbell it saw.
>
> Now, the first call will ring the doorbell. The second one will queue
> up. After 1 ms, the remote processor will confirm the doorbell, which
> will cause the second doorbell to be sent.
>
>
> Looking at the 3 drivers in question, I _guess_ maybe that situation
> never comes up for them? So maybe they're all fine? I don't have tons
> of confidence that I understand enough about these clients to say for
> sure that this doesn't happen...
>

If multiple threads are ringing the doorbell through the same channel many
times within a very short period of time, I think the issue you are
concerned about could occur in theory with this patch even if the doorbell
ack could be ignored. If that is the case, I believe it will be clients to
handle the send failure, e.g. by retrying.

>
> > > In my case, I have a mailbox driver that's currently downstream
> > > (though I hope to change that). My mailbox controller has an interrupt
> > > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone().
> > > Some clients of this mailbox client want the txdone interrupt, but
> > > some clients of it just care about sending doorbells.
> > >
> > So imx-dsp.c like? Please let me know what is lacking in the core to
> > fully support that. Happy to look into it.
>
> I know that with my downstream mailbox client, if I let NULL messages
> queue up I end up with a queue of a dozen or so NULL messages. The
> downstream client is really taking advantage (AKA abusing) the core's
> current NULL behavior. It truly does want the "mbox_ring_doorbell"
> concept of just making sure the doorbell is asserted and returning
> immediately. If the core changes to start queuing NULL messages, we'll
> have to figure out some sort of workaround...

According to the Jassi's comment on https://lore.kernel.org/all/CABb+yY18OEvfc8DDUiZqVeQtkmwcOFCSTMT7KoXb1LVA3RuxdA@xxxxxxxxxxxxxx/
"A controller driver is supposed to either expect data or not, but not both",
client drivers should send to a controller either non-NULL or NULL data,
not both in a mixed fashion.

With this setup, if the doorbell acknowledgement from the remote could be
ignored by the controller for your case, I believe you could separate a
controller for doorbell from the one for non-doorbell and set the
controller as follows.
```
mbox_controller->txdone_poll = true;
mbox_controller->txpoll_period = 0;
mbox_controller->ops->last_tx_done = ... just returns true ...
```
Then it will dequeue the NULL messages as fast as the hrtimer's resolution
interval. Or, you could just use mbox_send_message() and
mbox_client_txdone() pair.

On the other hand, if the doorbell acknowledgement is to be respected by
the controller(this is my case), the controller should wait until the
ongoing doorbell tx is successfully done or not, and this patch could do
it whereas mbox_ring_doorbell() couldn't.