Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

From: Doug Anderson

Date: Tue Mar 10 2026 - 22:00:57 EST

Hi,

On Tue, Mar 10, 2026 at 5:46 PM Jassi Brar <jassisinghbrar@xxxxxxxxx> wrote:
>
> On Tue, Mar 10, 2026 at 7:15 PM Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > On Tue, Mar 10, 2026 at 4:59 PM Jassi Brar <jassisinghbrar@xxxxxxxxx> wrote:
> > >
> > > On Tue, Mar 10, 2026 at 6:52 PM Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Tue, Mar 10, 2026 at 4:46 PM <jassisinghbrar@xxxxxxxxx> wrote:
> > > > >
> > > > > From: Jassi Brar <jassisinghbrar@xxxxxxxxx>
> > > > >
> > > > > The active_req field serves double duty as both the "is a TX in
> > > > > flight" flag (NULL means idle) and the storage for the in-flight
> > > > > message pointer. When a client sends NULL via mbox_send_message(),
> > > > > active_req is set to NULL, which the framework misinterprets as
> > > > > "no active request." This breaks the TX state machine by:
> > > > >
> > > > > - tx_tick() short-circuits on (!mssg), skipping the tx_done
> > > > > callback and the tx_complete completion
> > > > > - txdone_hrtimer() skips the channel entirely since active_req
> > > > > is NULL, so poll-based TX-done detection never fires.
> > > > >
> > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means
> > > > > "no active request," freeing NULL to be valid message data. The
> > > > > sentinel is internal to the mailbox core and is never exposed to
> > > > > controller drivers or clients.
> > > > >
> > > > > The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message
> > > > > by clients.
> > > > >
> > > > > Signed-off-by: Jassi Brar <jassisinghbrar@xxxxxxxxx>
> > > > > ---
> > > > > drivers/mailbox/mailbox.c | 15 +++++++++------
> > > > > 1 file changed, 9 insertions(+), 6 deletions(-)
> > > >
> > > > While I can certainly be corrected, I suspect this patch will break a
> > > > lot of users.
> > > >
> > > > My analysis of people that were passing NULL mbox messages is that
> > > > they _wanted_ the current behavior. They didn't want NULL messages to
> > > > be queued up as would happen with this patch. Instead, they just
> > > > wanted to immediately ring the doorbell again to signal an interrupt
> > > > to the other side.
> > > >
> > > They won't be queued up because they call mbox_client_txdone()
> > > immediately after mbox_send_message()
> > > From my analysis of the 14 clients, there was just one
> > > (irq-qcom-mpm.c) that was not doing it by oversight, and I submitted a
> > > patch.
> > >
> > > Can you please point me to the driver you are concerned about? Perhaps
> > > I am overlooking something.
> >
> > Ah, interesting! I hadn't thought about it that way...
> >
> > I think there are at least a few others that would need to change,
> > maybe? It looks like these ones:
> >
> > drivers/soc/xilinx/zynqmp_power.c
> > drivers/remoteproc/xlnx_r5_remoteproc.c
> The underlying driver for both is drivers/mailbox/zynqmp-ipi-mailbox.c
> which sets
> txdone_poll = true and implements the .last_tx_done() callback.
> The client (zynqmp_power.c) should not call mbox_client_txdone()
>
> > drivers/firmware/imx/imx-dsp.c
> The underlying driver is drivers/mailbox/imx-mailbox.c which sets
> 'txdone_irq = true'
> and ticks the state machine with mbox_chan_txdone()
> Again the client need not call mbox_client_txdone()
>
> All three clients above are currently failed by the core which
> misinterprets NULL messages.
> This patch will fix such clients too, besides avoiding a new api.

...but doesn't that mean that the behavior of these clients will
change? Previously all calls to mbox_send_message() immediately called
through to the mailbox controller. Now, if the previous message hasn't
finished, the "NULL" messages will queue up.

For instance, let's say that our client calls mbox_send_message() 2
times in quick succession (10 us apart). Let's say that the remote
processor takes 1 ms to react.

Previously, the 2 calls to mbox_send_message() would probably be
coalesced. Each would call through to the mailbox controller, which
would make sure the doorbell was asserted. After 1 ms, the remote
processor would confirm the single doorbell it saw.

Now, the first call will ring the doorbell. The second one will queue
up. After 1 ms, the remote processor will confirm the doorbell, which
will cause the second doorbell to be sent.

Looking at the 3 drivers in question, I _guess_ maybe that situation
never comes up for them? So maybe they're all fine? I don't have tons
of confidence that I understand enough about these clients to say for
sure that this doesn't happen...

> > In my case, I have a mailbox driver that's currently downstream
> > (though I hope to change that). My mailbox controller has an interrupt
> > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone().
> > Some clients of this mailbox client want the txdone interrupt, but
> > some clients of it just care about sending doorbells.
> >
> So imx-dsp.c like? Please let me know what is lacking in the core to
> fully support that. Happy to look into it.

I know that with my downstream mailbox client, if I let NULL messages
queue up I end up with a queue of a dozen or so NULL messages. The
downstream client is really taking advantage (AKA abusing) the core's
current NULL behavior. It truly does want the "mbox_ring_doorbell"
concept of just making sure the doorbell is asserted and returning
immediately. If the core changes to start queuing NULL messages, we'll
have to figure out some sort of workaround...

-Doug