Re: [PATCH 02/13] gpu: nova-core: fsp: catch bogus queue pointer issues

From: Alistair Popple

Date: Tue Jun 16 2026 - 03:58:31 EST


On 2026-06-16 at 03:15 +1000, Gary Guo <gary@xxxxxxxxxxx> wrote...
> On Mon Jun 15, 2026 at 3:40 PM BST, Eliot Courtney wrote:
> > Currently, `poll_msgq` will report a message of size 4 if the queue
> > pointers are broken. It's easy to catch this if it occurs, so have
> > `poll_msgq` return an error in this case.
> >
> > Signed-off-by: Eliot Courtney <ecourtney@xxxxxxxxxx>
> > ---
> > drivers/gpu/nova-core/falcon/fsp.rs | 15 +++++++++------
> > 1 file changed, 9 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/gpu/nova-core/falcon/fsp.rs b/drivers/gpu/nova-core/falcon/fsp.rs
> > index e7419a6e71e2..21eaa8e261ce 100644
> > --- a/drivers/gpu/nova-core/falcon/fsp.rs
> > +++ b/drivers/gpu/nova-core/falcon/fsp.rs
> > @@ -107,19 +107,22 @@ fn read_emem(&mut self, bar: Bar0<'_>, data: &mut [u8]) -> Result {
> > /// Poll FSP for incoming data.
> > ///
> > /// Returns the size of available data in bytes, or 0 if no data is available.
> > + /// Returns an error if the queue pointers are bogus (`tail < head`).
> > ///
> > /// The FSP message queue is not circular. Pointers are reset to 0 after each
> > /// message exchange, so `tail >= head` is always true when data is present.
> > - fn poll_msgq(&self, bar: Bar0<'_>) -> u32 {
> > + fn poll_msgq(&self, bar: Bar0<'_>) -> Result<u32> {
> > let head = bar.read(regs::NV_PFSP_MSGQ_HEAD::at(0)).val();
> > let tail = bar.read(regs::NV_PFSP_MSGQ_TAIL::at(0)).val();
> >
> > if head == tail {
> > - return 0;
> > + Ok(0)
> > + } else {
> > + // TAIL points at the last DWORD written, so the size is `tail - head + 4`.
> > + tail.checked_sub(head)
> > + .and_then(|delta| delta.checked_add(4))
> > + .ok_or(EIO)
>
> Whenever we fail with this, we should print a message (actually, the same thing
> probably should be done for patch 1 as well).
>
> A plain EIO is going be very difficult to troubleshoot if this is ever hit.

I don't disagree with the sentiment - this is a problem through out the kernel
and I have spent way too long tracing where exactly error codes have come from
both in C and Rust.

But it seems odd to worry about these particular instances - they _should_
never happen or at least be extremely rare and very unlikely by an end-user.
More to the point though there are many other places in nova-core (and I'm
sure other drivers) where this pattern of just returning a fairly generic
error code exists. So it feels like it would be nicer to deal with this at some
other layer, eg. some kind of debug option to tag error codes with location
or something.

So I'm not opposed to the comment, but maybe it would be better addressed as a
separate question/patch series to figure out how to do this error reporting in a
more generic or consistent way across all of Nova at least?

- Alistair

> Best,
> Gary
>
> > }
> > -
> > - // TAIL points at last DWORD written, so add 4 to get total size.
> > - tail.saturating_sub(head).saturating_add(4)
> > }
> >
> > /// Writes `packet` to FSP EMEM and updates the queue pointers to notify FSP.
> > @@ -154,7 +157,7 @@ pub(crate) fn send_msg(&mut self, bar: Bar0<'_>, packet: &[u8]) -> Result {
> > pub(crate) fn recv_msg(&mut self, bar: Bar0<'_>) -> Result<KVec<u8>> {
> > let result = (|| {
> > let msg_size = read_poll_timeout(
> > - || Ok(self.poll_msgq(bar)),
> > + || self.poll_msgq(bar),
> > |&size| size > 0,
> > Delta::from_millis(10),
> > Delta::from_millis(FSP_MSG_TIMEOUT_MS),
>
>