Re: [PATCH 02/13] gpu: nova-core: fsp: catch bogus queue pointer issues

From: Gary Guo

Date: Tue Jun 16 2026 - 06:57:58 EST


On Tue Jun 16, 2026 at 8:57 AM BST, Alistair Popple wrote:
> On 2026-06-16 at 03:15 +1000, Gary Guo <gary@xxxxxxxxxxx> wrote...
>> On Mon Jun 15, 2026 at 3:40 PM BST, Eliot Courtney wrote:
>> > Currently, `poll_msgq` will report a message of size 4 if the queue
>> > pointers are broken. It's easy to catch this if it occurs, so have
>> > `poll_msgq` return an error in this case.
>> >
>> > Signed-off-by: Eliot Courtney <ecourtney@xxxxxxxxxx>
>> > ---
>> > drivers/gpu/nova-core/falcon/fsp.rs | 15 +++++++++------
>> > 1 file changed, 9 insertions(+), 6 deletions(-)
>> >
>> > diff --git a/drivers/gpu/nova-core/falcon/fsp.rs b/drivers/gpu/nova-core/falcon/fsp.rs
>> > index e7419a6e71e2..21eaa8e261ce 100644
>> > --- a/drivers/gpu/nova-core/falcon/fsp.rs
>> > +++ b/drivers/gpu/nova-core/falcon/fsp.rs
>> > @@ -107,19 +107,22 @@ fn read_emem(&mut self, bar: Bar0<'_>, data: &mut [u8]) -> Result {
>> > /// Poll FSP for incoming data.
>> > ///
>> > /// Returns the size of available data in bytes, or 0 if no data is available.
>> > + /// Returns an error if the queue pointers are bogus (`tail < head`).
>> > ///
>> > /// The FSP message queue is not circular. Pointers are reset to 0 after each
>> > /// message exchange, so `tail >= head` is always true when data is present.
>> > - fn poll_msgq(&self, bar: Bar0<'_>) -> u32 {
>> > + fn poll_msgq(&self, bar: Bar0<'_>) -> Result<u32> {
>> > let head = bar.read(regs::NV_PFSP_MSGQ_HEAD::at(0)).val();
>> > let tail = bar.read(regs::NV_PFSP_MSGQ_TAIL::at(0)).val();
>> >
>> > if head == tail {
>> > - return 0;
>> > + Ok(0)
>> > + } else {
>> > + // TAIL points at the last DWORD written, so the size is `tail - head + 4`.
>> > + tail.checked_sub(head)
>> > + .and_then(|delta| delta.checked_add(4))
>> > + .ok_or(EIO)
>>
>> Whenever we fail with this, we should print a message (actually, the same thing
>> probably should be done for patch 1 as well).
>>
>> A plain EIO is going be very difficult to troubleshoot if this is ever hit.
>
> I don't disagree with the sentiment - this is a problem through out the kernel
> and I have spent way too long tracing where exactly error codes have come from
> both in C and Rust.
>
> But it seems odd to worry about these particular instances - they _should_
> never happen or at least be extremely rare and very unlikely by an end-user.

I think we should either consider it possible and add prints, or consider it a
bug (hardware or driver) and add a `WARN_ON`, or consider it impossible and not
add failure path at all.

> More to the point though there are many other places in nova-core (and I'm
> sure other drivers) where this pattern of just returning a fairly generic
> error code exists. So it feels like it would be nicer to deal with this at some
> other layer, eg. some kind of debug option to tag error codes with location
> or something.

This should happen whenever the information is lost. Creating a generic error
code would be such occasion. Handling it at upper layers is okay if the
information is kept for longer. For example:

enum NovaError {
...
}

impl From<NovaError> for Error {
fn from(err: NovaError) -> Self {
// Print here
EIO
}
}

Would be fine to me because the information is emitted to user when it is lost
by the concrete error -> error code conversion.

Best,
Gary

>
> So I'm not opposed to the comment, but maybe it would be better addressed as a
> separate question/patch series to figure out how to do this error reporting in a
> more generic or consistent way across all of Nova at least?
>
> - Alistair