Re: [PATCH 02/13] gpu: nova-core: fsp: catch bogus queue pointer issues
From: Eliot Courtney
Date: Tue Jun 16 2026 - 23:51:35 EST
On Tue Jun 16, 2026 at 7:57 PM JST, Gary Guo wrote:
> On Tue Jun 16, 2026 at 8:57 AM BST, Alistair Popple wrote:
>> On 2026-06-16 at 03:15 +1000, Gary Guo <gary@xxxxxxxxxxx> wrote...
>>> On Mon Jun 15, 2026 at 3:40 PM BST, Eliot Courtney wrote:
>>> > Currently, `poll_msgq` will report a message of size 4 if the queue
>>> > pointers are broken. It's easy to catch this if it occurs, so have
>>> > `poll_msgq` return an error in this case.
>>> >
>>> > Signed-off-by: Eliot Courtney <ecourtney@xxxxxxxxxx>
>>> > ---
>>> > drivers/gpu/nova-core/falcon/fsp.rs | 15 +++++++++------
>>> > 1 file changed, 9 insertions(+), 6 deletions(-)
>>> >
>>> > diff --git a/drivers/gpu/nova-core/falcon/fsp.rs b/drivers/gpu/nova-core/falcon/fsp.rs
>>> > index e7419a6e71e2..21eaa8e261ce 100644
>>> > --- a/drivers/gpu/nova-core/falcon/fsp.rs
>>> > +++ b/drivers/gpu/nova-core/falcon/fsp.rs
>>> > @@ -107,19 +107,22 @@ fn read_emem(&mut self, bar: Bar0<'_>, data: &mut [u8]) -> Result {
>>> > /// Poll FSP for incoming data.
>>> > ///
>>> > /// Returns the size of available data in bytes, or 0 if no data is available.
>>> > + /// Returns an error if the queue pointers are bogus (`tail < head`).
>>> > ///
>>> > /// The FSP message queue is not circular. Pointers are reset to 0 after each
>>> > /// message exchange, so `tail >= head` is always true when data is present.
>>> > - fn poll_msgq(&self, bar: Bar0<'_>) -> u32 {
>>> > + fn poll_msgq(&self, bar: Bar0<'_>) -> Result<u32> {
>>> > let head = bar.read(regs::NV_PFSP_MSGQ_HEAD::at(0)).val();
>>> > let tail = bar.read(regs::NV_PFSP_MSGQ_TAIL::at(0)).val();
>>> >
>>> > if head == tail {
>>> > - return 0;
>>> > + Ok(0)
>>> > + } else {
>>> > + // TAIL points at the last DWORD written, so the size is `tail - head + 4`.
>>> > + tail.checked_sub(head)
>>> > + .and_then(|delta| delta.checked_add(4))
>>> > + .ok_or(EIO)
>>>
>>> Whenever we fail with this, we should print a message (actually, the same thing
>>> probably should be done for patch 1 as well).
>>>
>>> A plain EIO is going be very difficult to troubleshoot if this is ever hit.
>>
>> I don't disagree with the sentiment - this is a problem through out the kernel
>> and I have spent way too long tracing where exactly error codes have come from
>> both in C and Rust.
>>
>> But it seems odd to worry about these particular instances - they _should_
>> never happen or at least be extremely rare and very unlikely by an end-user.
>
> I think we should either consider it possible and add prints, or consider it a
> bug (hardware or driver) and add a `WARN_ON`, or consider it impossible and not
> add failure path at all.
I am new to this, but IIUC it's normal to guard against
hardware/firmware bugs - currently, for this error case, we would try to
create the `FspResponse` in `send_sync_fsp` and print an error "FSP
response too small: 4" without this patch. With this patch, we would
output "FSP response error: EIO\n". I think the latter is marginally
easier to debug, but both you would find it pretty quickly looking at
the code.
I am not that opposed to adding a print, but it seems low value to me,
since it's very unlikely this will occur and we have a print in the only
calling function right now (`send_sync_fsp`).
I think the semantics are clearer with this patch (no need to reason
about what happens when the saturating sub + add are hit).
>
>> More to the point though there are many other places in nova-core (and I'm
>> sure other drivers) where this pattern of just returning a fairly generic
>> error code exists. So it feels like it would be nicer to deal with this at some
>> other layer, eg. some kind of debug option to tag error codes with location
>> or something.
>
> This should happen whenever the information is lost. Creating a generic error
> code would be such occasion. Handling it at upper layers is okay if the
> information is kept for longer. For example:
>
> enum NovaError {
> ...
> }
>
> impl From<NovaError> for Error {
> fn from(err: NovaError) -> Self {
> // Print here
> EIO
> }
> }
>
> Would be fine to me because the information is emitted to user when it is lost
> by the concrete error -> error code conversion.
It feels very odd to me to have a From implementation have side effects
(printing) although I agree semantically with the "information is lost"
idea.
>
> Best,
> Gary
>
>>
>> So I'm not opposed to the comment, but maybe it would be better addressed as a
>> separate question/patch series to figure out how to do this error reporting in a
>> more generic or consistent way across all of Nova at least?
>>
>> - Alistair