Re: [PATCH 03/13] gpu: nova-core: fsp: try to enforce exclusive access to FSP channel

From: Eliot Courtney

Date: Tue Jun 16 2026 - 22:56:49 EST

On Tue Jun 16, 2026 at 2:16 AM JST, Gary Guo wrote:
> On Mon Jun 15, 2026 at 3:40 PM BST, Eliot Courtney wrote:
>> Currently, `send_msg` assumes that the channel to FSP is free to write
>> into. But, it might not be. Both the kernel driver and GSP communicate
>> with FSP. The way they should attempt to keep exclusive access to this
>> channel to FSP is by making sure they don't try to start writing if
>> there's pending data until the full round trip has finished.
>>
>> Signed-off-by: Eliot Courtney <ecourtney@xxxxxxxxxx>
>> ---
>> drivers/gpu/nova-core/falcon/fsp.rs | 23 +++++++++++++++++++++++
>> 1 file changed, 23 insertions(+)
>>
>> diff --git a/drivers/gpu/nova-core/falcon/fsp.rs b/drivers/gpu/nova-core/falcon/fsp.rs
>> index 21eaa8e261ce..cdb476894e1a 100644
>> --- a/drivers/gpu/nova-core/falcon/fsp.rs
>> +++ b/drivers/gpu/nova-core/falcon/fsp.rs
>> @@ -125,6 +125,26 @@ fn poll_msgq(&self, bar: Bar0<'_>) -> Result<u32> {
>> }
>> }
>>
>> + /// Both the kernel driver and GSP talk to FSP. Try to ensure exclusive access to the FSP is
>> + /// enforced by making sure there is not a pending message already sent to FSP, and that there
>> + /// is no pending message from FSP to be read.
>> + fn wait_until_ready(&mut self, bar: Bar0<'_>) -> Result {
>> + read_poll_timeout(
>> + || {
>> + let qhead = bar.read(regs::NV_PFSP_QUEUE_HEAD::at(0)).address();
>> + let qtail = bar.read(regs::NV_PFSP_QUEUE_TAIL::at(0)).address();
>> + let mhead = bar.read(regs::NV_PFSP_MSGQ_HEAD::at(0)).val();
>> + let mtail = bar.read(regs::NV_PFSP_MSGQ_TAIL::at(0)).val();
>
> How does this prevent race between kernel and GSP when initiating FSP
> communication?

You are right that it does not prevent it in the general case.
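To make the gap concrete, here is a minimal userspace sketch. It assumes a
simplified model in which the four queue pointers are plain integers and
"idle" means both head/tail pairs are equal, matching the predicate in the
patch. `FspQueues` and `racy_interleaving` are hypothetical names, not
nova-core code. The check and the following EMEM write are separate steps,
so nothing stops both agents from seeing an idle channel before either
one writes.

```rust
/// Simplified, hypothetical model of the four FSP queue pointers.
#[derive(Default)]
struct FspQueues {
    qhead: u32, // command queue head
    qtail: u32, // command queue tail
    mhead: u32, // message queue head
    mtail: u32, // message queue tail
}

impl FspQueues {
    /// Same predicate as the patch: both queues are empty.
    fn is_idle(&self) -> bool {
        self.qhead == self.qtail && self.mhead == self.mtail
    }
}

/// Replays one interleaving where both agents check before either writes.
/// Returns (kernel check passed, GSP check passed, final EMEM contents).
fn racy_interleaving() -> (bool, bool, &'static str) {
    let q = FspQueues::default(); // all pointers 0: channel idle
    let mut emem = "";

    // 1. Kernel polls and sees an idle channel.
    let kernel_ok = q.is_idle();
    // 2. GSP polls before the kernel has written anything: still idle.
    let gsp_ok = q.is_idle();

    // 3. Both proceed to write; the second write clobbers the first.
    if kernel_ok {
        emem = "kernel cmd";
    }
    if gsp_ok {
        emem = "gsp cmd";
    }
    (kernel_ok, gsp_ok, emem)
}

fn main() {
    let (kernel_ok, gsp_ok, emem) = racy_interleaving();
    println!("kernel ok: {kernel_ok}, gsp ok: {gsp_ok}, EMEM holds: {emem:?}");
}
```

Both checks pass and the kernel's command is silently overwritten, which
is the window the question is pointing at.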

This is the logic that openrm uses, and much earlier, when testing
locally, I observed that I sometimes did need this for reprobe to work.
I think the reason (though it's been a while) is that we previously did
not wait for GSP to halt on unload and on probe failure, which could
leave a message from FSP unconsumed if you reprobed quickly after a
failure. Now that we wait for GSP to halt, this kind of issue is much
harder to hit. In fact, it may currently be impossible for this kind of
race to happen without a failure on unload (e.g. a timeout while waiting
for GSP to reset).
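In that reprobe case the check helps by waiting for in-flight traffic to
drain, not by excluding a concurrent writer. Below is a rough userspace
stand-in for the polling in the patch. It assumes a leftover reply shows
up as `mhead != mtail` and that some other agent eventually consumes it;
`poll_until_idle`, `simulate`, and the pointer values are hypothetical,
and this is not the kernel `read_poll_timeout` API.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum WaitError {
    TimedOut,
}

/// Userspace stand-in for a poll-with-timeout loop: calls `is_idle` every
/// `interval` until it returns true or `timeout` has elapsed.
fn poll_until_idle(
    mut is_idle: impl FnMut() -> bool,
    interval: Duration,
    timeout: Duration,
) -> Result<(), WaitError> {
    let start = Instant::now();
    loop {
        if is_idle() {
            return Ok(());
        }
        if start.elapsed() >= timeout {
            return Err(WaitError::TimedOut);
        }
        sleep(interval);
    }
}

/// Hypothetical stale reply left over from a failed probe (mhead != mtail).
/// Another agent consumes it on poll number `drain_on`; `None` means never.
/// Returns the wait result and how many polls were made.
fn simulate(drain_on: Option<u32>, timeout: Duration) -> (Result<(), WaitError>, u32) {
    let mhead = 4u32;
    let mut mtail = 0u32;
    let mut polls = 0u32;
    let res = poll_until_idle(
        || {
            polls += 1;
            if Some(polls) == drain_on {
                mtail = mhead; // reply consumed: message queue empty again
            }
            mhead == mtail
        },
        Duration::from_millis(1),
        timeout,
    );
    (res, polls)
}

fn main() {
    println!("{:?}", simulate(Some(3), Duration::from_secs(1)));
    println!("{:?}", simulate(None, Duration::from_millis(5)));
}
```

If nothing ever consumes the stale reply, the wait fails with a timeout
instead of writing over it, much like the patch propagating the error
from `read_poll_timeout` with `?`.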

My main reason for adding this is to follow what appears to be the
protocol for this transport (even if that protocol is not sound in
general). I'm curious whether others think it's worth keeping. IMO it
is, since it does appear to be part of how communication over this
transport is meant to work.
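The protocol here is roughly "only start a new exchange once the previous
round trip has fully completed". The trace below walks one exchange under
an assumed simplification of the register semantics: the command queue
pointers move when a command is written or consumed, the message queue
pointers move when a reply is posted or read, and FSP consuming the
command and posting its reply are collapsed into one step. `Queues` and
`round_trip` are hypothetical names. Under that model the patch's
predicate only reports ready at the two ends of the exchange.

```rust
/// Simplified, hypothetical model of the FSP queue pointers.
struct Queues {
    qhead: u32,
    qtail: u32,
    mhead: u32,
    mtail: u32,
}

impl Queues {
    /// The patch's readiness predicate: no unconsumed command, no unread reply.
    fn ready(&self) -> bool {
        self.qhead == self.qtail && self.mhead == self.mtail
    }
}

/// Walks one request/reply exchange and records whether the channel looks ready.
fn round_trip() -> Vec<(&'static str, bool)> {
    let mut q = Queues { qhead: 0, qtail: 0, mhead: 0, mtail: 0 };
    let mut trace = Vec::new();

    trace.push(("idle", q.ready()));
    q.qtail = 4; // sender writes a command
    trace.push(("command pending", q.ready()));
    q.qhead = q.qtail; // FSP consumes the command...
    q.mhead = 4; // ...and posts a reply (collapsed into one step)
    trace.push(("reply pending", q.ready()));
    q.mtail = q.mhead; // sender reads the reply
    trace.push(("round trip done", q.ready()));
    trace
}

fn main() {
    for (step, ready) in round_trip() {
        println!("{step:>16}: ready = {ready}");
    }
}
```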

At the very least the comments could be better: my hedged wording "try
to ensure" is confusing, because the check really doesn't ensure
exclusive access in the general case.

>
> Best,
> Gary
>
>> +
>> + Ok(qhead == qtail && mhead == mtail)
>> + },
>> + |&ready| ready,
>> + Delta::from_millis(10),
>> + Delta::from_millis(FSP_MSG_TIMEOUT_MS),
>> + )?;
>> + Ok(())
>> + }
>> +
>> /// Writes `packet` to FSP EMEM and updates the queue pointers to notify FSP.
>> ///
>> /// Returns `EINVAL` if `packet` is empty or its length is not 4-byte aligned.
>> @@ -133,6 +153,9 @@ pub(crate) fn send_msg(&mut self, bar: Bar0<'_>, packet: &[u8]) -> Result {
>> return Err(EINVAL);
>> }
>>
>> + // Try to make sure we have exclusive access to the FSP at this point.
>> + self.wait_until_ready(bar)?;
>> +
>> self.write_emem(bar, packet)?;
>>
>> // Update queue pointers. TAIL points at the last DWORD written.