Re: [PATCH v2 4/5] gpu: nova-core: send UNLOADING_GUEST_DRIVER GSP command upon unloading
From: Alexandre Courbot
Date: Tue Apr 21 2026 - 10:32:48 EST
On Tue Apr 21, 2026 at 6:42 PM JST, Eliot Courtney wrote:
> On Tue Apr 21, 2026 at 3:16 PM JST, Alexandre Courbot wrote:
>> Currently, the GSP is left running after the driver is unbound. This is
>> not great for several reasons, notably that it can still access shared
>> memory areas that the kernel will now reclaim (especially problematic on
>> setups without an IOMMU).
>>
>> Fix this by sending the `UNLOADING_GUEST_DRIVER` GSP command when
>> unbinding. This stops the GSP and lets us proceed with the rest of the
>> unbind sequence in the next patch.
>>
>> Signed-off-by: Alexandre Courbot <acourbot@xxxxxxxxxx>
>> ---
>> drivers/gpu/nova-core/gpu.rs | 5 +++
>> drivers/gpu/nova-core/gsp/boot.rs | 40 +++++++++++++++++++++++
>> drivers/gpu/nova-core/gsp/commands.rs | 36 ++++++++++++++++++++
>> drivers/gpu/nova-core/gsp/fw.rs | 4 +++
>> drivers/gpu/nova-core/gsp/fw/commands.rs | 23 +++++++++++++
>> drivers/gpu/nova-core/gsp/fw/r570_144/bindings.rs | 8 +++++
>> 6 files changed, 116 insertions(+)
>>
>> diff --git a/drivers/gpu/nova-core/gpu.rs b/drivers/gpu/nova-core/gpu.rs
>> index 1701c2600538..8f2ae9e8a519 100644
>> --- a/drivers/gpu/nova-core/gpu.rs
>> +++ b/drivers/gpu/nova-core/gpu.rs
>> @@ -277,12 +277,17 @@ pub(crate) fn new<'a>(
>>
>> /// Called when the corresponding [`Device`](device::Device) is unbound.
>> ///
>> + /// Prepares the GPU for unbinding by shutting down the GSP and unregistering the sysmem flush
>> + /// memory page.
>> + ///
>> /// Note: This method must only be called from `Driver::unbind`.
>> pub(crate) fn unbind(&self, dev: &device::Device<device::Core>) {
>> let Ok(bar) = kernel::warn_on_err!(self.bar.access(dev)) else {
>> return;
>> };
>>
>> + let _ = kernel::warn_on_err!(self.gsp.unload(dev, bar, &self.gsp_falcon));
>> +
>
> If I remember correctly, at least on blackwell, doing the full unloading
> procedure here actually resets the sysmem flush register, so you get a
> spurious warning. In my local branch I actually swapped the order of
> this and unregister to get rid of it (not sure if this is correct though).
> My sysmem flush patch that skips printing the warning if the value is 0
> would also fix this, if we care. Have you noticed this happening too?
I haven't - this patch works fine and without any warning on my
Blackwell card. But that deserves further investigation, so let's
revisit once we add the Blackwell series and your own patch, as this
series only supports Turing/Ampere for now.
>
>> self.sysmem_flush.unregister(bar);
>> }
>> }
>> diff --git a/drivers/gpu/nova-core/gsp/boot.rs b/drivers/gpu/nova-core/gsp/boot.rs
>> index 18f356c9178e..3f4e99b2497b 100644
>> --- a/drivers/gpu/nova-core/gsp/boot.rs
>> +++ b/drivers/gpu/nova-core/gsp/boot.rs
>> @@ -33,6 +33,7 @@
>> },
>> gpu::Chipset,
>> gsp::{
>> + cmdq::Cmdq,
>> commands,
>> sequencer::{
>> GspSequencer,
>> @@ -237,4 +238,43 @@ pub(crate) fn boot(
>>
>> Ok(())
>> }
>> +
>> + /// Shut down the GSP and wait until it is offline.
>> + fn shutdown_gsp(
>> + cmdq: &Cmdq,
>> + bar: &Bar0,
>> + gsp_falcon: &Falcon<Gsp>,
>> + suspend: bool,
>> + ) -> Result<()> {
>> + // Send command to shutdown GSP and wait for response.
>> + cmdq.send_command(bar, commands::UnloadingGuestDriver::new(suspend))?;
>> +
>> + // Wait until GSP signals it is suspended.
>> + const LIBOS_INTERRUPT_PROCESSOR_SUSPENDED: u32 = 0x8000_0000;
>
> If this can change based on firmware, should it be taken in via
> bindings? I also noticed in openrm 595, this is waited on by checking the
> bit rather than by strict equality (see _kgspIsProcessorSuspended). So
> it may be more defensive to check the bit rather than strict equality
> (even though that is correct for 570 according to openrm code).
Indeed, in 570.144 the code is actually
    return (mailbox == 0x80000000);
... but I checked against `main` and it now tests the bit as you
described, so testing the bit is probably the better option.
The value is also a file-local constant, so not something we can get
through bindings unfortunately. :/ But I suspect we can rely on it being
stable.
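For reference, the bit test could look roughly like the sketch below. It
is standalone and not the actual nova-core change: the constant value
comes from this patch, while `is_processor_suspended` is a hypothetical
helper name that I made up for illustration.

```rust
/// Value signalled in mailbox 0 once the GSP is suspended (taken from the patch).
const LIBOS_INTERRUPT_PROCESSOR_SUSPENDED: u32 = 0x8000_0000;

/// Hypothetical helper: returns `true` if the "processor suspended" bit is
/// set, regardless of the state of the other bits of the mailbox.
fn is_processor_suspended(mb0: u32) -> bool {
    (mb0 & LIBOS_INTERRUPT_PROCESSOR_SUSPENDED) != 0
}
```

The polling condition would then call this helper instead of comparing
`mb0` against the constant for strict equality.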
>
>> + read_poll_timeout(
>> + || Ok(gsp_falcon.read_mailbox0(bar)),
>> + |&mb0| mb0 == LIBOS_INTERRUPT_PROCESSOR_SUSPENDED,
>> + Delta::from_millis(10),
>> + Delta::from_secs(5),
>> + )
>> + .map(|_| ())
>> + }
>> +
>> + /// Attempts to unload the GSP firmware.
>> + ///
>> + /// This stops all activity on the GSP.
>> + pub(crate) fn unload(
>> + &self,
>> + dev: &device::Device<device::Bound>,
>> + bar: &Bar0,
>> + gsp_falcon: &Falcon<Gsp>,
>> + ) -> Result {
>> + // Shut down the GSP.
>> +
>> + Self::shutdown_gsp(&self.cmdq, bar, gsp_falcon, false)
>> + .inspect_err(|e| dev_err!(dev, "unload guest driver failed: {:?}", e))?;
>
> It looks like "suspend" is only ever false here? Will this be used
> later? If we want to keep this, it may be nice to use a 2 discriminant
> enum so we don't have misc boolean parameters hanging around.
Yes, it is in anticipation of suspend/resume support. Agreed about the
enum.
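Something along these lines, maybe. This is only a standalone sketch;
the `UnloadMode` type, its variant names and the `in_pm_transition`
method are hypothetical and not what will necessarily land in the next
revision.

```rust
/// Hypothetical two-variant replacement for the `suspend: bool` parameter.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum UnloadMode {
    /// Destructive unload, as done when the driver is unbound.
    Unbind,
    /// Unload performed as part of a suspend transition.
    Suspend,
}

impl UnloadMode {
    /// Value for the `bInPMTransition` field of the command payload.
    fn in_pm_transition(self) -> u8 {
        u8::from(self == UnloadMode::Suspend)
    }
}
```

Call sites would then read `UnloadMode::Unbind` rather than a bare
`false`.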
>
> nit: dev_err! should have \n?
It should!
>
>> + dev_dbg!(dev, "GSP shut down\n");
>> +
>> + Ok(())
>> + }
>> }
>> diff --git a/drivers/gpu/nova-core/gsp/commands.rs b/drivers/gpu/nova-core/gsp/commands.rs
>> index c80df421702c..fb94460c451e 100644
>> --- a/drivers/gpu/nova-core/gsp/commands.rs
>> +++ b/drivers/gpu/nova-core/gsp/commands.rs
>> @@ -237,3 +237,39 @@ pub(crate) fn gpu_name(&self) -> core::result::Result<&str, GpuNameError> {
>> pub(crate) fn get_gsp_info(cmdq: &Cmdq, bar: &Bar0) -> Result<GetGspStaticInfoReply> {
>> cmdq.send_command(bar, GetGspStaticInfo)
>> }
>> +
>> +pub(crate) struct UnloadingGuestDriver {
>> + suspend: bool,
>> +}
>
> This feels like it only makes sense to call from within the gsp module,
> so I wonder if it can be pub(super) (prolly a few others in this file
> could be too, ofc not relevant for this series).
I'll review that, we do want to limit visibility as much as possible.
>
> nit: Should this have doc comment?
Yep, I'll add that.
>
>> +
>> +impl UnloadingGuestDriver {
>> + pub(crate) fn new(suspend: bool) -> Self {
>> + Self { suspend }
>> + }
>> +}
>> +
>> +impl CommandToGsp for UnloadingGuestDriver {
>> + const FUNCTION: MsgFunction = MsgFunction::UnloadingGuestDriver;
>> + type Command = fw::commands::UnloadingGuestDriver;
>> + type Reply = UnloadingGuestDriverReply;
>> + type InitError = Infallible;
>> +
>> + fn init(&self) -> impl Init<Self::Command, Self::InitError> {
>> + fw::commands::UnloadingGuestDriver::new(self.suspend)
>> + }
>> +}
>> +
>> +pub(crate) struct UnloadingGuestDriverReply;
>> +
>> +impl MessageFromGsp for UnloadingGuestDriverReply {
>> + const FUNCTION: MsgFunction = MsgFunction::UnloadingGuestDriver;
>> + type InitError = Infallible;
>> + type Message = ();
>> +
>> + fn read(
>> + _msg: &Self::Message,
>> + _sbuffer: &mut SBufferIter<array::IntoIter<&[u8], 2>>,
>> + ) -> Result<Self, Self::InitError> {
>> + Ok(UnloadingGuestDriverReply)
>> + }
>> +}
>> diff --git a/drivers/gpu/nova-core/gsp/fw.rs b/drivers/gpu/nova-core/gsp/fw.rs
>> index 0c8a74f0e8ac..59b4c4883185 100644
>> --- a/drivers/gpu/nova-core/gsp/fw.rs
>> +++ b/drivers/gpu/nova-core/gsp/fw.rs
>> @@ -278,6 +278,7 @@ pub(crate) enum MsgFunction {
>> Nop = bindings::NV_VGPU_MSG_FUNCTION_NOP,
>> SetGuestSystemInfo = bindings::NV_VGPU_MSG_FUNCTION_SET_GUEST_SYSTEM_INFO,
>> SetRegistry = bindings::NV_VGPU_MSG_FUNCTION_SET_REGISTRY,
>> + UnloadingGuestDriver = bindings::NV_VGPU_MSG_FUNCTION_UNLOADING_GUEST_DRIVER,
>>
>> // Event codes
>> GspInitDone = bindings::NV_VGPU_MSG_EVENT_GSP_INIT_DONE,
>> @@ -322,6 +323,9 @@ fn try_from(value: u32) -> Result<MsgFunction> {
>> Ok(MsgFunction::SetGuestSystemInfo)
>> }
>> bindings::NV_VGPU_MSG_FUNCTION_SET_REGISTRY => Ok(MsgFunction::SetRegistry),
>> + bindings::NV_VGPU_MSG_FUNCTION_UNLOADING_GUEST_DRIVER => {
>> + Ok(MsgFunction::UnloadingGuestDriver)
>> + }
>>
>> // Event codes
>> bindings::NV_VGPU_MSG_EVENT_GSP_INIT_DONE => Ok(MsgFunction::GspInitDone),
>> diff --git a/drivers/gpu/nova-core/gsp/fw/commands.rs b/drivers/gpu/nova-core/gsp/fw/commands.rs
>> index db46276430be..71c8690c9322 100644
>> --- a/drivers/gpu/nova-core/gsp/fw/commands.rs
>> +++ b/drivers/gpu/nova-core/gsp/fw/commands.rs
>> @@ -129,3 +129,26 @@ unsafe impl AsBytes for GspStaticConfigInfo {}
>> // SAFETY: This struct only contains integer types for which all bit patterns
>> // are valid.
>> unsafe impl FromBytes for GspStaticConfigInfo {}
>> +
>> +/// Payload of the `UnloadingGuestDriver` command and message.
>> +#[repr(transparent)]
>> +#[derive(Clone, Copy, Debug, Zeroable)]
>> +pub(crate) struct UnloadingGuestDriver(bindings::rpc_unloading_guest_driver_v1F_07);
>> +
>> +impl UnloadingGuestDriver {
>> + pub(crate) fn new(suspend: bool) -> Self {
>> + Self(bindings::rpc_unloading_guest_driver_v1F_07 {
>> + bInPMTransition: u8::from(suspend),
>> + bGc6Entering: 0,
>> + newLevel: if suspend { 3 } else { 0 },
>
> Why '3'? Is there a binding that it makes sense to use for this?
It's for suspend level 3 (suspend to RAM) if `suspend` is true, or
normal destructive unloading otherwise. OpenRM has a set of possible
values (`NV2080_CTRL_GPU_SET_POWER_STATE_GPU_LEVEL_*`) that directly
translate to the corresponding number; using them at least limits the
possible values to the valid set. I'll add an enum and the corresponding
bindings.
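A rough standalone sketch of what I have in mind follows. The
`PowerLevel` type and its variant names are hypothetical, only the two
levels used by this patch are listed, the `u32` representation is an
assumption, and the real version would take its values from the
generated bindings instead of literals.

```rust
/// Hypothetical enum for the `newLevel` field. Only the two levels used by
/// this patch are listed; the literals stand in for the OpenRM constants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u32)]
enum PowerLevel {
    /// Normal destructive unloading.
    Level0 = 0,
    /// Suspend to RAM.
    Level3 = 3,
}

impl PowerLevel {
    /// Picks the level matching the `suspend` flag used in the patch.
    fn from_suspend(suspend: bool) -> Self {
        if suspend {
            PowerLevel::Level3
        } else {
            PowerLevel::Level0
        }
    }
}
```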
Thanks for the review!