Re: [PATCH 3/6] gpu: nova-core: gsp: move boot code into local closure

From: Eliot Courtney

Date: Tue Jun 23 2026 - 01:41:03 EST

On Tue Jun 23, 2026 at 12:04 AM JST, Alexandre Courbot wrote:
> On Mon Jun 22, 2026 at 4:57 PM JST, Eliot Courtney wrote:
>> On Fri Jun 19, 2026 at 10:42 PM JST, Alexandre Courbot wrote:
>>> The next patch aims at replacing the cumbersome `BootUnloadGuard` with a
>>> more local and less intrusive mechanism to run the GSP unload sequence
>>> upon GSP boot failure. Doing so requires running the boot code in a
>>> local closure, which changes its indentation and would make other
>>> changes difficult to track in the diff. Thus, this preparatory patch
>>> moves said boot code into a local closure that is run upon construction,
>>> so the next patch does not need to re-indent code that changes.
>>>
>>> This is a mechanical preparatory patch to make the next patch easier to
>>> read. No functional change intended.
>>>
>>> Signed-off-by: Alexandre Courbot <acourbot@xxxxxxxxxx>
>>> ---
>>
>> I agree with removing BootUnloadGuard, but I think it's not great to do
>> a bunch of lifting into closures then manually handling the result. It's
>> error prone imo (we already had several bugs relating to this kind of
>> thing). Instead, what about just using ScopeGuard directly? This lets us
>> avoid lifting into closures (which is a bit noisy) and avoids manual
>> result handling for failures (which is a bit error prone). With the
>> `GspBootContext` it's fairly easy to do now:
>>
>> ```
>> let unload_guard = ScopeGuard::new_with_data(unload_bundle, |unload_bundle| {
>> let _ = gsp.unload(ctx, unload_bundle);
>> });
>> ```
>
> Yes, initially I wanted to use `ScopeGuard` but ran into issues when
> trying to make the references mutable. However it looks like you have
> been able to overcome these issues, thanks for taking the time to craft
> a patch!
>
>>
>> I confirmed that it's also compatible with the v2 of this series that
>> has the mutable Fsp - you can stash the context inside the ScopeGuard
>> data (then making a &mut reference to the stashed context for brevity)
>> or have a separate unload context type that doesn't use FSP or something
>> (could later be type parametrized along with Gsp, for example).
>>
>> For example here is a rough diff on top of this patch series (you can
>> change the Result<Option<UnloadBundle>> returns to like
>> Result<Result<UnloadBundle>> if you want to centralise teh error
>> handling of a failed unloadbundle although currently it can only fail in
>> one location):
>
> Yes, looking at it it looks like a cleaner approach than using closures.
>
> The only thing that I saw as a regression is that now each HAL needs to
> call `Gsp::unload` itself in its own `ScopeGuard`. I don't think that's
> the HAL's work - `Gsp::boot` should be the centralized point where this
> happens.
>
> But we can make both work if `unload_bundle` is passed as an output
> argument to `GspHal::boot` instead of being returned:
>
> let mut guard = ScopeGuard::new_with_data((ctx, None), ...);
> let (ctx, unload_bundle) = &mut *guard;
>
> // `unload_bundle` is a mutable reference to the
> // `Option<UnloadBundle>` in `guard`.
> hal.boot(&self, ctx, unload_bundle, &fb_layout, &wpr_meta)?;
>
> The boot method can fill `unload_bundle` early, and if it returns an
> error then the `ScopeGuard` will be able to use it. Also, and that's
> nice, the HALs don't need to use `ScopeGuard`. But output arguments
> aren't really something we expect to see in Rust.
>
> Another alternative would be to separate the unload bundle construction
> from `GspHal::boot`:
>
> let unload_bundle = hal.build_unload_bundle(ctx, ...);
> let guard = ScopeGuard::new_with_data((ctx, unload_bundle), ...);
>
> hal.boot(...)?;
>
> It removes the "1 boot -> 1 unload bundle" symmetry, but on the other
> hand it also splits concerns more clearly. And it removes the awkward
> return type of `GspHal::boot`, which come to think of it was another
> smell that things were not in the right place.
>
> The main drawback I see is that we now need to build `Vbios` twice for
> TU102, since it is needed both for the unload bundle and for booting.
> But the solution is the same as what the v2 of this series does to
> `Fsp`: the BIOS is a GPU-wide subdevice that is likely to be used
> elsewhere, not something to be confined in a GSP HAL. So I say, let's
> extract it and make it also part of `GspBootContext`.
>
> How does that sound?

I think that currently it's confusing because we have two concepts in
use with very similar names. We have "unloadbundle" meaning what we need
to run to unload the driver, and we have "unload_guard" which is for
running unwind stuff if we error. And for the most part these things are
the same, but they might not be (e.g. in my other series where we need
to keep certain DMA allocations alive just for the error path, but not
for an unload later, or when we are partially loaded). So it might be
nice to make these two things more separate.

I think that trying to build the unload bundle separately
(`build_unload_bundle`) is confusing because e.g. hal.boot() still needs
to handle unwinding its state in case of error, so it adds a strong
assumption that unloading in success is the same as unwinding in error.

If we switch to a model where the caller owns Gsp::unload, we need to
support prolonging some state until gsp.unload() finishes, e.g. in my
blackwell fixes series, we need to prolong DMA allocation in the error
case for `FmcBootArgs`. It's doable to prolong the DMA allocation
because the callee currently owns running the Gsp::unload. But we would
need to do some gymnastics if the caller owns running Gsp::unload and I
can't figure out a nice way to do this. In general, if we have partially
initialized hardware, it seems likely to me we will have to handle
keeping a certain set of resources alive until we can guarantee they are
not accessed anymore (gsp inactive being a clear marker).

Separately, I think it would be good to explicitly define the semantics
of what the state should be in after each one of these calls in
Gsp::boot. For example:

"Callees that fail should call gsp.unload(ctx, unwind_bundle) to unload
if it's required. If it succeeds, return an unload bundle."

For state that should be unwound *before* Gsp::unload, I think the HAL
methods should handle it internally (regardless of what we decide for
caller vs callee ownership of running gsp.unload). But we do have to
deal with unwind that needs to happen after gsp shutdown unfortunately.
It might be nice to make this more explicit.

So it kinda sucks to have to repeat Gsp::unload in HALs, but also I
think that whether Gsp::unload is required depends on how far we have
initialized, so HALs may not want to actually call it. So IMO it makes
some sense there too to have them decide. (e.g. if we only loaded Vbios
so far).

So concretely, IMO:
1. Keep gsp.unload() the responsibility of the callees, because they may
need to prolong state until after gsp.unload, and, maybe we are not
initialized enough to want to call gsp.unload yet and callees know
this best.
2. Through naming, explicitly differentiate between unload and unwind.

WDYT?