Re: [PATCH v10 07/21] gpu: nova-core: mm: Add TLB flush support

From: Joel Fernandes

Date: Mon Apr 06 2026 - 16:52:35 EST


On Thu, 02 Apr 2026 14:49:05 +0900, Eliot Courtney wrote:

> > + /// TLB flush serialization lock: This lock is acquired during the
> > + /// DMA fence signalling critical path. It must NEVER be held across any
> > + /// reclaimable CPU memory allocations because the memory reclaim path can
> > + /// call `dma_fence_wait()`, which would deadlock with this lock held.
> > + #[pin]
>
> This comment says that the lock is acquired during the DMA fence
> signalling critical path, but IIUC we don't have anything like that
> right now. Is this based on future yet to be done work? Can we reword
> this in a way so it makes sense in the current state?

Good point, will reword. I'll keep the references indicating that this design
is intended for the DMA fence signalling path, but will refine the wording so
it doesn't suggest that such a use case already exists. Thanks!

> > + /// This invalidates all TLB entries associated with the given PDB address.
> > + /// Must be called after modifying page table entries to ensure the GPU sees
> > + /// the updated mappings.
>
> If this must be called after every operation like that, I wonder if we
> can change the design to require a guard like pattern something to
> ensure flush is called. WDYT?

That's a good thought. I looked into this -- currently there are only two
call sites (execute_map and unmap_pages in vmm.rs), and both follow the
same pattern of dropping the PRAMIN window lock before flushing. A RAII
guard would need careful lifetime management to maintain this lock
ordering, and error propagation from Drop is tricky since Drop::drop()
cannot return a Result. With just two call sites, the explicit flush
calls are clearer and easier to audit. If more call sites emerge in the
future, we can revisit adding a guard pattern then.
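To make the error-propagation problem concrete: a Drop-based guard cannot return a Result, so a failed flush could only be logged or ignored at the point of drop. A minimal plain-Rust sketch (all types and names here are hypothetical stand-ins, not the actual nova-core ones):

```rust
use std::cell::Cell;

// Hypothetical stand-in for the TLB flush interface; records whether
// flush() ran so the guard's behavior is observable.
struct Tlb {
    flushed: Cell<bool>,
}

impl Tlb {
    fn flush(&self, _pdb_addr: u64) -> Result<(), &'static str> {
        self.flushed.set(true);
        Ok(())
    }
}

// RAII guard that flushes on drop. Drop::drop() returns (), so any
// error from flush() is swallowed here rather than propagated to the
// caller -- the core objection to the guard pattern.
struct FlushGuard<'a> {
    tlb: &'a Tlb,
    pdb_addr: u64,
}

impl Drop for FlushGuard<'_> {
    fn drop(&mut self) {
        // A flush timeout cannot be returned from drop().
        let _ = self.tlb.flush(self.pdb_addr);
    }
}
```

With explicit flush calls, `?` propagates a flush timeout to the caller; a guard would have to either swallow it (as above) or panic.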

> > + pub(crate) fn flush(&self, pdb_addr: VramAddress) -> Result {
>
> Hopefully we don't need to be calling flush() from anywhere in the
> entire crate. Can you tighten the visibility here and in other places?
> Many things seem to be pub(crate) that don't need to be.

Agreed, will tighten flush() to pub(super) and do a broader visibility
audit across the series. Thanks!
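For reference, the difference in scope: pub(crate) exposes an item to the entire crate, while pub(super) restricts it to the parent module, which is enough for the vmm.rs call sites. A toy sketch with made-up module names (not the actual nova-core layout):

```rust
// Module names are illustrative only.
mod mm {
    pub mod tlb {
        // Visible to `mm` (e.g. a sibling vmm module), but not to the
        // rest of the crate.
        pub(super) fn flush() -> u32 {
            42
        }
    }

    pub fn vmm_unmap() -> u32 {
        // OK: `mm` is the parent module of `tlb`.
        tlb::flush()
    }
}

// fn elsewhere() { mm::tlb::flush(); } // error: not visible outside `mm`
```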

> > + read_poll_timeout(
> > + || Ok(bar.read(regs::NV_TLB_FLUSH_CTRL)),
> > + |ctrl: &regs::NV_TLB_FLUSH_CTRL| !ctrl.enable(),
> > + Delta::ZERO,
> > + Delta::from_secs(2),
> > + )?;
>
> This has zero delay on the read_poll_timeout - what about adding a small
> delay of a microsecond or so?

I think Delta::ZERO is fine here -- read_poll_timeout still calls
cpu_relax() on each iteration, so it is not a pure busy-spin. This matches
what other DRM drivers do for such polls; e.g. lima_mmu.c and
panfrost_gpu.c both use read_poll_timeout with sleep_us=0 in some places.
I also measured the flush on GA102 and it takes about 1-1.5us, which is
well within busy-loop territory. From my experience with the scheduler,
sleeps that short end up much longer than 1us in practice due to
scheduling latency. I also checked OpenRM, and even there it is a
sleepless spin loop.
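To make the tradeoff concrete, here is a plain-Rust sketch of a zero-sleep poll loop with a CPU relax hint (std::hint::spin_loop() standing in for the kernel's cpu_relax(); the function and names are illustrative, not the kernel read_poll_timeout API):

```rust
use std::time::{Duration, Instant};

// Poll `done` with no sleep between reads, bounded by `timeout`.
// Each iteration issues a CPU relax hint instead of sleeping: for a
// hardware wait measured in ~1-1.5us, scheduler wakeup latency from
// even the shortest sleep would dominate the wait itself.
fn poll_until<F: FnMut() -> bool>(
    mut done: F,
    timeout: Duration,
) -> Result<(), &'static str> {
    let start = Instant::now();
    loop {
        if done() {
            return Ok(());
        }
        if start.elapsed() >= timeout {
            return Err("timeout");
        }
        // Hint the CPU that we are spinning; no sleep between reads.
        std::hint::spin_loop();
    }
}
```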

thanks,

--
Joel Fernandes