Re: [PATCH v10 12/21] gpu: nova-core: mm: Add unified page table entry wrapper enums

From: Gary Guo

Date: Thu Apr 09 2026 - 07:02:57 EST

On Wed Apr 8, 2026 at 9:19 PM BST, Joel Fernandes wrote:
> Hi Alex, Eliot, Danilo,
>
> Thanks for taking a look. Let me respond to the specific points below.
>
> On Wed, 08 Apr 2026, Alexandre Courbot wrote:
>> After a quick look I'd say that having a trait here would actually be
>> *good* for correctness and maintainability.
>>
>> The current design implies that every operation on a page table (most
>> likely using the walker) goes through a branching point. Just looking at
>> `PtWalk::read_pte_at_level`, there are already at least 6
>> `if version == 2 { } else { }` branches that all resolve to the same
>> result. Include walking down the PDEs and you have at least a dozen of
>> these just to resolve a virtual address. I know CPUs are fast, but this
>> is still wasted cycles for no good reason.
>
> I did some measurements and there is no notieceable difference in both
> approaches. I ran perf and loaded nova with self-tests running. The extra
> potential branching is lost in the noise. In both cases, loading nova and
> running the self-tests has ~119.7M branch instructions on my Ampere. The total
> instruction count is also identical (~615M).
>
> I measured like this:
> perf stat -e
> branches,branch-misses,cache-references,cache-misses,instructions,cycles --
> modprobe nova_core
>
> So I think the branching argument is not a strong one. I also did more
> measurements and the dominant time taken is MMIO. During the map prep and
> execute, page table walks are done. A TLB flush alone costs ~1.4 microseconds.
> And PRAMIN BAR0 writes to write the PTE is also about 1 microsecond. Considering
> this, I don't think the extra branching argument holds (even without branch
> prediction and speculation).
>
> Also some branches cannot be eliminated even with parameterization:
>
> if level == self.mmu_version.dual_pde_level() {
> // 128-bit dual PDE read
> } else {
> // Regular 64-bit PDE read
> }
>
> This isn't really a version branch -- it's a structural branch that
> distinguishes between 64-bit PDE and 128-bit dual PDE entries. Any MMU
> version with a dual PDE level would need this same distinction.
>
> I also did code-generation size analysis (see diff of code used below):
>
> Code generation analysis:
>
> Module .ko size: Before: 511,792 bytes After: 524,464 bytes (+2.5%)
> .text section: Before: 112,620 bytes After: 116,628 bytes (+4,008 bytes)
>
> The +4K .text growth is the monomorphization cost: every generic function
> is compiled twice (once for MmuV2, once for MmuV3).
>
>> If you use a trait here, and make `PtWalk` generic against it, you can
>> optimize this away. We had a similar situation when we introduced Turing
>> support and the v2 ucode header, and tried both approaches: the
>> trait-based one was slightly shorter, and arguably more readable.
>
> Actually I was the one who suggested traits for Falcon ucode descriptor if you
> see this thread [1]. So basically you and Eliot are telling me to do what I
> suggested in [1]. :-) However, I disagree that it is the right choice for this code.
>
> [1] https://lore.kernel.org/all/20251117231028.GA1095236@joelbox2/
>
> I think the two cases are quite different in complexity:
>
> The falcon ucode descriptor is essentially a set of flat field accessors
> and a few params (imem_sec_load_params, dmem_load_params).
> The trait has ~10 simple getter methods. There's no multi-level hierarchy,
> no walker, and no generic propagation.
>
> The MMU page table case is structurally different. Making PtWalk generic
> over an Mmu trait would require:
>
> - PtWalk<M: Mmu> (the walker)
> - Plus all the associated types: M::Pte, M::Pde, M::DualPde each
> needing their own trait bounds
>
> And we would also need:
> - Vmm<M: Mmu> (which creates PtWalk)
> - BarUser<M: Mmu> (which creates Vmm)
>
> I am also against making Vmm an enum as Eliot suggested:
> enum Vmm {
> V2(VmmInner<MmuV2>),
> V3(VmmInner<MmuV3>),
> }
>
> That moves the version complexity up to the reader. Code complexity IMO should
> decrease as we go up abstractions, making it easier for users (Vmm/Bar).
>
> If you look at the the changes in vmm.rs to handle version dispatch there [2]:
> Added: +109
> Removed: -28
>
> [2]
> https://github.com/Edgeworth/linux/commit/3627af550b61256184d589e7ec666c1108971f0e
>
> The main benefit of my approach is version-specific dispatch complexity is
> completely isolated inside MmuVersion thus making the code outside of
> pagetable.rs much more readable, without having to parametrize anything, and
> without code size increase. I think that is worth considering.
>
>> But the main argument to use a trait here IMO is that it enables
>> associated types and constants. That's particularly critical since some
>> equivalent fields have different lengths between v2 and v3. An
>> associated `Bounded` type for these would force the caller to validate
>> the length of these fields before calling a non-fallible operation,
>> which is exactly the level of caution that we want when dealing with
>> page tables.
>
> I think Bounded validation is orthogonal to the dispatch model.
> We can add Bounded to the current design without restructuring
> into traits. For example:
>
> // In ver2::Pte
> pub fn new_vram(pfn: Bounded<Pfn, 25>, writable: bool) -> Self { ... }
>
> // In ver3::Pte
> pub fn new_vram(pfn: Bounded<Pfn, 40>, writable: bool) -> Self { ... }
>
> The unified Pte enum wrapper already dispatches to the correct
> version-specific constructor, which would enforce the correct Bounded
> constraint for that version.
>
>> In order to fully benefit from it, we will need the bitfield macro from
>> the `kernel` crate so the PDE/PTE fields can be `Bounded`, I will try to
>> make it available quickly in a patch that you can depend on.
>
> That would be great, and I'd be happy to integrate Bounded validation once
> the macro is available. I just don't think we need to restructure the
> dispatch model in order to benefit from it.
>
>> But long story short, and although I need to dive deeper into the code,
>> this looks like a good candidate for using a trait and associated types.
>
> The walker code (walk.rs) is already version-agnostic and reads cleanly.
> The version dispatch is encapsulated behind method calls, not exposed as
> inline if/else blocks.
>
> Generic propagation (or version-specific dispatch at higher levels) adds more
> complexity at higher layers.
>
> Enclosed below [3] is the diff I used for my testing with the data, I don't
> really see a net readability win there (IMO, it is a net-loss in readability).
>
> [3]
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=trait-pt-dispatch&id=5eb0e98af11ba608ff4d0f7a06065ee863f5066a

IMO this diff is quite has got me quite in favour of trait approach.

I wanted about to purpose something similar (or maybe I had already?) trait
approach some versions ago but didn't due to the eventual need of `match` like
dispatch (like you had with `vmm_dispatch`), but your code made that looks not
as bad as I thought it would be.

Best,
Gary