Re: [PATCH 0/6] rust: page: Support borrowing `struct page` and physaddr conversion

From: David Hildenbrand
Date: Tue Feb 04 2025 - 09:38:36 EST


It can still race with memory offlining, and it refuses ZONE_DEVICE
pages. For the latter, we have a different way to check validity. See
memory_failure() that first calls pfn_to_online_page() to then check
get_dev_pagemap().

I'll give it a shot with these functions. If they work for my use case,
then it's good to have extra checks and I'll add them for v2. Thanks!

Let me know if you run into any issues.
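
For reference, the check order memory_failure() uses would look roughly
like this from the Rust side. This is only a sketch: the function name
is made up, and whether pfn_to_online_page(), get_dev_pagemap() and
put_dev_pagemap() are actually reachable through `bindings` depends on
the config (the inline ones would need helper shims).

use kernel::bindings;

/// Check whether `pfn` currently has a usable `struct page`, in the
/// same order memory_failure() checks it: online (buddy-managed)
/// memory first, then a ZONE_DEVICE lookup.
unsafe fn pfn_has_valid_page(pfn: usize) -> bool {
    // Online memory? This refuses ZONE_DEVICE pages by design.
    if !unsafe { bindings::pfn_to_online_page(pfn as _) }.is_null() {
        return true;
    }
    // Not online: it may still be a ZONE_DEVICE page. get_dev_pagemap()
    // takes a reference on the pagemap, which we drop again right away
    // since this is only a validity check.
    let pgmap = unsafe {
        bindings::get_dev_pagemap(pfn as _, core::ptr::null_mut())
    };
    if pgmap.is_null() {
        return false;
    }
    unsafe { bindings::put_dev_pagemap(pgmap) };
    true
}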




If the answer is "no" then that's fine. It's still an unsafe function
and we need to document in the safety section that it should only be
used for memory that is either known to be allocated and pinned and that
will not be freed while the `struct page` is borrowed, or memory that is
reserved and not owned by the buddy allocator. In practice, correct use
therefore cannot race with memory hot-remove anyway.

This is already the case for the drm/asahi use case, where the pfns
looked up will only ever be one of:

- GEM objects that are mapped to the GPU and whose physical pages are
therefore pinned (and the VM is locked while this happens so the objects
cannot become unpinned out from under the running code),

How exactly are these pages pinned/obtained?

Under the hood it's shmem. For pinning, it winds up at
`drm_gem_get_pages()`, which I think does a `shmem_read_folio_gfp()` on
a mapping set as unevictable.

Thanks. So we grab another folio reference via shmem_read_folio_gfp()->shmem_get_folio_gfp().

Hm, I wonder if we might end up holding folios residing in ZONE_MOVABLE/MIGRATE_CMA longer than we should.

Compared to memfd_pin_folios(), which simulates FOLL_LONGTERM and makes sure to migrate pages out of ZONE_MOVABLE/MIGRATE_CMA.

But that's a different discussion, just pointing it out, maybe I'm missing something :)


I'm not very familiar with the innards of that codepath, but it's
definitely an invariant that GEM objects have to be pinned while they
are mapped in GPU page tables (otherwise the GPU would end up accessing
freed memory).

Right, there must be a raised reference.


Since the code that walks the PT to dump pages is part of the same PT
object and takes a mutable reference, the Rust guarantees mean it's
impossible for the PT to be concurrently mutated or anything like that.
So if one of these objects *were* unpinned/freed somehow while the dump
code is running, that would be a major bug somewhere else, since there
would be dangling PTEs left over.

In practice, there's a big lock around each PT/VM at a higher level of
the driver, so any attempts to unmap/free any of those objects will be
stuck waiting for the lock on the VM they are mapped into.
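
To make the aliasing argument concrete, the shape is roughly this (type
and method names are made up for illustration):

/// Sketch: the dump walker and the map/unmap paths live on the same object.
struct PageTable {
    // ... page-table state elided ...
}

impl PageTable {
    /// Mutating the page table requires exclusive access.
    fn unmap(&mut self /* , ... */) {
        // ...
    }

    /// So does the coredump walk, so the borrow checker rules out a
    /// concurrent unmap()/map() on the same `PageTable` from safe code
    /// while the dump is running.
    fn dump(&mut self /* , ... */) {
        // ... walk leaf PTEs and snapshot the referenced pages ...
    }
}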

Understood, thanks.

[...]


Another case where a struct page can be freed is when the hugetlb
vmemmap optimization is used. Muchun (cc'd) is the maintainer of
hugetlbfs.

Here, the "struct page" remains valid though; it can still be accessed,
although we disallow writes (which would be wrong).

If you only allocate a page and free it later, there is no need to worry
about either on the rust side.

This is what the safe API does. (The same holds for the unsafe physaddr
APIs if all you ever do is convert an allocated page to a physaddr and
back, which is the only thing the GPU page table code does during
normal use. The leaf-PFN walking only comes up for GPU device coredumps
when the firmware crashes.)
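
For illustration, the normal-use round trip has this shape (sketch:
only `Page::alloc_page()` is existing kernel API; the two conversion
helpers are local stand-ins for whatever the real names end up being):

use kernel::alloc::flags::GFP_KERNEL;
use kernel::page::Page;
use kernel::prelude::*;

/// Stand-in for the page -> physical address conversion (sketch only).
fn page_to_phys(_page: &Page) -> usize {
    unimplemented!("placeholder")
}

/// Stand-in for the unsafe physaddr -> page conversion (sketch only).
unsafe fn phys_to_page<'a>(_paddr: usize) -> &'a Page {
    unimplemented!("placeholder")
}

fn pte_round_trip() -> Result {
    // Normal page-table operation: allocate a page we own...
    let pt_page = Page::alloc_page(GFP_KERNEL)?;
    // ...program its physical address into a PTE...
    let paddr = page_to_phys(&pt_page);
    // ...and later resolve that PTE back to the page. This is case (a):
    // the physaddr came from our own allocation, which is still alive,
    // so the memmap cannot go away underneath us.
    // SAFETY: `paddr` was obtained from `pt_page`, which we still own.
    let _pt_page_again = unsafe { phys_to_page(paddr) };
    Ok(())
}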

I would hope that we can lock down this interface as much as possible.

Right, that's why the safe API never does any of the weird pfn->page
stuff. Rust driver code has to use unsafe {} to access the raw pfn->page
interface, which requires a // SAFETY comment explaining why what it's
doing is safe, and then we need to document in the function signature
what the safety requirements are so those comments can be reviewed.
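
So every raw use ends up looking something like this at the call site
(the function name is a placeholder):

use kernel::bindings;

/// Placeholder for the raw pfn -> page interface (sketch only).
unsafe fn pfn_to_page_raw(_pfn: usize) -> *mut bindings::page {
    unimplemented!("placeholder")
}

fn dump_leaf_pte(pfn: usize) {
    // SAFETY: `pfn` was read from a live PTE while the VM lock is held,
    // so the backing GEM object is pinned and its `struct page` cannot
    // go away while we use it here.
    let _page = unsafe { pfn_to_page_raw(pfn) };
    // ... snapshot the page contents into the coredump ...
}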

Ideally, we would never go from pfn->page, unless

(a) we remember somehow that we came from page->pfn. E.g., we allocated
    these pages or someone else provided us with these pages. The memmap
    cannot go away. I know it's hard.

This is the common case for the page tables. 99% of the time this is
what the driver will be doing, with a single exception: the root page
table of the firmware/privileged VM is a system-reserved memory region,
and falls under (b). It's a single page globally in the system.

Makes sense.


The driver actually uses the completely unchecked interface in this
case, since it knows the pfns are definitely OK. I do a single check
with the checked interface at probe time for that one special-case pfn
so it can fail gracefully instead of oopsing if the DT config is
unusable/wrong.
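
Roughly like this (the checked helper's name and the error handling are
placeholders, not the real driver code):

use kernel::bindings;
use kernel::prelude::*;

/// Placeholder for the checked physaddr -> page interface (sketch only).
unsafe fn checked_phys_to_page(_paddr: usize) -> Option<*mut bindings::page> {
    unimplemented!("placeholder")
}

fn validate_root_pt(paddr: usize) -> Result {
    // Done once at probe: if the DT-provided reserved region is bogus,
    // fail probe gracefully instead of oopsing later in the page-table
    // code.
    // SAFETY: this address comes from a system-reserved memory region
    // described in the DT; it is never owned by the buddy allocator.
    if unsafe { checked_phys_to_page(paddr) }.is_none() {
        pr_err!("firmware page table address from DT has no valid struct page\n");
        return Err(EINVAL);
    }
    Ok(())
}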

(b) the pages are flagged as being special, similar to
    __ioremap_check_ram().

This only ever happens during firmware crash dumps (plus the one
exception above).

The missing (c) case is the kernel/firmware shared memory GEM objects
during crash dumps.

If it's only for crash dumps etc., that might even be opt-in, which makes the whole thing a lot less scary. Maybe this could be opt-in somewhere, to "unlock" this interface? Just an idea.

But I really need those to diagnose firmware
crashes. Of course, I could dump them separately through other APIs in
principle, but that would complicate the crashdump code quite a bit
since I'd have to go through all the kernel GPU memory allocators and
dig out all their backing GEM objects and copy the memory through their
vmap (they are all vmapped, which is yet another reason in practice the
pages are pinned) and merge it into the coredump file. I also wouldn't
have easy direct access to the matching GPU PTEs if I do that (I store
the PTE permission/caching bits in the coredump file, since those are
actually kind of critical to diagnose exactly what happened, as caching
issues are one major cause of firmware problems). Since I need the page
table walker code to grab the firmware pages anyway, I hope I can avoid
having to go through a completely different codepath for the kernel GEM
objects...

Makes sense.

--
Cheers,

David / dhildenb