Re: [PATCH 0/6] rust: page: Support borrowing `struct page` and physaddr conversion
From: Asahi Lina
Date: Tue Feb 04 2025 - 12:59:37 EST
On 2/4/25 11:38 PM, David Hildenbrand wrote:
>>>> If the answer is "no" then that's fine. It's still an unsafe function
>>>> and we need to document in the safety section that it should only be
>>>> used for memory that is either known to be allocated and pinned and
>>>> will
>>>> not be freed while the `struct page` is borrowed, or memory that is
>>>> reserved and not owned by the buddy allocator, so in practice correct
>>>> use would not be racy with memory hot-remove anyway.
>>>>
>>>> This is already the case for the drm/asahi use case, where the pfns
>>>> looked up will only ever be one of:
>>>>
>>>> - GEM objects that are mapped to the GPU and whose physical pages are
>>>> therefore pinned (and the VM is locked while this happens so the
>>>> objects
>>>> cannot become unpinned out from under the running code),
>>>
>>> How exactly are these pages pinned/obtained?
>>
>> Under the hood it's shmem. For pinning, it winds up at
>> `drm_gem_get_pages()`, which I think does a `shmem_read_folio_gfp()` on
>> a mapping set as unevictable.
>
> Thanks. So we grab another folio reference via shmem_read_folio_gfp()->
> shmem_get_folio_gfp().
>
> Hm, I wonder if we might end up holding folios residing in ZONE_MOVABLE/
> MIGRATE_CMA longer than we should.
>
> Compared to memfd_pin_folios(), which simulates FOLL_LONGTERM and makes
> sure to migrate pages out of ZONE_MOVABLE/MIGRATE_CMA.
>
> But that's a different discussion, just pointing it out, maybe I'm
> missing something :)
I think this is a little over my head. Though I only just realized that
we seem to be keeping the GEM objects pinned forever, even after unmap,
in the drm-shmem core API (I see no drm-shmem entry point that would
allow the sgt to be freed and its corresponding pages ref to be dropped,
other than a purge of purgeable objects or final destruction of the
object). I'll poke around, since this feels wrong: I thought we were
supposed to be able to have shrinker support for swapping out whole GPU
VMs in the modern GPU MM model, but I guess there's no implementation of
that for gem-shmem drivers yet...?
That's a discussion for the DRM side though.
>
>>
>> I'm not very familiar with the innards of that codepath, but it's
>> definitely an invariant that GEM objects have to be pinned while they
>> are mapped in GPU page tables (otherwise the GPU would end up accessing
>> freed memory).
>
> Right, there must be a raised reference.
>
[...]
>>>>>> Another case struct page can be freed is when hugetlb vmemmap
>>>>>> optimization
>>>>>> is used. Muchun (cc'd) is the maintainer of hugetlbfs.
>>>>>
>>>>> Here, the "struct page" remains valid though; it can still be
>>>>> accessed,
>>>>> although we disallow writes (which would be wrong).
>>>>>
>>>>> If you only allocate a page and free it later, there is no need to
>>>>> worry
>>>>> about either on the rust side.
>>>>
>>>> This is what the safe API does. (Also the unsafe physaddr APIs if all
>>>> you ever do is convert an allocated page to a physaddr and back, which
>>>> is the only thing the GPU page table code does during normal use. The
>>>> walking leaf PFNs story is only for GPU device coredumps when the
>>>> firmware crashes.)
>>>
>>> I would hope that we can lock down this interface as much as possible.
>>
>> Right, that's why the safe API never does any of the weird pfn->page
>> stuff. Rust driver code has to use unsafe {} to access the raw pfn->page
>> interface, which requires a // SAFETY comment explaining why what it's
>> doing is safe, and then we need to document in the function signature
>> what the safety requirements are so those comments can be reviewed.
>>
>>> Ideally, we would never go from pfn->page, unless
>>>
>>> (a) we remember somehow that we came from page->pfn. E.g., we allocated
>>> these pages or someone else provided us with these pages. The
>>> memmap
>>> cannot go away. I know it's hard.
>>
>> This is the common case for the page tables. 99% of the time this is
>> what the driver will be doing, with a single exception (the root page
>> table of the firmware/privileged VM is a system reserved memory region,
>> and falls under (b). It's one single page globally in the system.).
>
> Makes sense.
>
>>
>> The driver actually uses the completely unchecked interface in this
>> case, since it knows the pfns are definitely OK. I do a single check
>> with the checked interface at probe time for that one special-case pfn
>> so it can fail gracefully instead of oops if the DT config is
>> unusable/wrong.
>>
>>> (b) the pages are flagged as being special, similar to
>>> __ioremap_check_ram().
>>
>> This only ever happens during firmware crash dumps (plus the one
>> exception above).
>>
>> The missing (c) case is the kernel/firmware shared memory GEM objects
>> during crash dumps.
>
> If it's only for crash dumps etc. that might even be opt-in, it makes
> the whole thing a lot less scary. Maybe this could be opt-in somewhere,
> to "unlock" this interface? Just an idea.
Just to make sure we're on the same page, I don't think there's anything
to unlock on the Rust abstraction side (this series). At the end of the
day, if nothing else, the unchecked interface (which the regular
non-crash page table management code uses for performance) will let you
use any pfn you want; it's up to documentation and human review to
specify how drivers should use it. What Rust gives us here is the
mandatory `unsafe {}`, so any attempt to use this API will necessarily
stick out during review as potentially dangerous code that needs extra
scrutiny.
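As a rough sketch of what I mean by "sticks out" (the names below are
made up for illustration, not necessarily what this series ends up
calling them), a driver-side use of the raw phys->page conversion has to
look something like this:
    // Illustrative only: read_pte(), PTE_ADDR_MASK and
    // Page::borrow_phys_unchecked() are stand-in names, not the exact
    // API from this series.
    let pte: u64 = read_pte(iova); // raw PTE from the GPU page table
    let phys = pte & PTE_ADDR_MASK;
    // SAFETY: This PTE was written by this driver when it mapped a
    // pinned GEM object, and the VM lock is held, so the backing page
    // cannot be freed while this borrow is live.
    let page = unsafe { Page::borrow_phys_unchecked(phys) };
    // ... then copy the page contents into the coredump ...
The `// SAFETY` comment and the `unsafe {}` block are exactly what a
reviewer gets to push back on; there's no way to reach the raw interface
without writing them.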
For the client driver itself, I could gate the devcoredump stuff behind
a module parameter or something... but I don't think it's really worth
it. We don't have a way to reboot the firmware or recover from this
condition (platform limitations), so end users are stuck rebooting to
get back a usable machine anyway. If something goes wrong in the
crashdump code and the machine oopses or locks up worse... it doesn't
really make much of a difference for normal end users. I don't think
this will ever really happen given the constraints I described, but if
somehow it does (some other bug somewhere?), well... the machine was
already in an unrecoverable state anyway.
It would be nice to have userspace tooling deployed by default that
saves off the devcoredump somewhere, so we can have a chance at
debugging hard-to-hit firmware crashes... if it's opt-in, it would only
really be useful for developers and CI machines.
There *is* a system-global devcoredump disable, but it's not exposed
outside of the devcoredump core, and even when it's set, the core still
lets the coredump be generated and just throws away the result. It might
be worth sending out a patch to expose that state to drivers, so they
can skip the whole coredump generation machinery when it's unnecessary.
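From the driver's side, all I'd really want is an early-out before any
of the dump machinery runs, something like this (hypothetical; no such
binding exists today):
    // Hypothetical: assumes the devcoredump core exported its global
    // "disabled" state and a Rust binding was added on top of it.
    if devcoredump::dumps_disabled() {
        return; // skip walking page tables and building the dump
    }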
>> But I really need those to diagnose firmware
>> crashes. Of course, I could dump them separately through other APIs in
>> principle, but that would complicate the crashdump code quite a bit
>> since I'd have to go through all the kernel GPU memory allocators and
>> dig out all their backing GEM objects and copy the memory through their
>> vmap (they are all vmapped, which is yet another reason in practice the
>> pages are pinned) and merge it into the coredump file. I also wouldn't
>> have easy direct access to the matching GPU PTEs if I do that (I store
>> the PTE permission/caching bits in the coredump file, since those are
>> actually kind of critical to diagnose exactly what happened, as caching
>> issues are one major cause of firmware problems). Since I need the page
>> table walker code to grab the firmware pages anyway, I hope I can avoid
>> having to go through a completely different codepath for the kernel GEM
>> objects...
>
> Makes sense.
>
~~ Lina