Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

From: Thomas Hellström (Intel)
Date: Thu Mar 25 2021 - 07:54:35 EST



On 3/25/21 12:30 PM, Jason Gunthorpe wrote:
On Thu, Mar 25, 2021 at 10:51:35AM +0100, Thomas Hellström (Intel) wrote:

Please explain that further. Why do we need the mmap lock to insert PMDs
but not when inserting PTEs?
We don't. But once you've inserted a PMD directory you can't remove it
unless you have the mmap lock (and probably also the i_mmap_lock in write
mode). That means, for example, that if you have a VRAM region mapped with
huge PMDs and it gets evicted, and you then happen to read a byte from it
while it's evicted and thereby populate the full region with PTEs pointing
to system pages, you can't go back to huge PMDs again without a munmap() in
between.
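
To illustrate what I mean (rough, untested sketch; bo_backed_by_contiguous_vram(), range_is_pmd_aligned(), my_bo_huge_pfn() and my_bo_pfn() are made-up helpers, this is not the actual TTM fault path):

#include <linux/mm.h>
#include <linux/huge_mm.h>
#include <linux/pfn_t.h>

static vm_fault_t my_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long pmd_base = vmf->address & PMD_MASK;

	/*
	 * While the bo is backed by suitably aligned, contiguous VRAM we
	 * can insert a single huge PMD entry covering the whole range...
	 */
	if (bo_backed_by_contiguous_vram(vma) &&
	    range_is_pmd_aligned(vma, pmd_base))
		return vmf_insert_pfn_pmd(vmf, my_bo_huge_pfn(vma, pmd_base),
					  vmf->flags & FAULT_FLAG_WRITE);

	/*
	 * ...but after an eviction to system pages we can only insert 4K
	 * PTEs here. Once a PTE page table has been populated under the
	 * PMD we can't tear it down again from the fault handler (that
	 * would need the mmap lock, and probably the i_mmap_lock, in
	 * write mode), so the range stays PTE-mapped until munmap().
	 */
	return vmf_insert_pfn(vma, vmf->address, my_bo_pfn(vma, vmf->address));
}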
This is all basically magic to me still, but THP does this
transformation and I think what it does could work here too. We
probably wouldn't be able to upgrade while handling fault, but at the
same time, this should be quite rare as it would require the driver to
have supplied a small page for this VMA at some point.

IIRC THP handles this using khugepaged, grabbing the lock in write mode when coalescing, and yeah, I don't think anything prevents anyone from extending khugepaged to do that also for special huge page-table entries.
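
Very roughly, and leaving out all the page isolation and revalidation details, the locking pattern would be something like the below (not the actual mm/khugepaged.c code; collapse_ptes_into_pmd() is just a placeholder for the real work):

#include <linux/mm.h>
#include <linux/mmap_lock.h>

static void collapse_range(struct mm_struct *mm, struct vm_area_struct *vma,
			   unsigned long haddr)
{
	/* Coalescing needs the mmap lock in write mode... */
	mmap_write_lock(mm);

	/*
	 * ...so that no fault can run concurrently while the PTE page
	 * table under the PMD is unhooked and freed and a single huge
	 * PMD entry is installed in its place. For special (pfn) entries
	 * like TTM's, this step would need to learn about mappings
	 * without struct pages.
	 */
	collapse_ptes_into_pmd(vma, haddr);

	mmap_write_unlock(mm);
}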


Apart from that I still don't fully get why we need this in the first
place.
Because virtual huge page address boundaries need to be aligned with
physical huge page address boundaries, and mmap can happen before bos are
populated so you have no way of knowing how physical huge page
address
But this is a mmap-time problem, fault can't fix mmap using the wrong VA.

Nope. The point here was that in this case, to make sure mmap picks a VA that gives us a reasonable chance of alignment, the driver might need to be aware of, and do trickery with, the huge page-table-entry sizes anyway, although I think in most cases a standard helper for this can be supplied.
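
Something along the lines of what thp_get_unmapped_area() does for file THP, i.e. a get_unmapped_area() callback that over-allocates the hole and rounds the start up (rough sketch, not actual driver code, and it assumes the bo offset is already huge-page aligned):

#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/sched.h>
#include <linux/err.h>

static unsigned long my_get_unmapped_area(struct file *filp, unsigned long addr,
					  unsigned long len, unsigned long pgoff,
					  unsigned long flags)
{
	unsigned long align = PMD_SIZE;	/* or PUD_SIZE where that makes sense */
	unsigned long ret;

	/* A fixed or hinted address is out of our hands. */
	if (flags & MAP_FIXED || addr)
		goto fallback;

	/*
	 * Ask for a hole large enough to allow rounding the start up to
	 * the huge page boundary, so that the VA and the (assumed
	 * aligned) bo offset end up with the same alignment and a huge
	 * page-table entry can actually be used at fault time.
	 */
	ret = current->mm->get_unmapped_area(filp, 0, len + align, pgoff, flags);
	if (IS_ERR_VALUE(ret))
		goto fallback;

	return ALIGN(ret, align);

fallback:
	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
}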

/Thomas



I really don't see that either. When a buffer is accessed by the CPU it
is in > 90% of all cases completely accessed. Not faulting in full
ranges is just optimizing for a really unlikely case here.
It might be that you're right, but do all drivers wanting to use this behave
like drm in this respect? Using the interface to fault in a 1G range in the
hope that it can be mapped with a huge PUD may unexpectedly consume and
populate some 16+ MB of page tables.
If the underlying device block size is so big then sure, why not? The
"unexpectedly" should be quite rare/non-existent anyhow.

Jason