On 24.03.21 at 17:38, Jason Gunthorpe wrote:
> On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel) wrote:
>> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
>>> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström (Intel) wrote:
>>>> Hmm, but I'm not sure what problem we're trying to solve by changing
>>>> the interface in this way?
>>>
>>> We are trying to make a sensible driver API to deal with huge pages.
>>>
>>>> Currently if the core vm requests a huge pud, we give it one, and if
>>>> we can't or don't want to (because of dirty-tracking, for example,
>>>> which is always done on 4K page-level) we just return
>>>> VM_FAULT_FALLBACK, and the fault is retried at a lower level.
>>>
>>> Well, my thought would be to move the pte related stuff into
>>> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
>>>
>>> I don't know if the locking works out, but it feels cleaner that the
>>> driver tells the vmf how big a page it can stuff in, not the vm
>>> telling the driver to stuff in a certain size page which it might not
>>> want to do.
>>>
>>> Some devices want to work on an in-between page size like 64k so they
>>> can't form 2M pages, but they can stuff 64k of 4K pages in a batch on
>>> every fault.
>>>
>>> In an ideal world the creation/destruction of page table levels would
>>> be dynamic at this point, like THP.
>>
>> Hmm, yes, but we would in that case be limited anyway to inserting
>> ranges smaller than or equal to the fault size, to avoid extensive and
>> possibly unnecessary checks for contiguous memory.
>
> Why? The insert function is walking the page tables; it just updates
> things as they are. It learns the arrangement for free while doing the
> walk.
>
> The device always has to provide consistent data; if it overlaps into
> pages that are already populated, that is fine so long as it isn't
> changing their addresses.
>
>> And then, if we can't support the full fault size, we'd need to either
>> presume a size and alignment for the next level or search for
>> contiguous memory in both directions around the fault address, perhaps
>> unnecessarily as well.
>
> You don't really need to care about levels; the device should be
> faulting in the largest memory regions it can within its efficiency
> limits.
>
> If it works on 4M pages then it should be faulting 4M pages. The page
> size of the underlying CPU doesn't really matter much, other than some
> tuning to impact how the device's allocator works.
I agree with Jason here.

We get the best efficiency when we look at what the GPU driver provides
and make sure that we handle one GPU page at a time, instead of looking
too much into what the CPU is doing with its page tables.

At least on AMD GPUs the GPU page size can be anything between 4KiB and
2GiB, and if we fault in a 2GiB chunk at once, this can in theory be
handled by just two giant page table entries on the CPU side.