Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages

From: Thomas Hellström (Intel)
Date: Fri Mar 26 2021 - 05:08:53 EST



On 3/25/21 7:24 PM, Jason Gunthorpe wrote:
On Thu, Mar 25, 2021 at 07:13:33PM +0100, Thomas Hellström (Intel) wrote:
On 3/25/21 6:55 PM, Jason Gunthorpe wrote:
On Thu, Mar 25, 2021 at 06:51:26PM +0100, Thomas Hellström (Intel) wrote:
On 3/24/21 9:25 PM, Dave Hansen wrote:
On 3/24/21 1:22 PM, Thomas Hellström (Intel) wrote:
We also have not been careful at *all* about how _PAGE_BIT_SOFTW* are
used.  It's quite possible we can encode another use even in the
existing bits.

Personally, I'd just try:

#define _PAGE_BIT_SOFTW5        57      /* available for programmer */

OK, I'll follow your advise here. FWIW I grepped for SW1 and it seems
used in a selftest, but only for PTEs AFAICT.

Oh, and we don't care about 32-bit much anymore?
On x86, we have 64-bit PTEs when running 32-bit kernels if PAE is
enabled. IOW, we can handle the majority of 32-bit CPUs out there.

But, yeah, we don't care about 32-bit. :)
Hmm,

Actually it makes some sense to use SW1, to make it end up in the same dword
as the PSE bit, as from what I can tell, reading of a 64-bit pmd_t on 32-bit
PAE is not atomic, so in theory a huge pmd could be modified while reading
the pmd_t making the dwords inconsistent.... How does that work with fast
gup anyway?
It loops to get an atomic 64 bit value if the arch can't provide an
atomic 64 bit load
Hmm, ok, I see a READ_ONCE() in gup_pmd_range(), and then the resulting pmd
is dereferenced either in try_grab_compound_head() or __gup_device_huge(),
before the pmd is compared to the value the pointer is currently pointing
to. Couldn't those dereferences be on invalid pointers?
Uhhhhh.. That does look questionable, yes. Unless there is some tricky
reason why a 64 bit pmd entry on a 32 bit arch either can't exist or
has a stable upper 32 bits..

The pte does it with ptep_get_lockless(), we probably need the same
for the other levels too instead of open coding a READ_ONCE?

Jason

TBH, ptep_get_lockless() also looks a bit fishy. it says
"it will not switch to a completely different present page without a TLB flush in between".

What if the following happens:

processor 1: Reads lower dword of PTE.
processor 2: Zaps PTE. Gets stuck waiting to do TLB flush
processor 1: Reads upper dword of PTE, which is now zero.
processor 3: Hits a TLB miss, reads an unpopulated PTE and faults in a new PTE value which happens to be the same as the original one before the zap.
processor 1: Reads the newly faulted in lower dword, compares to the old one, gives an OK and returns a bogus PTE.

/Thomas