Re: [PATCH 01/31] mm: use pmdp_get_lockless() without surplus barrier()

From: Peter Xu
Date: Fri May 26 2023 - 12:49:55 EST


On Thu, May 25, 2023 at 03:35:01PM -0700, Hugh Dickins wrote:
> On Wed, 24 May 2023, Peter Xu wrote:
> > On Sun, May 21, 2023 at 09:49:45PM -0700, Hugh Dickins wrote:
> > > Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more
> > > reliable result with PAE (or READ_ONCE as before without PAE); and remove
> > > the unnecessary extra barrier()s which got left behind in its callers.
> >
> > Pure question: does it mean that some of below path (missing barrier()
> > ones) could have problem when CONFIG_PAE, hence this can be seen as a
> > (potential) bug fix?
>
> I don't think so; or at least, I am not claiming that this fixes any.
>
> It really depends on what use is made of the pmdval afterwards, and
> I've not checked through them. The READ_ONCE()s which were there,
> were good enough to make sure that the compiler did not reevaluate
> the pmdval later on, with perhaps a confusingly different result.
>
> But, at least in the x86 PAE case, they were not good enough to ensure
> that the two halves of the entry match up; and, sad to say, nor is that
> absolutely guaranteed by these conversions to pmdp_get_lockless() -
> because of the "HOWEVER" below. PeterZ's comments in linux/pgtable.h
> are well worth reading through.

Yes exactly - that's one major source of my confusion about using
{ptep|pmdp}_get_lockless().

In irq-off context, AFAICT we can observe a completely bogus pte/pmd with
the present bit set if we're extremely unlucky. E.g. the read can race
with something like "DONTNEED (which includes a tlb flush) followed by a
POPULATE_WRITE", giving a "present -> present" transition of the pte under
us, so we can read one half of the entry from pfn1 and the other half from
pfn2.

The other confusing thing about this _lockless trick on PAE is that I
think it _might_ go wrong with devmap..

The problem is that here we assumed that even if the high & low halves may
not match, we can still rely on most pte/pmd checks being done only on the
low bits (except the _none() check), so that at least those checks remain
atomic on the low half.

But it seems to me that's no longer true for pmd_trans_huge() after devmap
was introduced, e.g.:

static inline int pmd_trans_huge(pmd_t pmd)
{
	return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
}

#define _PAGE_PSE (_AT(pteval_t, 1) << _PAGE_BIT_PSE)
#define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page */

#define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP)
#define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4
#define _PAGE_BIT_SOFTW4 58 /* available for programmer */

So after devmap, with CONFIG_PAE pmd_trans_huge() checks not only the low
bits but also the high bits. I didn't dig further into whether there can
be any real issue, but IIUC that wasn't expected when the low/high trick
was introduced (originally in commit e585513b76f7b05d, solely for x86 PAE
fast-gup).

>
> You might question why I made these changes at all: some days
> I question them too. Better though imperfect? Or deceptive?

I think that's probably a separate topic to address in any case, so this
patch still makes things slightly better on the barrier() side, which I
agree with:

Acked-by: Peter Xu <peterx@xxxxxxxxxx>

Thanks,

--
Peter Xu