Re: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding

From: Gerald Schaefer
Date: Wed Sep 02 2020 - 08:24:52 EST


On Tue, 1 Sep 2020 16:22:22 -0700
John Hubbard <jhubbard@xxxxxxxxxx> wrote:

> On 9/1/20 10:40 AM, Gerald Schaefer wrote:
> > On Mon, 31 Aug 2020 12:15:53 -0700
> > Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> ...
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index e8cbc2e795d5..43dacbce823f 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -681,6 +681,38 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> > })
> > #endif
> >
> > +/*
> > + * With dynamic page table levels on s390, the static pXd_addr_end() functions
> > + * will not return corresponding dynamic boundaries. This is no problem as long
> > + * as only pXd pointers are passed down during page table walk, because
> > + * pXd_offset() will simply return the given pointer for folded levels, and the
> > + * pointer iteration over a range simply happens at the correct page table
> > + * level.
> > + * It is however a problem with gup_fast, or other places walking the page
> > + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
> > + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
> > + * a stack variable, which cannot be used for pointer iteration at the correct
> > + * level. Instead, the iteration then has to happen by going up to pgd level
> > + * again. To allow this, provide pXd_addr_end_folded() functions with an
> > + * additional pXd value parameter, which can be used on s390 to determine the
> > + * folding level and return the corresponding boundary.
>
> Ah OK, I finally see what you have in mind. And as Jason noted, if we just
> pass an additional parameter to pXd_addr_end() that's going to be
> cleaner. And doing so puts this in line with other page table
> abstractions that also carry more information than some architectures
> need. For example, on x86, set_pte_at() ignores the first two
> parameters:
>
> #define set_pte_at(mm, addr, ptep, pte) native_set_pte_at(mm, addr, ptep, pte)
>
> static inline void native_set_pte_at(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep , pte_t pte)
> {
> native_set_pte(ptep, pte);
> }
>
> This type of abstraction has worked out very well, IMHO.

Yes, it certainly feels like the right way to do it, and it would
not affect other archs in a functional way. It would however introduce
a subtle change for s390 behavior on _all_ page table walkers, not
just the READ_ONCE gup_fast path, i.e. it changes the level at which
the pointer iteration is done. Of course, that *should* not have any
functional issues, or else it would also be broken in gup_fast, but
in this area we often were wrong with should / could assumptions...

At least it could have some (minor) performance impact on s390,
due to going up to pgd level again instead of staying at the
folded level, but only for crossing pud/p4d boundaries, so that
*should* be neglectable.

So, while totally agreeing that adding the pXd parameter to
pXd_addr_end() in general looks like the cleanest way, we will at
least need to give this some proper testing on s390.

One other question that came up here while doing that generalized
approach was "how to submit it", i.e. should it be split up in
multiple patches which might be cleaner, or have it all-in-one,
which would simplify Fixes/stable tags. After all, we must not forget
that this fixes a severe data integrity issue on s390, so we don't
want to unnecessarily complicate or defer backports for that.

That being said, maybe a compromise could be to split it up
in a way that first introduces the pXd_addr_end_folded()
helpers for gup only, with proper Fixes/stable tags? Then
do the generalization with more thorough testing for s390
after that, possibly also with Fixes/stable tags so that we
do not have different code in stable and upstream?