Re: [PATCH v2 5/6] s390/mm: Batch PTE updates in lazy MMU mode

From: Alexander Gordeev

Date: Fri Apr 24 2026 - 09:11:04 EST

On Thu, Apr 23, 2026 at 02:28:24PM +0200, Heiko Carstens wrote:

Hi Heiko!

...
> > With this change, the system calls do:
> >
> > lazy_mmu_mode_enable_pte()
> > ...
> > <store new PTE values in the per-CPU cache>
> > ...
> > lazy_mmu_mode_disable() // apply cache with one multi-IPTE
>
> I think what is not necessarily immediately obvious: this approach must assure
> that within such a lazy mmu section there is not a single occurrence of code
> which doesn't use the above mentioned modified primitives to dereference page
> table entry pointers.
>
> Directly dereferencing such pointers would bypass the cache and lead to
> incorrect results. Therefore we do need some mechanism which makes sure this
> cannot happen. Preferebly that would happen at compile time with static code
> analysis. Alternatively your Kasan implementation would be helpful to find
> something like that after-the-fact.
>
> However in any case we need something to address this problem.

Will try to come up with something.

...
>
> > +bool ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > + unsigned long addr, pte_t *ptep,
> > + int *res);
>
> Could you change this to something like:
>
> bool __ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep,
> int *res);
>
> static inline
> bool ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep,
> int *res)
> {
> if (__is_defined(__DECOMPRESSOR))
> return false;
> return __ipte_batch_ptep_test_and_clear_young(vma, addr, ptep, res);
> }
>
> ? This avoids the ifdef and makes the code easier to read and change.

Yes, I also like it much better and had something similar in early versions.
But backed off to keep the whole header style intact. Anyway, will do.

> > +static inline void set_pte(pte_t *ptep, pte_t pte)
> > +{
> > + if (!ipte_batch_set_pte(ptep, pte))
> > + __set_pte(ptep, pte);
> > +}
>
> Not sure if you analyzed it, but it looks like this might be the reason why
> you see the fork() slowdown: every page table operation now comes with a
> function call, even if is not needed.
>
> I guess it would be helpful to add an early exit to the ipte_batch() inlines.
> E.g. going back to the example above:
>
> static inline
> bool ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep,
> int *res)
> {
> if (__is_defined(__DECOMPRESSOR))
> return false;
> ---> if (unlikely(!ipte_batch_active()))
> ---> return false;
> return __ipte_batch_ptep_test_and_clear_young(vma, addr, ptep, res);
> }
>
> Where ipte_batch_active() would be an inline function which simply tests a bit
> in lowcore.

The bit check alone would not be enough - it should be done with the
preemption disabled. Then it would be something like lazy_mmu_mode(),
but with one or more values in the lowcore instead of percpu variable
(at least ::base_pte to check whether ptep falls into the active range).

...
> > +#define PTE_POISON 0
>
> Is there a reason why the PTE_POISON value is zero? If it can be a different
> value, I would like to propose a PTE which has bit 11 set (the one which must
> be zero). If that would ever be transferred to page table, and it would be
> used, we would see that immediately :)

The main reason was to avoid explicit initialization of per-CPU arrays.
But yes, bit 11 sounds like a much better approach ;)

> > +struct ipte_batch {
> > + struct mm_struct *mm;
> > + unsigned long base_addr;
> > + unsigned long base_end;
> > + pte_t *base_pte;
> > + pte_t *start_pte;
> > + pte_t *end_pte;
> > + pte_t cache[PTRS_PER_PTE];
> > +};
> > +
> > +static DEFINE_PER_CPU(struct ipte_batch, ipte_range);
>
> This is ~1MB with 512 possible CPUs. I think it would make sense to allocate
> and free this data structure with CPU hotplug to avoid wasting too much
> memory.

Will do.

> > +static int count_contiguous(pte_t *start, pte_t *end, bool *valid)
> > +{
> > + pte_t pte = __ptep_get(start);
> > + pte_t *ptep;
> > +
> > + *valid = !(pte_val(pte) & _PAGE_INVALID);
> > +
> > + for (ptep = start + 1; ptep < end; ptep++) {
> > + pte = __ptep_get(ptep);
> > + if (*valid) {
> > + if (pte_val(pte) & _PAGE_INVALID)
> > + break;
> > + } else {
> > + if (!(pte_val(pte) & _PAGE_INVALID))
> > + break;
> > + }
> > + }
> > +
> > + return ptep - start;
> > +}
>
> Wouldn't it be possible to shorten this a bit? E.g.:

Will try.

> static int count_contiguous(pte_t *start, pte_t *end, bool *valid)
> {
> unsigned long inv;
> pte_t *ptep;
>
> inv = pte_val(__ptep_get(start)) & _PAGE_INVALID;
> *valid = !inv;
> for (ptep = start + 1; ptep < end; ptep++) {
> if (pte_val(__ptep_get(ptep)) & _PAGE_INVALID != inv)
> break;
> }
> return ptep - start;
> }
>
> > +static void __invalidate_pte_range(struct mm_struct *mm, unsigned long addr,
> > + int nr_ptes, pte_t *ptep)
> > +{
> > + atomic_inc(&mm->context.flush_count);
> > + if (cpu_has_tlb_lc() &&
> > + cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id())))
>
> One line please. Yes, I know, this is more or less copy-pasted, but... :)

Will do.

> Just a general comment about the naming conventions: imho ipte_batch is not a
> nice choice, since the name of the facility is "ipte range". However I would
> abstract even more, since nobody knows if there will be a different
> instruction or facility to achieve all of this in a better way.
>
> Anyway... maybe rename the file simply to mmu.c or tlb.c and change the
> function prefixes accordingly so we end up with shorter function names?

I would in turn suggest lazy_mmu_ prefix and lazy_mmu.c source name
to emphasize it implements the generic lazy mmu mode. At the same time
keep ipte_batch (or ipte_range rather) in the implementation itself.

Thanks a lot for the review!