Re: [PATCH v2 5/6] s390/mm: Batch PTE updates in lazy MMU mode

From: Heiko Carstens

Date: Thu Apr 23 2026 - 08:30:49 EST


On Wed, Apr 15, 2026 at 05:01:23PM +0200, Alexander Gordeev wrote:
> Make use of the IPTE instruction's "Additional Entries" field to
> invalidate multiple PTEs in one go while in lazy MMU mode. This
> is the mode in which many memory-management system calls (like
> mremap(), mprotect(), etc.) update memory attributes.
>
> To achieve that, the set_pte() and ptep_get() primitives use a
> per-CPU cache to store and retrieve PTE values and apply the
> cached values to the real page table once lazy MMU mode is left.
>
> The same is done for memory-management platform callbacks that
> would otherwise cause intense per-PTE IPTE traffic, reducing the
> number of IPTE instructions from up to PTRS_PER_PTE to a single
> instruction in the best case. The average reduction is of course
> smaller.
>
> Since all existing page table iterators called in lazy MMU mode
> handle one table at a time, the per-CPU cache does not need to be
> larger than PTRS_PER_PTE entries. That also naturally aligns with
> the IPTE instruction, which must not cross a page table boundary.
>
> Before this change, the system calls did:
>
> lazy_mmu_mode_enable_pte()
> ...
> <update PTEs> // up to PTRS_PER_PTE single-IPTEs
> ...
> lazy_mmu_mode_disable()
>
> With this change, the system calls do:
>
> lazy_mmu_mode_enable_pte()
> ...
> <store new PTE values in the per-CPU cache>
> ...
> lazy_mmu_mode_disable() // apply cache with one multi-IPTE

I think what is not necessarily immediately obvious: this approach must ensure
that within such a lazy MMU section there is not a single occurrence of code
which dereferences page table entry pointers without going through the
modified primitives mentioned above.

Directly dereferencing such pointers would bypass the cache and lead to
incorrect results. Therefore we do need some mechanism which makes sure this
cannot happen. Preferably that would happen at compile time with static code
analysis. Alternatively, your KASAN implementation would be helpful to find
something like that after the fact.

However in any case we need something to address this problem.

> +#if !defined(CONFIG_IPTE_BATCH) || defined(__DECOMPRESSOR)
> +static inline
> +bool ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
> +                                          unsigned long addr, pte_t *ptep,
> +                                          int *res)
> +{
> +        return false;
> +}

...

> +bool ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
> +                                          unsigned long addr, pte_t *ptep,
> +                                          int *res);

Could you change this to something like:

bool __ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
                                            unsigned long addr, pte_t *ptep,
                                            int *res);

static inline
bool ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
                                          unsigned long addr, pte_t *ptep,
                                          int *res)
{
        if (__is_defined(__DECOMPRESSOR))
                return false;
        return __ipte_batch_ptep_test_and_clear_young(vma, addr, ptep, res);
}

? This avoids the ifdef and makes the code easier to read and change.

> +static inline void set_pte(pte_t *ptep, pte_t pte)
> +{
> +        if (!ipte_batch_set_pte(ptep, pte))
> +                __set_pte(ptep, pte);
> +}

Not sure if you analyzed it, but it looks like this might be the reason why
you see the fork() slowdown: every page table operation now comes with a
function call, even if it is not needed.

I guess it would be helpful to add an early exit to the ipte_batch() inlines.
E.g. going back to the example above:

static inline
bool ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
                                          unsigned long addr, pte_t *ptep,
                                          int *res)
{
        if (__is_defined(__DECOMPRESSOR))
                return false;
--->    if (unlikely(!ipte_batch_active()))
--->            return false;
        return __ipte_batch_ptep_test_and_clear_young(vma, addr, ptep, res);
}

Where ipte_batch_active() would be an inline function which simply tests a bit
in lowcore.

> diff --git a/arch/s390/mm/Makefile b/arch/s390/mm/Makefile
> index 193899c39ca7..0f6c6de447d4 100644
> --- a/arch/s390/mm/Makefile
> +++ b/arch/s390/mm/Makefile
> @@ -11,5 +11,6 @@ obj-$(CONFIG_DEBUG_VIRTUAL) += physaddr.o
> obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
> obj-$(CONFIG_PTDUMP) += dump_pagetables.o
> obj-$(CONFIG_PFAULT) += pfault.o
> +obj-$(CONFIG_IPTE_BATCH) += ipte_batch.o

As already stated before: let's get rid of the Kconfig option. This will be
enabled anyway.

> +#define PTE_POISON 0

Is there a reason why the PTE_POISON value is zero? If it can be a different
value, I would like to propose a PTE value which has bit 11 set (the one which
must be zero). If that would ever be transferred to a page table, and it would
be used, we would see that immediately :)

> +struct ipte_batch {
> +        struct mm_struct *mm;
> +        unsigned long base_addr;
> +        unsigned long base_end;
> +        pte_t *base_pte;
> +        pte_t *start_pte;
> +        pte_t *end_pte;
> +        pte_t cache[PTRS_PER_PTE];
> +};
> +
> +static DEFINE_PER_CPU(struct ipte_batch, ipte_range);

This is ~1MB with 512 possible CPUs. I think it would make sense to allocate
and free this data structure with CPU hotplug to avoid wasting too much
memory.

> +static int count_contiguous(pte_t *start, pte_t *end, bool *valid)
> +{
> +        pte_t pte = __ptep_get(start);
> +        pte_t *ptep;
> +
> +        *valid = !(pte_val(pte) & _PAGE_INVALID);
> +
> +        for (ptep = start + 1; ptep < end; ptep++) {
> +                pte = __ptep_get(ptep);
> +                if (*valid) {
> +                        if (pte_val(pte) & _PAGE_INVALID)
> +                                break;
> +                } else {
> +                        if (!(pte_val(pte) & _PAGE_INVALID))
> +                                break;
> +                }
> +        }
> +
> +        return ptep - start;
> +}

Wouldn't it be possible to shorten this a bit? E.g.:

static int count_contiguous(pte_t *start, pte_t *end, bool *valid)
{
        unsigned long inv;
        pte_t *ptep;

        inv = pte_val(__ptep_get(start)) & _PAGE_INVALID;
        *valid = !inv;
        for (ptep = start + 1; ptep < end; ptep++) {
                if ((pte_val(__ptep_get(ptep)) & _PAGE_INVALID) != inv)
                        break;
        }
        return ptep - start;
}

> +static void __invalidate_pte_range(struct mm_struct *mm, unsigned long addr,
> +                                   int nr_ptes, pte_t *ptep)
> +{
> +        atomic_inc(&mm->context.flush_count);
> +        if (cpu_has_tlb_lc() &&
> +            cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id())))

One line please. Yes, I know, this is more or less copy-pasted, but... :)

Just a general comment about the naming conventions: imho ipte_batch is not a
nice choice, since the name of the facility is "ipte range". However, I would
abstract even more, since nobody knows if there will be a different
instruction or facility that achieves all of this in a better way.

Anyway... maybe rename the file simply to mmu.c or tlb.c and change the
function prefixes accordingly so we end up with shorter function names?