Re: [PATCH] mmu_notifiers: Notify on pte permission upgrades

From: Sean Christopherson
Date: Mon May 22 2023 - 14:34:46 EST


On Mon, May 22, 2023, Alistair Popple wrote:
> Some architectures, specifically ARM and perhaps Sparc and IA64,
> require TLB invalidates when upgrading pte permission from read-only
> to read-write.
>
> The current mmu_notifier implementation assumes that upgrades do not
> need notifications. Typically though mmu_notifiers are used to
> implement TLB invalidations for secondary MMUs that comply with the
> main CPU architecture.
>
> Therefore if the main CPU architecture requires an invalidation for
> permission upgrade the secondary MMU will as well and an mmu_notifier
> should be sent for the upgrade.
>
> Currently CPU invalidations for permission upgrade occur in
> ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called
> directly from this architecture specific code as the notifier
> callbacks can sleep, and ptep_set_access_flags() is usually called
> whilst holding the PTL spinlock. Therefore add the notifier calls
> after the PTL is dropped and only if the PTE actually changed. This
> will allow secondary MMUs to obtain an updated PTE with appropriate
> permissions.
>
> This problem was discovered during testing of an ARM SMMU
> implementation that does not support broadcast TLB maintenance
> (BTM). In this case the SMMU driver uses notifiers to issue TLB
> invalidates. For read-only to read-write pte upgrades the SMMU
> continually returned a read-only PTE to the device, even though the
> CPU had a read-write PTE installed.
>
> Sending a mmu notifier event to the SMMU driver fixes the problem by
> flushing secondary TLB entries. A new notifier event type is added so
> drivers may filter out these invalidations if not required. Note a
> driver should never upgrade or install a PTE in response to this mmu
> notifier event as it is not synchronised against other PTE operations.
>
> Signed-off-by: Alistair Popple <apopple@xxxxxxxxxx>
> ---
> include/linux/mmu_notifier.h | 6 +++++
> mm/hugetlb.c | 24 ++++++++++++++++++-
> mm/memory.c | 45 ++++++++++++++++++++++++++++++++++--
> 3 files changed, 72 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index d6c06e140277..f14d68f119d8 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -31,6 +31,11 @@ struct mmu_interval_notifier;
> * pages in the range so to mirror those changes the user must inspect the CPU
> * page table (from the end callback).
> *
> + * @MMU_NOTIFY_PROTECTION_UPGRAGE: update is due to a change from read-only to
> + * read-write for pages in the range. This must not be used to upgrade
> + * permissions on secondary PTEs, rather it should only be used to invalidate
> + * caches such as secondary TLBs that may cache old read-only entries.

This is a poor fit for invalidate_range_{start,end}(). All other uses bookend
the primary MMU update, i.e. call start() _before_ changing PTEs. The comments
are somewhat stale as they talk only about "unmapped", but the contract between
the primary MMU and the secondary MMU is otherwise quite clear on when the primary
MMU will invoke start() and end().

* invalidate_range_start() is called when all pages in the
* range are still mapped and have at least a refcount of one.
*
* invalidate_range_end() is called when all pages in the
* range have been unmapped and the pages have been freed by
* the VM.

I'm also confused as to how this actually fixes ARM's SMMU. Unless I'm looking
at the wrong SMMU implementation, the SMMU implemenents only invalidate_range(),
not the start()/end() variants.

static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = {
.invalidate_range = arm_smmu_mm_invalidate_range,
.release = arm_smmu_mm_release,
.free_notifier = arm_smmu_mmu_notifier_free,
};

Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks
is perfectly valid. And AFAICT, the existing invalidate_range() hook is pretty
much a perfect fit for what you want to achieve.

* If invalidate_range() is used to manage a non-CPU TLB with
* shared page-tables, it not necessary to implement the
* invalidate_range_start()/end() notifiers, as
* invalidate_range() already catches the points in time when an
* external TLB range needs to be flushed. For more in depth
* discussion on this see Documentation/mm/mmu_notifier.rst

Even worse, this change may silently regress performance for secondary MMUs that
haven't yet taken advantage of the event type, e.g. KVM will zap all of KVM's PTEs
in response to the upgrade, instead of waiting until the guest actually tries to
utilize the new protections.