Re: [PATCH] mmu_notifiers: Notify on pte permission upgrades

From: Alistair Popple
Date: Mon May 22 2023 - 20:24:46 EST



Sean Christopherson <seanjc@xxxxxxxxxx> writes:

> On Mon, May 22, 2023, Alistair Popple wrote:
>> Some architectures, specifically ARM and perhaps Sparc and IA64,
>> require TLB invalidates when upgrading pte permission from read-only
>> to read-write.
>>
>> The current mmu_notifier implementation assumes that upgrades do not
>> need notifications. Typically though mmu_notifiers are used to
>> implement TLB invalidations for secondary MMUs that comply with the
>> main CPU architecture.
>>
>> Therefore if the main CPU architecture requires an invalidation for
>> permission upgrade the secondary MMU will as well and an mmu_notifier
>> should be sent for the upgrade.
>>
>> Currently CPU invalidations for permission upgrade occur in
>> ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called
>> directly from this architecture specific code as the notifier
>> callbacks can sleep, and ptep_set_access_flags() is usually called
>> whilst holding the PTL spinlock. Therefore add the notifier calls
>> after the PTL is dropped and only if the PTE actually changed. This
>> will allow secondary MMUs to obtain an updated PTE with appropriate
>> permissions.
>>
>> This problem was discovered during testing of an ARM SMMU
>> implementation that does not support broadcast TLB maintenance
>> (BTM). In this case the SMMU driver uses notifiers to issue TLB
>> invalidates. For read-only to read-write pte upgrades the SMMU
>> continually returned a read-only PTE to the device, even though the
>> CPU had a read-write PTE installed.
>>
>> Sending a mmu notifier event to the SMMU driver fixes the problem by
>> flushing secondary TLB entries. A new notifier event type is added so
>> drivers may filter out these invalidations if not required. Note a
>> driver should never upgrade or install a PTE in response to this mmu
>> notifier event as it is not synchronised against other PTE operations.
>>
>> Signed-off-by: Alistair Popple <apopple@xxxxxxxxxx>
>> ---
>> include/linux/mmu_notifier.h | 6 +++++
>> mm/hugetlb.c | 24 ++++++++++++++++++-
>> mm/memory.c | 45 ++++++++++++++++++++++++++++++++++--
>> 3 files changed, 72 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index d6c06e140277..f14d68f119d8 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -31,6 +31,11 @@ struct mmu_interval_notifier;
>> * pages in the range so to mirror those changes the user must inspect the CPU
>> * page table (from the end callback).
>> *
>> + * @MMU_NOTIFY_PROTECTION_UPGRAGE: update is due to a change from read-only to
>> + * read-write for pages in the range. This must not be used to upgrade
>> + * permissions on secondary PTEs, rather it should only be used to invalidate
>> + * caches such as secondary TLBs that may cache old read-only entries.
>
> This is a poor fit for invalidate_range_{start,end}(). All other uses bookend
> the primary MMU update, i.e. call start() _before_ changing PTEs. The comments
> are somewhat stale as they talk only about "unmapped", but the contract between
> the primary MMU and the secondary MMU is otherwise quite clear on when the primary
> MMU will invoke start() and end().
>
> * invalidate_range_start() is called when all pages in the
> * range are still mapped and have at least a refcount of one.
> *
> * invalidate_range_end() is called when all pages in the
> * range have been unmapped and the pages have been freed by
> * the VM.
>
> I'm also confused as to how this actually fixes ARM's SMMU. Unless I'm looking
> at the wrong SMMU implementation, the SMMU implemenents only invalidate_range(),
> not the start()/end() variants.

mmu_invalidate_range_end() calls the invalidate_range() callback if
the start()/end() variants aren't set.

> static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = {
> .invalidate_range = arm_smmu_mm_invalidate_range,
> .release = arm_smmu_mm_release,
> .free_notifier = arm_smmu_mmu_notifier_free,
> };
>
> Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks
> is perfectly valid. And AFAICT, the existing invalidate_range() hook is pretty
> much a perfect fit for what you want to achieve.

Right, I didn't take that approach because it doesn't allow an event
type to be passed which would allow them to be filtered on platforms
which don't require this.

I had also assumed the invalidate_range() callbacks were allowed to
sleep, hence couldn't be called under PTL. That's certainly true of mmu
interval notifier callbacks, but Catalin reminded me that calls such as
ptep_clear_flush_notify() already call invalidate_range() callback under
PTL so I guess we already assume drivers don't sleep in their
invalidate_range() callbacks. I will update the comments to reflect
that.

> * If invalidate_range() is used to manage a non-CPU TLB with
> * shared page-tables, it not necessary to implement the
> * invalidate_range_start()/end() notifiers, as
> * invalidate_range() already catches the points in time when an
> * external TLB range needs to be flushed. For more in depth
> * discussion on this see Documentation/mm/mmu_notifier.rst
>
> Even worse, this change may silently regress performance for secondary MMUs that
> haven't yet taken advantage of the event type, e.g. KVM will zap all of KVM's PTEs
> in response to the upgrade, instead of waiting until the guest actually tries to
> utilize the new protections.

Yeah, I like the idea of introducing a
ptep_set_access_flags_notify(). That way this won't regress performance
on platforms that don't need it. Note this isn't a new feature but
rather a bugfix. It's unclear to me why KVM on ARM hasn't already run
into this issue, but I'm no KVM expert. Thanks for the feedback.

- Alistair