Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature

From: David Hildenbrand
Date: Wed Mar 31 2021 - 10:15:37 EST


On 31.03.21 12:41, Steven Price wrote:
On 31/03/2021 10:32, David Hildenbrand wrote:
On 31.03.21 11:21, Catalin Marinas wrote:
On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:
On 30.03.21 12:30, Catalin Marinas wrote:
On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
On 28/03/2021 13:21, Catalin Marinas wrote:
On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote:
On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 77cb2d28f2a4..b31b7a821f90 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu
*vcpu, phys_addr_t fault_ipa,
        if (vma_pagesize == PAGE_SIZE && !force_pte)
            vma_pagesize = transparent_hugepage_adjust(memslot,
hva,
                                   &pfn, &fault_ipa);
+
+    if (fault_status != FSC_PERM && kvm_has_mte(kvm) &&
pfn_valid(pfn)) {
+        /*
+         * VM will be able to see the page's tags, so we must
ensure
+         * they have been initialised. if PG_mte_tagged is set,
tags
+         * have already been initialised.
+         */
+        struct page *page = pfn_to_page(pfn);
+        unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
+
+        for (i = 0; i < nr_pages; i++, page++) {
+            if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+                mte_clear_page_tags(page_address(page));
+        }
+    }

This pfn_valid() check may be problematic. Following commit
eeb0753ba27b
("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it
returns
true for ZONE_DEVICE memory but such memory is allowed not to
support
MTE.

Some more thinking, this should be safe as any ZONE_DEVICE would be
mapped as untagged memory in the kernel linear map. It could be
slightly
inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
untagged memory. Another overhead is pfn_valid() which will likely
end
up calling memblock_is_map_memory().

However, the bigger issue is that Stage 2 cannot disable tagging for
Stage 1 unless the memory is Non-cacheable or Device at S2. Is
there a
way to detect what gets mapped in the guest as Normal Cacheable
memory
and make sure it's only early memory or hotplug but no ZONE_DEVICE
(or
something else like on-chip memory)?  If we can't guarantee that all
Cacheable memory given to a guest supports tags, we should disable
the
feature altogether.

In stage 2 I believe we only have two types of mapping - 'normal' or
DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the
latter is a
case of checking the 'device' variable, and makes sense to avoid the
overhead you describe.

This should also guarantee that all stage-2 cacheable memory
supports tags,
as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid()
should only
be true for memory that Linux considers "normal".

If you think "normal" == "normal System RAM", that's wrong; see below.

By "normal" I think both Steven and I meant the Normal Cacheable memory
attribute (another being the Device memory attribute).

Sadly there's no good standardised terminology here. Aarch64 provides
the "normal (cacheable)" definition. Memory which is mapped as "Normal
Cacheable" is implicitly MTE capable when shared with a guest (because
the stage 2 mappings don't allow restricting MTE other than mapping it
as Device memory).

So MTE also forces us to have a definition of memory which is "bog
standard memory"[1] separate from the mapping attributes. This is the
main memory which fully supports MTE.

Separate from the "bog standard" we have the "special"[1] memory, e.g.
ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but
that memory may not support MTE tags. This memory can only be safely
shared with a guest in the following situations:

1. MTE is completely disabled for the guest

2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE)

3. We have some guarantee that guest MTE access are in some way safe.

(1) is the situation today (without this patch series). But it prevents
the guest from using MTE in any form.

(2) is pretty terrible for general memory, but is the get-out clause for
mapping devices into the guest.

(3) isn't something we have any architectural way of discovering. We'd
need to know what the device did with the MTE accesses (and any caches
between the CPU and the device) to ensure there aren't any side-channels
or h/w lockup issues. We'd also need some way of describing this memory
to the guest.

So at least for the time being the approach is to avoid letting a guest
with MTE enabled have access to this sort of memory.

[1] Neither "bog standard" nor "special" are real terms - like I said
there's a lack of standardised terminology.

That's the problem. With Anshuman's commit I mentioned above,
pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
memory, not talking about some I/O mapping that requires Device_nGnRE).
So kvm_is_device_pfn() is false for such memory and it may be mapped as
Normal but it is not guaranteed to support tagging.

pfn_valid() means "there is a struct page"; if you do pfn_to_page() and
touch the page, you won't fault. So Anshuman's commit is correct.

I agree.

pfn_to_online_page() means, "there is a struct page and it's system RAM
that's in use; the memmap has a sane content"

Does pfn_to_online_page() returns a valid struct page pointer for
ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for
some definition of system RAM (I assume NVDIMM != system RAM). For
example, pmem_attach_disk() calls devm_memremap_pages() and this would
use the Normal Cacheable memory attribute without necessarily being
system RAM.

No, not for ZONE_DEVICE.

However, if you expose PMEM via dax/kmem as System RAM to the system (->
add_memory_driver_managed()), then PMEM (managed via ZONE_NOMRAL or
ZONE_MOVABLE) would work with pfn_to_online_page() -- as the system
thinks it's "ordinary system RAM" and the memory is managed by the buddy.

So if I'm understanding this correctly for KVM we need to use
pfn_to_online_pages() and reject if NULL is returned? In the case of
dax/kmem there already needs to be validation that the memory supports
MTE (otherwise we break user space) before it's allowed into the
"ordinary system RAM" bucket.

That should work.

1. One alternative is

if (!pfn_valid(pfn))
return false;
#ifdef CONFIG_ZONE_DEVICE
page = pfn_to_page(pfn);
if (page_zonenum(page) == ZONE_DEVICE)
return false;
#endif
return true;


Note that when you are dealing with random PFNs, this approach is in general not safe; the memmap could be uninitialized and contain garbage. You can have false positives for ZONE_DEVICE.


2. Yet another (slower?) variant to detect (some?) ZONE_DEVICE is

pgmap = get_dev_pagemap(pfn, NULL);
put_dev_pagemap(pgmap);

if (pgmap)
return false;
return true;



I know that /dev/mem mappings can be problematic ... because the memmap could be in any state and actually we shouldn't even touch/rely on any "struct pages" at all, as we have a pure PFN mapping ...

--
Thanks,

David / dhildenb