Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature

From: Steven Price
Date: Wed May 12 2021 - 12:46:24 EST


On 10/05/2021 19:35, Catalin Marinas wrote:
On Fri, May 07, 2021 at 07:25:39PM +0100, Catalin Marinas wrote:
On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
On 04/05/2021 18:40, Catalin Marinas wrote:
On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
Given the changes to set_pte_at() which means that tags are restored from
swap even if !PROT_MTE, the only race I can see remaining is the creation of
new PROT_MTE mappings. As you mention an attempt to change mappings in the
VMM memory space should involve a mmu notifier call which I think serialises
this. So the remaining issue is doing this in a separate address space.

So I guess the potential problem is:

* allocate memory MAP_SHARED but !PROT_MTE
* fork()
* VM causes a fault in parent address space
* child does a mprotect(PROT_MTE)

With the last two potentially racing. Sadly I can't see a good way of
handling that.

Ah, the mmap lock doesn't help as they are different processes
(mprotect() acquires it as a writer).

I wonder whether this is racy even in the absence of KVM. If both parent
and child do an mprotect(PROT_MTE), one of them may be reading stale
tags for a brief period.

Maybe we should revisit whether shared MTE pages are of any use, though
it's an ABI change (not bad if no-one is relying on this). However...
[...]
Thinking about this, we have a similar problem with the PG_dcache_clean
and two processes doing mprotect(PROT_EXEC). One of them could see the
flag set and skip the I-cache maintenance while the other executes
stale instructions. change_pte_range() could acquire the page lock if
the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
solve the MTE/KVM case but we could at least take the page lock via
user_mem_abort().
[...]
This is the real issue I see - the race in PROT_MTE case is either a data
leak (syncing after setting the bit) or data loss (syncing before setting
the bit).
[...]
But without serialising through a spinlock (in mte_sync_tags()) I haven't
been able to come up with any way of closing the race. But with the change
to set_pte_at() to call mte_sync_tags() even if the PTE isn't PROT_MTE that
is likely to seriously hurt performance.

Yeah. We could add another page flag as a lock though I think it should
be the core code that prevents the race.

If we are to do it in the arch code, maybe easier with a custom
ptep_modify_prot_start/end() where we check if it's VM_SHARED and
VM_MTE, take a (big) lock.

I think in the general case we don't even need VM_SHARED. For example,
we have two processes mapping a file, read-only. An mprotect() call in
both processes will race on the page->flags via the corresponding
set_pte_at(). I think an mprotect() with a page fault in different
processes can also race.

The PROT_EXEC case can be easily fixed, as you said already. The
PROT_MTE with MAP_PRIVATE I think can be made safe by a similar
approach: test flag, clear tags, set flag. A subsequent write would
trigger a CoW, so different page anyway.

Anyway, I don't think ptep_modify_prot_start/end would buy us much, it
probably makes the code even harder to read.

In the core code, something like below (well, a partial hack, not tested
and it doesn't handle huge pages but just to give an idea):

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94188df1ee55..6ba96ff141a6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -114,6 +113,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
}
oldpte = ptep_modify_prot_start(vma, addr, pte);
+ if (vma->vm_flags & VM_SHARED) {
+ page = vm_normal_page(vma, addr, oldpte);
+ lock_page(page);
+ }
ptent = pte_modify(oldpte, newprot);
if (preserve_write)
ptent = pte_mk_savedwrite(ptent);
@@ -138,6 +141,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
ptent = pte_mkwrite(ptent);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+ if (page)
+ unlock_page(page);
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);

That's bogus: lock_page() might sleep but this whole code sequence is
under the ptl spinlock. There are some lock_page_* variants but that
would involve either a busy loop on this path or some bailing out,
waiting for a release.

Options:

1. Change the mte_sync_tags() code path to set the flag after clearing
and avoid reading stale tags. We document that mprotect() on
MAP_SHARED may lead to tag loss. Maybe we can intercept this in the
arch code and return an error.

This is the best option I've come up with so far - but it's not a good
one! We can replace the set_bit() with a test_and_set_bit() to catch the
race after it has occurred - but I'm not sure what we can do about it
then (we've already wiped the data). Returning an error doesn't seem
particularly useful at that point, a message in dmesg is about the best
I can come up with.

2. Figure out some other locking in the core code. However, if
mprotect() in one process can race with a handle_pte_fault() in
another, on the same shared mapping, it's not trivial.
filemap_map_pages() would take the page lock before calling
do_set_pte(), so mprotect() would need the same page lock.

I can't see how this is going to work without harming the performance of
non-MTE work. Ultimately we're trying to add some sort of locking for
two (mostly) unrelated processes doing page table operations, which will
hurt scalability.

3. Use another PG_arch_3 bit as a lock to spin on in the arch code (i.e.
set it around the other PG_arch_* bit setting).

This is certainly tempting, although sadly the existing
wait_on_page_bit() is sleeping - so this would either be a literal spin,
or we'd need to implement a new non-sleeping wait mechanism.

I ran out of ideas.


4. Sledgehammer locking in mte_sync_page_tags(), add a spinlock only for
the MTE case where we have to sync tags (see below). What the actual
performance impact of this is I've no idea. It could certainly be bad
if there are a lot of pages with MTE enabled, which sadly is exactly
the case if KVM is used with MTE :(

Steve

--->8----
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 0d320c060ebe..389ad40256f6 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -25,23 +25,33 @@
u64 gcr_kernel_excl __ro_after_init;
static bool report_fault_once = true;
+static spinlock_t tag_sync_lock;
static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
bool pte_is_tagged)
{
pte_t old_pte = READ_ONCE(*ptep);
+ if (!is_swap_pte(old_pte) && !pte_is_tagged)
+ return;
+
+ spin_lock_irqsave(&tag_sync_lock, flags);
+
+ /* Recheck with the lock held */
+ if (test_bit(PG_mte_tagged, &page->flags))
+ goto out;
+
if (check_swap && is_swap_pte(old_pte)) {
swp_entry_t entry = pte_to_swp_entry(old_pte);
if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
set_bit(PG_mte_tagged, &page->flags);
- return;
+ goto out;
}
}
if (!pte_is_tagged)
- return;
+ goto out;
page_kasan_tag_reset(page);
/*
@@ -55,6 +65,9 @@ static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
mte_clear_page_tags(page_address(page));
set_bit(PG_mte_tagged, &page->flags);
+
+out:
+ spin_unlock_irqrestore(&tag_sync_lock, flags);
}
void mte_sync_tags(pte_t *ptep, pte_t pte)
@@ -63,6 +76,11 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
long i, nr_pages = compound_nr(page);
bool check_swap = nr_pages == 1;
bool pte_is_tagged = pte_tagged(pte);
+ unsigned long flags;
+
+ /* Early out if there's nothing to do */
+ if (!check_swap && !pte_is_tagged)
+ return;
/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {