Re: [PATCH v11 10/26] mm: protect VMA modifications using VMA sequence count

From: Laurent Dufour
Date: Mon Nov 05 2018 - 13:23:12 EST


Le 05/11/2018 Ã 08:04, vinayak menon a ÃcritÂ:
Hi Laurent,

On Thu, May 17, 2018 at 4:37 PM Laurent Dufour
<ldufour@xxxxxxxxxxxxxxxxxx> wrote:

The VMA sequence count has been introduced to allow fast detection of
VMA modification when running a page fault handler without holding
the mmap_sem.

This patch provides protection against the VMA modification done in :
- madvise()
- mpol_rebind_policy()
- vma_replace_policy()
- change_prot_numa()
- mlock(), munlock()
- mprotect()
- mmap_region()
- collapse_huge_page()
- userfaultd registering services

In addition, VMA fields which will be read during the speculative fault
path needs to be written using WRITE_ONCE to prevent write to be split
and intermediate values to be pushed to other CPUs.

Signed-off-by: Laurent Dufour <ldufour@xxxxxxxxxxxxxxxxxx>
---
fs/proc/task_mmu.c | 5 ++++-
fs/userfaultfd.c | 17 +++++++++++++----
mm/khugepaged.c | 3 +++
mm/madvise.c | 6 +++++-
mm/mempolicy.c | 51 ++++++++++++++++++++++++++++++++++-----------------
mm/mlock.c | 13 ++++++++-----
mm/mmap.c | 22 +++++++++++++---------
mm/mprotect.c | 4 +++-
mm/swap_state.c | 8 ++++++--
9 files changed, 89 insertions(+), 40 deletions(-)

struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct vm_fault *vmf)
@@ -665,9 +669,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
unsigned long *start,
unsigned long *end)
{
- *start = max3(lpfn, PFN_DOWN(vma->vm_start),
+ *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
PFN_DOWN(faddr & PMD_MASK));
- *end = min3(rpfn, PFN_DOWN(vma->vm_end),
+ *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
}

--
2.7.4


I have got a crash on 4.14 kernel with speculative page faults enabled
and here is my analysis of the problem.
The issue was reported only once.

Hi Vinayak,

Thanks for reporting this.


[23409.303395] el1_da+0x24/0x84
[23409.303400] __radix_tree_lookup+0x8/0x90
[23409.303407] find_get_entry+0x64/0x14c
[23409.303410] pagecache_get_page+0x5c/0x27c
[23409.303416] __read_swap_cache_async+0x80/0x260
[23409.303420] swap_vma_readahead+0x264/0x37c
[23409.303423] swapin_readahead+0x5c/0x6c
[23409.303428] do_swap_page+0x128/0x6e4
[23409.303431] handle_pte_fault+0x230/0xca4
[23409.303435] __handle_speculative_fault+0x57c/0x7c8
[23409.303438] do_page_fault+0x228/0x3e8
[23409.303442] do_translation_fault+0x50/0x6c
[23409.303445] do_mem_abort+0x5c/0xe0
[23409.303447] el0_da+0x20/0x24

Process A accesses address ADDR (part of VMA A) and that results in a
translation fault.
Kernel enters __handle_speculative_fault to fix the fault.
Process A enters do_swap_page->swapin_readahead->swap_vma_readahead
from speculative path.
During this time, another process B which shares the same mm, does a
mprotect from another CPU which follows
mprotect_fixup->__split_vma, and it splits VMA A into VMAs A and B.
After the split, ADDR falls into VMA B, but process A is still using
VMA A.
Now ADDR is greater than VMA_A->vm_start and VMA_A->vm_end.
swap_vma_readahead->swap_ra_info uses start and end of vma to
calculate ptes and nr_pte, which goes wrong due to this and finally
resulting in wrong "entry" passed to
swap_vma_readahead->__read_swap_cache_async, and in turn causing
invalid swapper_space
being passed to __read_swap_cache_async->find_get_page, causing an abort.

The fix I have tried is to cache vm_start and vm_end also in vmf and
use it in swap_ra_clamp_pfn. Let me know your thoughts on this. I can
send
the patch I am a using if you feel that is the right thing to do.

I think the best would be to don't do swap readahead during the speculatvive page fault. If the page is found in the swap cache, that's fine, but otherwise, we should f allback to the regular page fault.

The attached -untested- patch is doing this, if you want to give it a try. I'll review that for the next series.

Thanks,
Laurent.
From 056afafb0bccea6a356f80f4253ffcd3ef4a1f8d Mon Sep 17 00:00:00 2001
From: Laurent Dufour <ldufour@xxxxxxxxxxxxxxxxxx>
Date: Mon, 5 Nov 2018 18:43:01 +0100
Subject: [PATCH] mm: don't do swap readahead during speculative page fault

Vinayak Menon faced a panic because one thread was page faulting a page in
swap, while another one was mprotecting a part of the VMA leading to a VMA
split.
This raise a panic in swap_vma_readahead() because the VMA's boundaries
were not more matching the faulting address.

To avoid this, if the page is not found in the swap, the speculative page
fault is aborted to retry a regular page fault.

Signed-off-by: Laurent Dufour <ldufour@xxxxxxxxxxxxxxxxxx>
---
mm/memory.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 9dd5ffeb1f7e..720dc9a1b99f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3139,6 +3139,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
lru_cache_add_anon(page);
swap_readpage(page, true);
}
+ } else if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ /*
+ * Don't try readahead during a speculative page fault as
+ * the VMA's boundaries may change in our back.
+ * If the page is not in the swap cache and synchronous read
+ * is disabled, fall back to the regular page fault mechanism.
+ */
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+ ret = VM_FAULT_RETRY;
+ goto out;
} else {
page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
vmf);
--
2.19.1