Re: [PATCH] mm: hugepage: mark splitted page dirty when needed

From: Peter Xu
Date: Mon Sep 10 2018 - 00:08:18 EST


On Fri, Sep 07, 2018 at 01:54:35PM -0400, Jerome Glisse wrote:
> On Fri, Sep 07, 2018 at 12:35:24PM +0800, Peter Xu wrote:
> > On Thu, Sep 06, 2018 at 05:08:42PM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Sep 06, 2018 at 07:39:33PM +0800, Peter Xu wrote:
> > > > On Wed, Sep 05, 2018 at 03:55:22PM +0300, Kirill A. Shutemov wrote:
> > > > > On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> > > > > > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > > > > > > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > > > > > >
> > > > > > > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > > > > > > >> When splitting a huge page, we should set all small pages as dirty if
> > > > > > > >> the original huge page has the dirty bit set before. Otherwise we'll
> > > > > > > >> lose the original dirty bit.
> > > > > > > >
> > > > > > > > We don't lose it. It got transfered to struct page flag:
> > > > > > > >
> > > > > > > > if (pmd_dirty(old_pmd))
> > > > > > > > SetPageDirty(page);
> > > > > > > >
> > > > > > >
> > > > > > > Plus, when split_huge_page_to_list() splits a THP, its subroutine __split_huge_page()
> > > > > > > propagates the dirty bit in the head page flag to all subpages in __split_huge_page_tail().
> > > > > >
> > > > > > Hi, Kirill, Zi,
> > > > > >
> > > > > > Thanks for your responses!
> > > > > >
> > > > > > Though in my test the huge page seems to be splitted not by
> > > > > > split_huge_page_to_list() but by explicit calls to
> > > > > > change_protection(). The stack looks like this (again, this is a
> > > > > > customized kernel, and I added an explicit dump_stack() there):
> > > > > >
> > > > > > kernel: dump_stack+0x5c/0x7b
> > > > > > kernel: __split_huge_pmd+0x192/0xdc0
> > > > > > kernel: ? update_load_avg+0x8b/0x550
> > > > > > kernel: ? update_load_avg+0x8b/0x550
> > > > > > kernel: ? account_entity_enqueue+0xc5/0xf0
> > > > > > kernel: ? enqueue_entity+0x112/0x650
> > > > > > kernel: change_protection+0x3a2/0xab0
> > > > > > kernel: mwriteprotect_range+0xdd/0x110
> > > > > > kernel: userfaultfd_ioctl+0x50b/0x1210
> > > > > > kernel: ? do_futex+0x2cf/0xb20
> > > > > > kernel: ? tty_write+0x1d2/0x2f0
> > > > > > kernel: ? do_vfs_ioctl+0x9f/0x610
> > > > > > kernel: do_vfs_ioctl+0x9f/0x610
> > > > > > kernel: ? __x64_sys_futex+0x88/0x180
> > > > > > kernel: ksys_ioctl+0x70/0x80
> > > > > > kernel: __x64_sys_ioctl+0x16/0x20
> > > > > > kernel: do_syscall_64+0x55/0x150
> > > > > > kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > >
> > > > > > At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> > > > > > to kernel space, which is handled by mwriteprotect_range(). In case
> > > > > > you'd like to refer to the kernel, it's basically this one from
> > > > > > Andrea's (with very trivial changes):
> > > > > >
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> > > > > >
> > > > > > So... do we have two paths to split the huge pages separately?
> > > > >
> > > > > We have two entiries that can be split: page table enties and underlying
> > > > > compound page.
> > > > >
> > > > > split_huge_pmd() (and variants of it) split the PMD entry into a PTE page
> > > > > table. It doens't touch underlying compound page. The page still can be
> > > > > mapped in other place as huge.
> > > > >
> > > > > split_huge_page() (and ivariants of it) split compound page into a number
> > > > > of 4k (or whatever PAGE_SIZE is). The operation requires splitting all
> > > > > PMD, but not other way around.
> > > > >
> > > > > >
> > > > > > Another (possibly very naive) question is: could any of you hint me
> > > > > > how the page dirty bit is finally applied to the PTEs? These two
> > > > > > dirty flags confused me for a few days already (the SetPageDirty() one
> > > > > > which sets the page dirty flag, and the pte_mkdirty() which sets that
> > > > > > onto the real PTEs).
> > > > >
> > > > > Dirty bit from page table entries transferes to sturct page flug and used
> > > > > for decision making in reclaim path.
> > > >
> > > > Thanks for explaining. It's much clearer for me.
> > > >
> > > > Though for the issue I have encountered, I am still confused on why
> > > > that dirty bit can be ignored for the splitted PTEs. Indeed we have:
> > > >
> > > > if (pmd_dirty(old_pmd))
> > > > SetPageDirty(page);
> > > >
> > > > However to me this only transfers (as you explained above) the dirty
> > > > bit (AFAIU it's possibly set by the hardware when the page is written)
> > > > to the page struct of the compound page. It did not really apply to
> > > > every small page of the splitted huge page. As you also explained,
> > > > this __split_huge_pmd() only splits the PMD entry but it keeps the
> > > > compound huge page there, then IMHO it should also apply the dirty
> > > > bits from the huge page to all the small page entries, no?
> > >
> > > The bit on compound page represents all small subpages. PageDirty() on any
> > > subpage will return you true if the compound page is dirty.
> >
> > Ah I didn't notice this before (since PageDirty is defined with
> > PF_HEAD), thanks for pointing out.
> >
> > >
> > > > These dirty bits are really important to my scenario since AFAIU the
> > > > change_protection() call is using these dirty bits to decide whether
> > > > it should append the WRITE bit - it finally corresponds to the lines
> > > > in change_pte_range():
> > > >
> > > > /* Avoid taking write faults for known dirty pages */
> > > > if (dirty_accountable && pte_dirty(ptent) &&
> > > > (pte_soft_dirty(ptent) ||
> > > > !(vma->vm_flags & VM_SOFTDIRTY))) {
> > > > ptent = pte_mkwrite(ptent);
> > > > }
> > > >
> > > > So when mprotect() with that range (my case is UFFDIO_WRITEPROTECT,
> > > > which is similar) although we pass in the new protocol with VM_WRITE
> > > > here it'll still mask it since the dirty bit is not set, then the
> > > > userspace program (in my case, the QEMU thread that handles write
> > > > protect failures) can never fixup the write-protected page fault.
> > >
> > > I don't follow here.
> > >
> > > The code you quoting above is an apportunistic optimization and should not
> > > be mission-critical. The dirty and writable bits can go away as soon as
> > > you drop page table lock for the page.
> >
> > Indeed it's an optimization, IIUC it tries to avoid an extra but
> > possibly useless write-protect page fault when the dirty bits are
> > already set after all. However that's a bit trickly here - in my use
> > case the write-protect page faults will be handled by one of the QEMU
> > thread that reads the userfaultfd handle, so the fault must be handled
> > there instead of inside kernel otherwise there'll be nested page
> > faults forever (and userfaultfd will detect that then send a SIGBUS
> > instead).
> >
> > I'll try to explain with some more details on how I understand what
> > happened. This should also answer Zi's question so I'll avoid
> > replying twice there. Please feel free to correct me.
> >
> > Firstly, below should be the correct steps to handle a userspace write
> > protect page fault using Andrea's userfault-wp tree (I only mentioned
> > the page fault steps and ignored most of the irrelevant setup
> > procedures):
> >
> > 1. QEMU write-protects page P using UFFDIO_WRITEPROTECT ioctl, then
> > the write bit removed from PTE, so QEMU can capture any further
> > writes to the page
> >
> > ... (time passes)...
>
> UFFDIO_WRITEPROTECT with UFFDIO_WRITEPROTECT_MODE_WP
>
> >
> > 2. [vCPU thread] Guest writes to the page P, trigger wp page fault
> >
> > 3. [vCPU thread] Since the page (and the whole vma) is tracked by
> > userfault-wp, it goes into handle_userfault() to notify userspace
> > about the page fault and waits...
> >
> > 4. [userfault thread] Gets the message about the page fault, do
> > anything it like with the page P (mostly copy it somewhere), and
> > fixup the page fault by another UFFDIO_WRITEPROTECT ioctl, this
> > time to reset the write bit. After that, it'll wake up the vCPU
> > thread
>
> UFFDIO_WRITEPROTECT with !UFFDIO_WRITEPROTECT_MODE_WP
>
> It confused me when looking at code:
> https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=aa97daa6e54f2cfed1a6f1f38f9629608b8aadcc
>
> >
> > 5. [vCPU thread] Got waked up, retry the page fault by returning a
> > VM_FAULT_RETRY in handle_userfault(). Then this time we'll see the
> > PTE with write bit set correctly. vCPU continues execution.
> >
> > Then let's consider THP here, where we might miss the dirty page for
> > the PTE of the small page P. In that case at step (4) when we want to
> > recover the write bit we'll fail since the dirty bit is missing in the
> > small PTE, so the write bit will still be cleared (expecting that the
> > next page fault will fill it up). However in step (5) we can't really
> > fill in the write bit since we'll fault again into the
> > handle_userfault() before that happens and then it goes back to step
> > (3) then it can actualy loop forever if without the loop detection
> > code in handle_userfault().
> >
> > So I think now I understand that setting up the dirty bit in the
> > compound page should be enough, then would below change acceptable
> > instead?
> >
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 6d331620b9e5..0d4a8129a5e7 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -73,6 +73,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > if (pte_present(oldpte)) {
> > pte_t ptent;
> > bool preserve_write = prot_numa && pte_write(oldpte);
> > + bool dirty;
> >
> > /*
> > * Avoid trapping faults against the zero or KSM
> > @@ -115,8 +116,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > if (preserve_write)
> > ptent = pte_mk_savedwrite(ptent);
> >
> > + /*
> > + * The extra PageDirty() check will make sure
> > + * we'll capture the dirty page even if the
> > + * PTE dirty bit unset. One case is when the
> > + * PTE is splitted from a huge PMD, in that
> > + * case the dirty flag might only be set on
> > + * the compound page instead of this PTE.
> > + */
> > + dirty = pte_dirty(ptent) || PageDirty(pte_page(ptent));
> > +
> > /* Avoid taking write faults for known dirty pages */
> > - if (dirty_accountable && pte_dirty(ptent) &&
> > + if (dirty_accountable && dirty &&
> > (pte_soft_dirty(ptent) ||
> > !(vma->vm_flags & VM_SOFTDIRTY))) {
> > ptent = pte_mkwrite(ptent);
> >
> > I tested that this change can also fix my problem (QEMU will not get
> > SIGBUS after write protection starts).
>
> This is wrong mwriteprotect_range() should already properly set pte
> entry to non write protect:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=b16cb9fcb76bec59cbe1427e73246dc81a4942e2
>
> if (enable_wp)
> newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
> else
> newprot = vm_get_page_prot(dst_vma->vm_flags);
>
> So it seems that the vm_flags do not have VM_WRITE set.

Hi, Jerome,

I think the vma has correct VM_WRITE flag there. I added some prints
into mwriteprotect_range() to trap more information when coredump
happens:

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d3d0a13a636f..ebdcd76887df 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -606,6 +606,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
struct vm_area_struct *dst_vma;
pgprot_t newprot;
int err;
+ unsigned long pages;

/*
* Sanitize the command parameters:
@@ -638,13 +639,17 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
if (!vma_is_anonymous(dst_vma))
goto out_unlock;

+ pr_info("%s: vm_flags: 0x%lx\n", __func__, dst_vma->vm_flags);
+
if (enable_wp)
newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
else
newprot = vm_get_page_prot(dst_vma->vm_flags);

- change_protection(dst_vma, start, start + len, newprot,
- !enable_wp, 0);
+ pages = change_protection(dst_vma, start, start + len, newprot,
+ !enable_wp, 0);
+ pr_info("%s: 0x%lx-0x%lx changed %ld pages (newprot=%lx, wp=%d)\n",
+ __func__, start, start + len, pages, newprot.pgprot, enable_wp);

err = 0;
out_unlock:

Here's what I got starting from when QEMU starts until QEMU coredumps:

Sep 10 11:40:24 px-ws kernel: mwriteprotect_range: vm_flags: 0xa0121073
Sep 10 11:40:24 px-ws kernel: mwriteprotect_range: 0x7f205fe00000-0x7f209fe00000 changed 1024 pages (newprot=8000000000000025, wp=1)
Sep 10 11:40:24 px-ws kernel: mwriteprotect_range: vm_flags: 0xa0121073
Sep 10 11:40:24 px-ws kernel: mwriteprotect_range: 0x7f20aa000000-0x7f20ab000000 changed 512 pages (newprot=8000000000000025, wp=1)
Sep 10 11:40:24 px-ws kernel: mwriteprotect_range: vm_flags: 0xa0121073 <----------------------------------- [1]
Sep 10 11:40:24 px-ws kernel: mwriteprotect_range: 0x7f205fe7f000-0x7f205fe80000 changed 1 pages (newprot=8000000000000025, wp=0)
Sep 10 11:40:24 px-ws kernel: FAULT_FLAG_ALLOW_RETRY missing 71
Sep 10 11:40:24 px-ws kernel: CPU: 7 PID: 1637 Comm: qemu-system-x86 Not tainted 4.19.0-rc2+ #27
Sep 10 11:40:24 px-ws kernel: Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC6AUS 06/22/2016
Sep 10 11:40:24 px-ws kernel: Call Trace:
Sep 10 11:40:24 px-ws kernel: dump_stack+0x5c/0x7b
Sep 10 11:40:24 px-ws kernel: handle_userfault+0x4b5/0x780
Sep 10 11:40:24 px-ws kernel: ? schedule+0x32/0x80
Sep 10 11:40:24 px-ws kernel: ? handle_userfault+0x47e/0x780
Sep 10 11:40:24 px-ws kernel: do_wp_page+0x1bd/0x5a0
Sep 10 11:40:24 px-ws kernel: __handle_mm_fault+0x7f9/0x1250
Sep 10 11:40:24 px-ws kernel: handle_mm_fault+0xfc/0x1f0
Sep 10 11:40:24 px-ws kernel: __do_page_fault+0x255/0x520
Sep 10 11:40:24 px-ws kernel: do_page_fault+0x32/0x110
Sep 10 11:40:24 px-ws kernel: ? page_fault+0x8/0x30
Sep 10 11:40:24 px-ws kernel: page_fault+0x1e/0x30
Sep 10 11:40:24 px-ws kernel: RIP: 0033:0x7f20ac9108c9
Sep 10 11:40:24 px-ws kernel: Code: 75 03 c1 ef 07 48 81 e6 00 f0 ff ff 81 e7 e0 1f 00 00 49 8d bc 3e 40 57 00 00 48 3b 37 48 8b f5 0f 85 32 00 00 00 48 03 77 10 <89> 1e 49f
Sep 10 11:40:24 px-ws kernel: RSP: 002b:00007f20abdda390 EFLAGS: 00010202
Sep 10 11:40:24 px-ws kernel: RAX: 00007f20ac915b80 RBX: 000000003fec3baa RCX: 0000000000007318
Sep 10 11:40:24 px-ws kernel: RDX: 0000000000000000 RSI: 00007f205fe7fed0 RDI: 000055cc4efef980
Sep 10 11:40:24 px-ws kernel: RBP: 000000000007fed0 R08: 0000000000000000 R09: 0000000000000007
Sep 10 11:40:24 px-ws kernel: R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000242
Sep 10 11:40:24 px-ws kernel: R13: 0000000000000000 R14: 000055cc4efe9260 R15: 00007f20abddb700

The first four lines of mwriteprotect_range() are trying to set things
up (they have wp=1) which seems fine. Note that lines 5-6 of
mwriteprotect_range() entry (marked with [1]) is the wp=0 one where
QEMU tries to recover a write protect page fault for a 4K page. We
can see that the vm_flags has VM_WRITE properly set (bit 2 of
0xa0121073) rather than missing.

Though we can see the newprot didn't really have that VM_WRITE set but
it's expected since in vm_get_page_prot (further, which is actually
protection_map) we'll drop that write bit due to the protect mapping
(READ+WRITE will map to PAGE_COPY, which does not have VM_WRITE set).

>
> To me this all points out so a bug somewhere in userspace or a miss use
> of userfaultfd. Here is what i believe to be the chain of event:
>
> 1 QEMU or vCPU write to the affected ufaultfd range and this set the pte
> dirty bit on all the entry in the affected range
>
> ...
>
> 2 QEMU write protect the affected range with UFFDIO_WRITEPROTECT and
> UFFDIO_WRITEPROTECT_MODE_WP flag set. This clear the pte write bit
> and thus write protect the range. Because it is anonymous memory and
> soft dirty is likely disabled, the dirty bit set in 1 is still there
> and is preserved.
>
> 3 vCPU tries to write to the affected range. This trigger a userfaultfd
> and QEMU handle it and call UFFDIO_WRITEPROTECT but this time without
> UFFDIO_WRITEPROTECT_MODE_WP flag (ie to unprotect).
>
> For some reasons the affected vma do not have the VM_WRITE flags set
> anymore probably through mprotect() syscall by QEMU. So that the new
> prot for the pte do not have the write bit set.

Please refers to [1] above. The crash happens stably every time if
without my fix patch applied.

>
> But because of the change_pte_range() optimization and because the
> pte dirty bit is set from 1 then the pte write bit set which is wrong
> as the VM_WRITE have been clear.

I actually have dumped the dirty flag there too and it's missing, and
that's why I think we should have the bit. It's indeed a bit awkward
at least to me since when running with the userfault-wp tree the dirty
bit optimization becomes more like a correctness issue rather than a
performance issue.

>
>
> So hence there is a bug in QEMU somewhere is my best guess.
>
> Note that this means that the:
> https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=b16cb9fcb76bec59cbe1427e73246dc81a4942e2
>
> Needs to be updated with:
> -change_protection(dst_vma, start, start + len, newprot, !enable_wp, 0);
> +change_protection(dst_vma, start, start + len, newprot, 0, 0);

I'm not sure about this suggestion as well - if we keep the
change_pte_range() with dirty_accountable=false, then how could we
further apply the WM_WRITE flag at all with current implementation
(note that with the dirty bit optimization, we'll just ignore the
VM_WRITE bit if dirty_accountable is false...)? If without VM_WRITE,
how could we fix a write-protect page fault after all from userspace?

Please correct me if I missed anything important.

Thanks!

--
Peter Xu