Re: f45ec5ff16 ("userfaultfd: wp: support swap and page migration"): [ 140.777858] BUG: Bad rss-counter state mm:b278fc66 type:MM_ANONPAGES val:1
From: Peter Xu
Date: Fri Apr 10 2020 - 11:33:28 EST
Hi, Hillf,
On Fri, Apr 10, 2020 at 03:32:09PM +0800, Hillf Danton wrote:
>
> On Fri, 10 Apr 2020 08:25:18 +0800
> > Greetings,
> >
> > 0day kernel testing robot got the below dmesg and the first bad commit is
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> >
> > commit f45ec5ff16a75f96dac8c89862d75f1d8739efd4
> > Author: Peter Xu <peterx@xxxxxxxxxx>
> > AuthorDate: Mon Apr 6 20:06:01 2020 -0700
> > Commit: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> > CommitDate: Tue Apr 7 10:43:39 2020 -0700
> >
> > userfaultfd: wp: support swap and page migration
> >
> > For either swap and page migration, we all use the bit 2 of the entry to
> > identify whether this entry is uffd write-protected. It plays a similar
> > role as the existing soft dirty bit in swap entries but only for keeping
> > the uffd-wp tracking for a specific PTE/PMD.
> >
> > Something special here is that when we want to recover the uffd-wp bit
> > from a swap/migration entry to the PTE bit we'll also need to take care of
> > the _PAGE_RW bit and make sure it's cleared, otherwise even with the
> > _PAGE_UFFD_WP bit we can't trap it at all.
> >
> > In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
> > That can lead to data mismatch if the page that we are going to write
> > protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch
> > also applies/removes the uffd-wp bit even for the swap entries.
> >
> Have trouble understanding the last sentence in the paragraph above and
> particularly linking it to the first one.
This patch handles the swap entry tracking for userfault-wp. I wanted
to express the fact that we didn't do that before. Sorry if my
wording is confusing.
>
> > If you fix the issue, kindly add following tag
> > Reported-by: kernel test robot <lkp@xxxxxxxxx>
> >
> > [child3:925] eventfd (323) returned ENOSYS, marking as inactive.
> > [ 132.014801] can: request_module (can-proto-2) failed.
> > [ 132.063717] can: request_module (can-proto-2) failed.
> > [ 137.186037] trinity-c2 (943) used greatest stack depth: 5804 bytes left
> > [ 140.771486] MCE: Killing trinity-c2:956 due to hardware memory corruption fault at 8bd2060
> > [ 140.777858] BUG: Bad rss-counter state mm:b278fc66 type:MM_ANONPAGES val:1
> > [ 140.778736] BUG: Bad rss-counter state mm:b278fc66 type:MM_SHMEMPAGES val:2
> > [ 141.589424] MCE: Killing trinity-c3:940 due to hardware memory corruption fault at 8a8c860
> > [ 141.590730] swap_info_get: Bad swap file entry 700b8216
Should be a 32bit guest, swap type == 11100b. I guess this also means
MAX_SWAPFILES is bigger than 11100b.
> > [ 141.591400] BUG: Bad page map in process trinity-c3 pte:17042c3c pmd:b1809067
> > [ 141.592304] addr:08bcf000 vm_flags:00100073 anon_vma:f1f29528 mapping:00000000 index:8bcf
> > [ 141.593399] file:(null) fault:0x0 mmap:0x0 readpage:0x0
> > [ 141.594065] CPU: 0 PID: 940 Comm: trinity-c3 Not tainted 5.6.0-11490-gf45ec5ff16a75 #1
> > [ 141.595055] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > [ 141.596093] Call Trace:
> > [ 141.596443] dump_stack+0x16/0x18
> > [ 141.596868] print_bad_pte+0x13f/0x159
> > [ 141.597367] unmap_page_range+0x2a7/0x3e7
> > [ 141.597893] unmap_single_vma+0x53/0x5d
> > [ 141.598383] unmap_vmas+0x2c/0x3b
> > [ 141.598811] exit_mmap+0x81/0xfc
> > [ 141.599238] __mmput+0x25/0x8d
> > [ 141.599633] mmput+0x28/0x2b
> > [ 141.600007] do_exit+0x2f0/0x84a
> > [ 141.600449] ? ___might_sleep+0x3f/0x11f
> > [ 141.600949] do_group_exit+0x86/0x86
> > [ 141.601421] __ia32_sys_exit_group+0x15/0x15
> > [ 141.601965] do_fast_syscall_32+0x86/0xbf
> > [ 141.602481] entry_SYSENTER_32+0xaf/0x101
>
>
> Because is_swap_pte(oldpte) != IS_ENABLED(CONFIG_MIGRATION)), restore the
> old behavior by modifying uffd_wp only for pte that is __not__ swap entry,
> as the commit log says.
>
> --- b/mm/mprotect.c
> +++ c/mm/mprotect.c
> @@ -139,7 +139,7 @@ static unsigned long change_pte_range(st
> }
> ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> pages++;
> - } else if (is_swap_pte(oldpte)) {
> + } else if (IS_ENABLED(CONFIG_MIGRATION)) {
> swp_entry_t entry = pte_to_swp_entry(oldpte);
> pte_t newpte;
>
> @@ -154,7 +154,9 @@ static unsigned long change_pte_range(st
> newpte = pte_swp_mksoft_dirty(newpte);
> if (pte_swp_uffd_wp(oldpte))
> newpte = pte_swp_mkuffd_wp(newpte);
> - } else if (is_write_device_private_entry(entry)) {
> + }
> +
> + if (is_write_device_private_entry(entry)) {
> /*
> * We do not preserve soft-dirtiness. See
> * copy_one_pte() for explanation.
> @@ -163,11 +165,18 @@ static unsigned long change_pte_range(st
> newpte = swp_entry_to_pte(entry);
> if (pte_swp_uffd_wp(oldpte))
> newpte = pte_swp_mkuffd_wp(newpte);
> - } else {
> - newpte = oldpte;
> }
>
> - if (uffd_wp)
> + /*
> + * do nothing for changing uffd_wp if oldpte is a
> + * swap entry.
> + * That can lead to data mismatch if the page we
> + * are going to write protect is swapped out when
> + * sending the UFFDIO_WRITEPROTECT.
> + */
> + if (is_swap_pte(oldpte))
> + newpte = oldpte;
> + else if (uffd_wp)
> newpte = pte_swp_mkuffd_wp(newpte);
> else if (uffd_wp_resolve)
> newpte = pte_swp_clear_uffd_wp(newpte);
I'm not sure this is correct. As I mentioned, the commit wanted to
apply the uffd-wp bit even for the swap entries so that even the swap
entries got swapped in, the page will still be write protected. So
IIUC think we can't remove that.
Above report happened at __swap_info_get() where we should have
triggered type >= READ_ONCE(nr_swapfiles) in swap_type_to_swap_info(),
which, iiuc, means the type is considered wrong. The mistery to me is
how did the swap type went wrong. E.g., iiuc the
"is_swap_pte(oldpte)" replacement to "IS_ENABLED(CONFIG_MIGRATION)"
shouldn't affect this, because even if MIGRATION is not enabled, both
is_write_migration_entry() and is_write_device_private_entry() should
return constant zero after all, so they shouldn't be thouching the
swap type.
I'm still trying to digest on what's happened... It would be good too
if more information on the test could be given, e.g., what is the
behavior of trinity-c2. A reproducer is of course even better.
Thanks,
--
Peter Xu