Re: [BUG] vfio device assignment regression with THP ref counting redesign
From: Alex Williamson
Date: Thu Apr 28 2016 - 14:58:14 EST
On Thu, 28 Apr 2016 21:17:26 +0300
"Kirill A. Shutemov" <kirill@xxxxxxxxxxxxx> wrote:
> On Thu, Apr 28, 2016 at 10:20:51AM -0600, Alex Williamson wrote:
> > Hi,
> >
> > vfio-based device assignment makes use of get_user_pages_fast() in order
> > to pin pages for mapping through the iommu for userspace drivers.
> > Until the recent redesign of THP reference counting in the v4.5 kernel,
> > this all worked well. Now we're seeing cases where a sanity test
> > before we release our "pinned" mapping results in a different page
> > address than what we programmed into the iommu. So something is
> > occurring which pretty much negates the pinning we're trying to do.
> >
> > The test program I'm using is here:
> >
> > https://github.com/awilliam/tests/blob/master/vfio-iommu-map-unmap.c
> >
> > Apologies for lack of makefile, simply build with gcc -o <out> <in.c>.
> >
> > To run this, enable the IOMMU on your system - enable in BIOS plus add
> > intel_iommu=on to the kernel commandline (only Intel x86_64 tested).
> >
> > Pick a target PCI device, it doesn't matter what it is, the test only
> > needs a device for the purpose of creating an iommu domain, the device
> > is never actually touched. In my case I use a spare NIC at 00:19.0.
> > libvirt tools are useful for setting this up, simply run 'virsh
> > nodedev-detach pci_0000_00_19_0'. Otherwise bind the device manually
> > to vfio-pci using the standard new_id bind (ask, I can provide
> > instructions).
> >
> > I also tweak THP scanning to make sure it is actively trying to
> > collapse pages:
> >
> > echo always > /sys/kernel/mm/transparent_hugepage/defrag
> > echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
> > echo 65536 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
> >
> > Run the test with 'vfio-iommu-map-unmap 0000:00:19.0', or your chosen
> > target device.
> >
> > Of course to see that the mappings are moving, we need additional
> > sanity testing in the vfio iommu driver. For that:
> >
> > https://github.com/awilliam/linux-vfio/commit/379f324e3629349a7486018ad1cc5d4877228d1e
> >
> > When we map memory for vfio, we use get_user_pages_fast() on the
> > process vaddr to give us a page. page_to_pfn() then gives us the
> > physical memory address which we program into the iommu. Obviously we
> > expect this mapping to be stable so long as we hold the page
> > reference. On unmap we generally retrieve the physical memory address
> > from the iommu, convert it back to a page, and release our reference to
> > it. The debug code above adds an additional sanity test where on unmap
> > we also call get_user_pages_fast() again before we're released the
> > mapping reference and compare whether the physical page address still
> > matches what we previously stored in the iommu. On a v4.4 kernel this
> > works every time. On v4.5+, we get mismatches in dmesg within a few
> > lines of output from the test program.
> >
> > It's difficult to bisect around the THP reference counting redesign
> > since THP becomes disabled for much of it. I have discovered that this
> > commit is a significant contributor:
> >
> > 1f25fe2 mm, thp: adjust conditions when we can reuse the page on WP fault
> >
> > Particularly the middle chunk in huge_memory.c. Reverting this change
> > alone significantly improves the problem, but does not lead to a stable
> > system.
> >
> > I'm not an mm expert, so I'm looking for help debugging this. As shown
> > above this issue is reproducible without KVM, so Andrea's previous KVM
> > specific fix to this code is not applicable. It also still occurs on
> > kernels as recent as v4.6-rc5, so the issue hasn't been silently fixed
> > yet. I'm able to reproduce this fairly quickly with the above test,
> > but it's not hard to imagine a test w/o any iommu dependencies which
> > simply does a user directed get_user_pages_fast() on a set of userspace
> > addresses, retains the reference, and at some point later rechecks that
> > a new get_user_pages_fast() results in the same page address. It
> > appears that any sort of device assignment, either vfio or legacy kvm,
> > should be susceptible to this issue and therefore unsafe to use on v4.5+
> > kernels without using explicit hugepages or disabling THP. Thanks,
>
> I'm not able to reproduce it so far. How long does it usually take?
Generally within the first line of output from the test program.
> How much memory your system has? Could you share your kernel config?
24G, dual-socket Ivy Brdige EP.
Config:
https://paste.fedoraproject.org/360803/14618689/
> I've modified your instrumentation slightly to provide more info.
> Could you try this:
Thanks! Results in:
[ 83.429809] page:ffffea0010e57fc0 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[ 83.439696] flags: 0x2fffff80000000()
[ 83.443408] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[ 83.454001] flags: 0x2fffff80044048(uptodate|active|head|swapbacked)
[ 83.460456] page:ffffea0018a67fc0 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[ 83.470298] flags: 0x6fffff80000000()
[ 83.473973] page dumped because: 1
[ 83.477412] page:ffffea0010e57f80 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[ 83.487283] flags: 0x2fffff80000000()
[ 83.490969] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[ 83.501502] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[ 83.508915] page:ffffea0018a67f80 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[ 83.518758] flags: 0x6fffff80000000()
[ 83.522443] page dumped because: 1
[ 83.525874] page:ffffea0010e57f40 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[ 83.535737] flags: 0x2fffff80000000()
[ 83.539434] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[ 83.549979] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[ 83.557412] page:ffffea0018a67f40 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[ 83.567260] flags: 0x6fffff80000000()
[ 83.570943] page dumped because: 1
[ 83.574366] page:ffffea0010e57f00 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[ 83.584211] flags: 0x2fffff80000000()
[ 83.587878] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[ 83.598413] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[ 83.605862] page:ffffea0018a67f00 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[ 83.615722] flags: 0x6fffff80000000()
[ 83.619399] page dumped because: 1
[ 83.622835] page:ffffea0010e57ec0 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[ 83.632673] flags: 0x2fffff80000000()
[ 83.636363] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[ 83.646893] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[ 83.654302] page:ffffea0018a67ec0 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[ 83.664150] flags: 0x6fffff80000000()
[ 83.667840] page dumped because: 1
[ 83.671255] page:ffffea0010e57e80 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[ 83.681108] flags: 0x2fffff80000000()
[ 83.684783] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[ 83.695335] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[ 83.702773] page:ffffea0018a67e80 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[ 83.712640] flags: 0x6fffff80000000()
[ 83.716335] page dumped because: 1
[ 83.719746] page:ffffea0010e57e40 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[ 83.729591] flags: 0x2fffff80000000()
[ 83.733279] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[ 83.743843] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[ 83.751268] page:ffffea0018a67e40 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[ 83.761108] flags: 0x6fffff80000000()
[ 83.764784] page dumped because: 1
[ 83.768206] page:ffffea0010e57e00 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[ 83.778076] flags: 0x2fffff80000000()
[ 83.781754] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[ 83.792283] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[ 83.799712] page:ffffea0018a67e00 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[ 83.809559] flags: 0x6fffff80000000()
[ 83.813257] page dumped because: 1
[ 83.816722] page:ffffea0010e57dc0 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[ 83.826605] flags: 0x2fffff80000000()
[ 83.830285] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[ 83.840877] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[ 83.848321] page:ffffea0018a67dc0 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[ 83.858214] flags: 0x6fffff80000000()
[ 83.861899] page dumped because: 1
[ 83.865355] page:ffffea0010e57d80 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[ 83.875246] flags: 0x2fffff80000000()
[ 83.878930] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[ 83.889525] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[ 83.896970] page:ffffea0018a67d80 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[ 83.906883] flags: 0x6fffff80000000()
[ 83.910563] page dumped because: 1
[ 83.914018] page:ffffea0010e57d40 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[ 83.923862] flags: 0x2fffff80000000()
[ 83.927540] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[ 83.938079] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[ 83.945493] page:ffffea0018a67d40 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[ 83.955341] flags: 0x6fffff80000000()
[ 83.959022] page dumped because: 1
[ 83.962446] page:ffffea0010e57d00 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[ 83.972296] flags: 0x2fffff80000000()
[ 83.975980] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[ 83.986516] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[ 83.993932] page:ffffea0018a67d00 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[ 84.003778] flags: 0x6fffff80000000()
[ 84.007456] page dumped because: 1
...
As you can see by the kernel timestamp, this happened almost
immediately for me. Thanks for taking a look at this,
Alex