Re: Handling NUMA page migration

From: Frank Mehnert
Date: Thu Jun 06 2013 - 06:09:41 EST


On Wednesday 05 June 2013 14:34:00 Mel Gorman wrote:
> On Wed, Jun 05, 2013 at 12:35:35PM +0200, Frank Mehnert wrote:
> > On Wednesday 05 June 2013 12:10:19 Mel Gorman wrote:
> > > On Tue, Jun 04, 2013 at 06:58:07AM -0500, Robin Holt wrote:
> > > > > B) 1. allocate memory with alloc_pages()
> > > > >
> > > > > 2. SetPageReserved()
> > > > > 3. vm_mmap() to allocate a userspace mapping
> > > > > 4. vm_insert_page()
> > > > > 5. vm_flags |= (VM_DONTEXPAND | VM_DONTDUMP)
> > > > >
> > > > > (resulting flags are VM_MIXEDMAP | VM_DONTDUMP |
> > > > > VM_DONTEXPAND | 0xff)
> > > > >
> > > > > At least the memory allocated like B) is affected by automatic NUMA
> > > > > page migration. I'm not sure about A).
> > > > >
> > > > > 1. How can I prevent automatic NUMA page migration on this memory?
> > > > > 2. Can NUMA page migration also be handled on such kind of memory
> > > > > without
> > > > >
> > > > > preventing migration?
> > >
> > > Page migration does not expect a PageReserved && PageLRU page. The only
> > > reserved check that is made by migration is for the zero page and that
> > > happens in the syscall path for move_pages() which is not used by
> > > either compaction or automatic balancing.
> > >
> > > At some point you must have a driver that is setting PageReserved on
> > > anonymous pages that is later encountered by automatic numa balancing
> > > during a NUMA hinting fault. I expect this is an out-of-tree driver or
> > > a custom kernel of some sort. Memory should be pinned by elevating the
> > > reference count of the page, not setting PageReserved.
> >
> > Yes, this is ring 0 code from VirtualBox. The VBox ring 0 driver does the
> > steps which are shown above. Setting PageReserved is not only for pinning
> > but also for fork() protection.
>
> Offhand I don't see what setting PageReserved on an LRU page has to do
> with fork() protection. If the VMA should not be copied by fork then use
> MADV_DONTFORK.

I'm not sure either. That code has grown over years and was even working
on Linux 2.4.

> > I've tried to do get_page() as well but
> > it did not help preventing the migration during NUMA balancing.
>
> I think you mean elevating the page count did not prevent the unmapping.
> The elevated count should have prevented the actual migration but would
> not prevent the unmapping.

Right, that's what I meant and your explanations make sense to me.

> > As I wrote, the code for allocating + mapping the memory assumes that
> > the memory is finally pinned and will be never unmapped. That assumption
> > might be wrong or wrong under certain/rare conditions. I would like to
> > know these conditions and how we can prevent them from happening or how
> > we can handle them correctly.
>
> Memory compaction for THP allocations will break that assumption as
> compaction ignores VM_LOCKED. I strongly suspect that if you did something
> like move a process into a cpuset bound to another node that it would
> also break. If a process like numad is running then it would probably
> break virtualbox as well as it triggers migration from userspace. It is
> a fragile assumption to make.
>
> > > It's not particularly clear how you avoid hitting the same bug due to
> > > THP and memory compaction to be honest but maybe your setup hits a
> > > steady state that simply never hit the problem or it happens rarely
> > > and it was not identified.
> >
> > I'm currently using the stock Ubuntu 13.04 generic kernel (3.8.0-23),
>
> and an out-of-tree driver which is what is hitting the problem.

Right.

> A few of your options in order of estimated time to completion are;
>
> 1. Disable numa balancing within your driver or fail to start if it's
> running
> 2. Create a patch that adds a new NUMA_PTE_SCAN_IGNORE value for
> mm->first_nid (see includ/linux.mm_types.h). In sched/core/fair.c,
> add a check that first_nid == NUMA_PTE_SCAN_IGNORE should be ignored.
> Document that only virtualbox needs this and set it within your
> driver. This will not fix the compaction cases or numad using cpusets
> to migrate your processes though
> 3. When the driver affects a region, set mm->numa_next_reset and
> mm->numa_next_scan to large values to prevent the pages being unmapped.
> This would be very fragile, could break again in the future and is ugly
> 4. Add a check in change_pte_range() for the !prot_numa case to check
> PageReserved. This will prevent automatic numa balancing unmapping the
> page. Document that only virtualbox requires this.
> 5. Add a check in change_pte_range() for an elevated page count.
> Document that there is no point unmapping a page for a NUMA hinting
> fault that will only fail migration later anyway which is true albeit of
> marginal benefit. Then, in the vbox driver, elevate the page count, do
> away with the PageReserved trick, use MADV_DONTFORK to prevent copying
> at fork time.

Thank you for these suggestions! For now I tried your suggestion 4) although
I think you meant the prot_numa case, not the !prot_numa case, correct?

It also turned out that we even must not do ptep_modify_prot_start() for such
ranges, therefore I added the PageReserved() check like this:

--- mm/mprotect.c 2013-06-05 18:24:41.564777871 +0200
+++ mm/mprotect.c 2013-06-05 17:16:47.689923398 +0200
@@ -54,14 +54,22 @@
pte_t ptent;
bool updated = false;

+ struct page *page;
+
+ page = vm_normal_page(vma, addr, oldpte);
+ if (page && PageReserved(page))
+ continue;
+
ptent = ptep_modify_prot_start(mm, addr, pte);
if (!prot_numa) {
ptent = pte_modify(ptent, newprot);
updated = true;
} else {
+#if 0
struct page *page;
page = vm_normal_page(vma, addr, oldpte);
+#endif
if (page) {
int this_nid = page_to_nid(page);
if (last_nid == -1)

With this change I cannot reproduce any problems anymore.

Adding such a change to the kernel would help us a lot. OTOH I wonder why it
is not possible to prevent these unmaps with other means, for instance for
VM arease with VM_IO set. Wouldn't that make sense?

What I didn't mention explicitely in my previous postings: I assume that all
these problems come also from using R3 addresses from R0 code. That might be
evil but VirtualBox does currently map the complete guest address space into
the address space of the corresponding host process for simplicity reasons.
Mapping into R0 isn't possible, at least not on 32-bit hosts. But I would
like to know if R0 mappings (vmap()) would be affected by any kind of page
migration.

Thanks,

Frank
--
Dr.-Ing. Frank Mehnert | Software Development Director, VirtualBox
ORACLE Deutschland B.V. & Co. KG | Werkstr. 24 | 71384 Weinstadt, Germany

Hauptverwaltung: Riesstr. 25, D-80992 München
Registergericht: Amtsgericht München, HRA 95603
Geschäftsführer: Jürgen Kunz

Komplementärin: ORACLE Deutschland Verwaltung B.V.
Hertogswetering 163/167, 3543 AS Utrecht, Niederlande
Handelsregister der Handelskammer Midden-Niederlande, Nr. 30143697
Geschäftsführer: Alexander van der Ven, Astrid Kepper, Val Maher

Attachment: signature.asc
Description: This is a digitally signed message part.