[PATCH] sched,numa: limit amount of virtual memory scanned in task_numa_work

From: Rik van Riel
Date: Fri Sep 11 2015 - 09:01:06 EST


Currently task_numa_work scans up to numa_balancing_scan_size_mb worth
of memory per invocation, but only counts memory areas that have at
least one PTE that is still present and not marked for numa hint faulting.

It will skip over arbitarily large amounts of memory that are either
unused, full of swap ptes, or full of PTEs that were already marked
for NUMA hint faults but have not been faulted on yet.

This can cause excessive amounts of CPU use, due to there being
essentially no upper limit on the scan rate of very large processes
that are not yet in a phase where they are actively accessing old
memory pages (eg. they are still initializing their data).

Avoid that problem by placing an upper limit on the amount of virtual
memory that task_numa_work scans in each invocation. This can be a
higher limit than "pages", to ensure the task still skips over unused
areas fairly quickly.

While we are here, also fix the "nr_pte_updates" logic, so it only
counts page ranges with ptes in them.

Signed-off-by: Rik van Riel <riel@xxxxxxxxxx>
Reported-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Reported-by: Jan Stancek <jstancek@xxxxxxxxxx>
---
kernel/sched/fair.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e2e3483b1ec..ff51b559ccaf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2157,7 +2157,7 @@ void task_numa_work(struct callback_head *work)
struct vm_area_struct *vma;
unsigned long start, end;
unsigned long nr_pte_updates = 0;
- long pages;
+ long pages, virtpages;

WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));

@@ -2203,9 +2203,11 @@ void task_numa_work(struct callback_head *work)
start = mm->numa_scan_offset;
pages = sysctl_numa_balancing_scan_size;
pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+ virtpages = pages * 8; /* Scan up to this much virtual space */
if (!pages)
return;

+
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
if (!vma) {
@@ -2240,18 +2242,22 @@ void task_numa_work(struct callback_head *work)
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
end = min(end, vma->vm_end);
- nr_pte_updates += change_prot_numa(vma, start, end);
+ nr_pte_updates = change_prot_numa(vma, start, end);

/*
- * Scan sysctl_numa_balancing_scan_size but ensure that
- * at least one PTE is updated so that unused virtual
- * address space is quickly skipped.
+ * Try to scan sysctl_numa_balancing_size worth of
+ * hpages that have at least one present PTE that
+ * is not already pte-numa. If the VMA contains
+ * areas that are unused or already full of prot_numa
+ * PTEs, scan up to virtpages, to skip through those
+ * areas faster.
*/
if (nr_pte_updates)
pages -= (end - start) >> PAGE_SHIFT;
+ virtpages -= (end - start) >> PAGE_SHIFT;

start = end;
- if (pages <= 0)
+ if (pages <= 0 || virtpages <= 0)
goto out;

cond_resched();
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/