[PATCH 0/2] sched,numa: cap pte scanning overhead to 3% of run time

From: riel
Date: Thu Nov 05 2015 - 15:56:36 EST

Jan Stancek identified an LTP stress test causing trouble with the
NUMA balancing code. The test forks off enough 3GB sized tasks to
fill up 80% of system memory on a system with 12TB RAM. That results
in over 2000 tasks allocating and touching memory simultaneously.

The NUMA balancing code causes each task to scan a certain number of PTEs
every 10ms. Due to the large number of tasks on the system, and the large
amount of memory in each process, it may take 10ms for each task to finish
its PTE scan.

Meanwhile, the NUMA code only tries to ensure each task has used a few (2-3)
ms of CPU time in-between invocations of task_numa_work.

On a system that is this overloaded, we end up spending essentially all
of our CPU time in task_numa_work, and the tasks make very little progress.
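
As a rough illustration of the numbers above, the scan overhead can be
estimated with a simple ratio. This is only a back-of-the-envelope model,
not kernel code: the helper name scan_overhead_percent is hypothetical, and
it makes the optimistic assumption that the 2-3 ms between invocations is
spent entirely on useful work.

```c
#include <assert.h>

/*
 * Hypothetical helper: share of a task's CPU time (in percent, rounded
 * down) that goes to PTE scanning, given the cost of one scan pass and
 * the useful run time between two passes.
 */
unsigned int scan_overhead_percent(unsigned int scan_ms, unsigned int run_ms)
{
	assert(scan_ms + run_ms > 0);
	return scan_ms * 100 / (scan_ms + run_ms);
}
```

With a 10 ms scan, scan_overhead_percent(10, 3) evaluates to 76 and
scan_overhead_percent(10, 2) to 83, so even this optimistic model has the
tasks spending most of their CPU time scanning.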

Allocating all the memory can take several hours.

With these patches, the CPU time spent in task_numa_work is limited to
around 3% of run time, and the test case completes in minutes.
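
One way to picture such a cap is to push back the next scan by a multiple
of the time the last scan took. The sketch below is a user-space
illustration under stated assumptions, not the code from the patches: the
names RUN_TO_SCAN_RATIO, next_scan_allowed_ns and capped_overhead_permille
are hypothetical, and the multiplier of 32 is chosen here only because
1/33 is roughly 3%.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed multiplier: run 32x as long as the scan took, so 1/33 ~= 3%. */
#define RUN_TO_SCAN_RATIO 32

/*
 * Earliest point, in task CPU time, at which the next scan may start,
 * given the task's CPU time when the scan ended and what the scan cost.
 */
uint64_t next_scan_allowed_ns(uint64_t runtime_ns, uint64_t scan_cost_ns)
{
	return runtime_ns + RUN_TO_SCAN_RATIO * scan_cost_ns;
}

/* Resulting scan overhead in parts per thousand, rounded down. */
unsigned int capped_overhead_permille(uint64_t scan_cost_ns)
{
	uint64_t run_ns = RUN_TO_SCAN_RATIO * scan_cost_ns;

	assert(scan_cost_ns > 0);
	return (unsigned int)(scan_cost_ns * 1000 / (scan_cost_ns + run_ns));
}
```

Under these assumptions, a scan that took 10 ms would not be repeated until
the task has accumulated another 320 ms of CPU time, which keeps the
scanning share at about 30 per thousand regardless of how slow each scan is.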
