[PATCH 0/16] Basic scheduler support for automatic NUMA balancing V4
From: Mel Gorman
Date: Thu Jul 11 2013 - 05:51:26 EST
This continues to build on the previous feedback and further testing. Peter
posted a patch that avoids overloading a destination node relative to a
source node by postponing the reschedule of tasks on a preferred node. I
took the load calculations but dropped the balancing part as it performed
badly on local tests. It was evident that false sharing within THP pages
is a problem and I think it would alleviate the overloading problem if it
was solved first. Shared accesses are still not properly used for selecting
preferred nodes due to the impact of false sharing within THP pages.
Changelog since V3
o Correct detection of unset last nid/pid information
o Dropped nr_preferred_running and replaced it with Peter's load balancing
o Pass in correct node information for THP hinting faults
o Pressure tasks sharing a THP page to move towards same node
o Do not set pmd_numa if false sharing is detected
Changelog since V2
o Reshuffle to match Peter's implied preference for layout
o Reshuffle to move private/shared split towards end of series to make it
easier to evaluate the impact
o Use PID information to identify private accesses
o Set the floor for PTE scanning based on virtual address space scan rates
instead of time
o Some locking improvements
o Do not preempt pinned tasks unless they are kernel threads
Changelog since V1
o Scan pages with elevated map count (shared pages)
o Scale scan rates based on the vsz of the process so the sampling of the
task is independent of its size
o Favour moving towards nodes with more faults even if it's not the
preferred node
o Laughably basic accounting of a compute overloaded node when selecting
the preferred node.
o Applied review comments
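The vsz-based scan rate scaling mentioned in the V1 changelog can be
pictured with a small userspace model. This is only a sketch in Python and
not the kernel code: the function names, the 256MB scan window and the
one second full-pass target are all assumptions picked for illustration.

```python
import math

MB = 1 << 20

def scan_windows(vsz_bytes, scan_size_bytes=256 * MB):
    """Number of fixed-size scan windows needed to cover the address space once.

    scan_size_bytes is a hypothetical window size, not a value from the series.
    """
    return max(1, math.ceil(vsz_bytes / scan_size_bytes))

def scan_delay_ms(vsz_bytes, full_pass_ms=1000):
    """Delay between scan windows so that one full pass over the address
    space takes roughly full_pass_ms, whatever the size of the task."""
    return full_pass_ms / scan_windows(vsz_bytes)
```

With these made-up numbers a 1GB task is scanned in four windows 250ms
apart, while a 256MB task gets a single window per second, so both are
sampled once per pass.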
This series integrates basic scheduler support for automatic NUMA balancing.
It borrows very heavily from Peter Zijlstra's work in "sched, numa, mm:
Add adaptive NUMA affinity support" but deviates too much to preserve
Signed-off-bys. As before, if the relevant authors are ok with it I'll
add Signed-off-bys (or add them yourselves if you pick the patches up).
This is still far from complete and there are known performance gaps
between this series and manual binding (when that is possible). As before,
the intention is not to complete the work but to incrementally improve
mainline and preserve bisectability for any bug reports that crop up.
Unfortunately, performance may be worse in some cases. When that happens
it will have to be judged whether the system overhead is lower and, if so,
whether it is still an acceptable direction as a stepping stone to
something better.
Patch 1 adds sysctl documentation
Patch 2 tracks NUMA hinting faults per-task and per-node
Patch 3 corrects a THP NUMA hint fault accounting bug
Patch 4 avoids trying to migrate the THP zero page
Patches 5-7 select a preferred node at the end of a PTE scan based on which
node incurred the highest number of NUMA faults. When the balancer
is comparing two CPUs it will prefer to locate tasks on their
preferred node.
Patch 8 reschedules a task when a preferred node is selected if it is not
running on that node already. This avoids waiting for the scheduler
to move the task slowly.
Patch 9 adds infrastructure to allow separate tracking of shared/private
pages but treats all faults as if they are private accesses. Laying
it out this way reduces churn later in the series when private
fault detection is introduced
Patch 10 replaces the PTE scanning reset hammer and instead increases the
scanning rate when an otherwise settled task changes its
preferred node.
Patch 11 avoids some unnecessary allocation
Patch 12 sets the scan rate proportional to the size of the task being scanned.
Patches 13-14 kick away some training wheels and scan shared pages and small VMAs.
Patch 15 introduces private fault detection based on the PID of the faulting
process and accounts for shared/private accesses differently
Patch 16 picks the least loaded CPU on the preferred node based on a scheduling
domain common to both the source and destination NUMA node.
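As a rough illustration of the placement idea running through the series
(per-node fault counts, preferred node selection, least loaded CPU on that
node), here is a userspace sketch in Python. It is not the kernel
implementation: the function names, the dictionaries and the tie-breaking
rules are assumptions, and it ignores scheduling domains entirely.

```python
def preferred_node(faults_per_node):
    """Return the node id with the most recorded NUMA hinting faults.

    faults_per_node maps node id -> fault count for one task.
    Ties go to the lowest node id (an arbitrary choice for this sketch).
    """
    return max(sorted(faults_per_node), key=lambda nid: faults_per_node[nid])

def least_loaded_cpu(cpu_load, cpu_to_node, node):
    """Return the CPU on `node` with the smallest load, or None if the
    node has no CPUs. Ties go to the lowest CPU id."""
    candidates = [cpu for cpu, nid in cpu_to_node.items() if nid == node]
    if not candidates:
        return None
    return min(candidates, key=lambda cpu: (cpu_load[cpu], cpu))

# Hypothetical task: most hinting faults were incurred on node 1.
faults = {0: 120, 1: 870, 2: 45, 3: 300}
cpu_to_node = {0: 0, 1: 0, 2: 1, 3: 1}
cpu_load = {0: 5, 1: 1, 2: 7, 3: 3}

node = preferred_node(faults)
cpu = least_loaded_cpu(cpu_load, cpu_to_node, node)
```

In this example the task prefers node 1 and would be steered towards CPU 3,
the less loaded of the two CPUs on that node.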
Testing on this is only partial as full tests take a long time to run. A
full specjbb for both single and multi takes over 4 hours. NPB D class
also takes a few hours. With all the kernels in question, it still takes
a weekend to churn through them all.
Kernel 3.9 is still the testing baseline. The following kernels were tested
o vanilla          vanilla kernel with automatic numa balancing enabled
o favorpref-v4     Patches 1-11
o scanshared-v4    Patches 1-14
o splitprivate-v4  Patches 1-15
o accountload-v4   Patches 1-16
This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running for the whole system. Only a limited number of clients are executed
to save on time.
specjbb
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v4 scanshared-v4 splitprivate-v4 accountload-v4
TPut 1 26099.00 ( 0.00%) 24726.00 ( -5.26%) 23924.00 ( -8.33%) 24788.00 ( -5.02%) 23692.00 ( -9.22%)
TPut 7 187276.00 ( 0.00%) 190315.00 ( 1.62%) 189450.00 ( 1.16%) 185294.00 ( -1.06%) 183639.00 ( -1.94%)
TPut 13 318028.00 ( 0.00%) 340088.00 ( 6.94%) 330785.00 ( 4.01%) 334663.00 ( 5.23%) 333818.00 ( 4.96%)
TPut 19 368547.00 ( 0.00%) 422009.00 ( 14.51%) 401622.00 ( 8.97%) 448669.00 ( 21.74%) 447950.00 ( 21.54%)
TPut 25 377522.00 ( 0.00%) 442038.00 ( 17.09%) 413670.00 ( 9.58%) 499595.00 ( 32.34%) 506872.00 ( 34.26%)
TPut 31 347642.00 ( 0.00%) 425809.00 ( 22.48%) 382499.00 ( 10.03%) 487862.00 ( 40.33%) 468347.00 ( 34.72%)
TPut 37 313439.00 ( 0.00%) 402418.00 ( 28.39%) 350941.00 ( 11.96%) 467847.00 ( 49.26%) 437945.00 ( 39.72%)
TPut 43 291958.00 ( 0.00%) 363120.00 ( 24.37%) 313203.00 ( 7.28%) 422984.00 ( 44.88%) 384563.00 ( 31.72%)
First off, note what the shared/private split patch does. Once we start
scanning all pages there is a degradation in performance as the shared page
faults introduce noise to the statistics. All indications are that this is
caused by false sharing within THP pages, which needs to be addressed. Splitting
the shared/private faults restores the performance and the key task in
the future is to use this shared/private information for maximum benefit.
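The PID-based private/shared classification can be modelled in userspace
roughly as follows. This is a sketch under assumptions, not the code from
the series: the class and function names are hypothetical, the choice to
keep only the low 8 bits of the PID per page is an assumption, and so is
treating the very first fault on a page as private.

```python
PID_BITS = 8                    # assumption: only the low PID bits are kept per page
PID_MASK = (1 << PID_BITS) - 1
UNSET = None                    # no previous hinting fault recorded on the page

class Page:
    def __init__(self):
        self.last_pid = UNSET

def hinting_fault(page, pid, stats):
    """Classify one NUMA hinting fault as private or shared and account it.

    The fault is private if the page was last touched by a task with the
    same (truncated) PID, or if nothing has been recorded yet.
    """
    tag = pid & PID_MASK
    private = page.last_pid is UNSET or page.last_pid == tag
    page.last_pid = tag
    stats["private" if private else "shared"] += 1
    return private

stats = {"private": 0, "shared": 0}
page = Page()
first = hinting_fault(page, 1000, stats)    # nothing recorded yet
again = hinting_fault(page, 1000, stats)    # same task faults again
other = hinting_fault(page, 1001, stats)    # a different task touches the page
```

Because only a truncated PID is stored in this model, two tasks whose PIDs
differ by a multiple of 256 are indistinguishable. Note also that a huge
page touched by two tasks in different places looks shared here, which is
the false sharing within THP pages described above.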
specjbb Peaks
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v4 scanshared-v4 splitprivate-v4 accountload-v4
Actual Warehouse 26.00 ( 0.00%) 26.00 ( 0.00%) 26.00 ( 0.00%) 26.00 ( 0.00%) 26.00 ( 0.00%)
Actual Peak Bops 377522.00 ( 0.00%) 442038.00 ( 17.09%) 413670.00 ( 9.58%) 499595.00 ( 32.34%) 506872.00 ( 34.26%)
Peak performance is improved overall.
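For anyone reading the percentage columns, the figures in these tables are
consistent with a simple gain relative to the vanilla column, signed so
that a positive number is an improvement. The helper below is a sketch
inferred from the numbers, not the reporting tool's code; its name and the
higher_is_better flag are assumptions. Throughput rows are higher-is-better
while time and stddev rows are lower-is-better.

```python
def gain_pct(baseline, value, higher_is_better=True):
    """Percentage change against baseline, signed so positive is better."""
    delta = value - baseline if higher_is_better else baseline - value
    return round(100.0 * delta / baseline, 2)

# TPut 25, vanilla vs accountload-v4 (throughput, higher is better)
tput_gain = gain_pct(377522.00, 506872.00)

# Elapsed NUMA02_SMT, vanilla vs favorpref-v4 (time, lower is better)
elapsed_gain = gain_pct(58.48, 47.79, higher_is_better=False)
```

These reproduce the 34.26% and 18.28% entries in the corresponding tables.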
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v4 scanshared-v4 splitprivate-v4 accountload-v4
User 5184.53 5177.92 5178.37 5177.24 5181.78
System 59.61 65.77 60.97 67.21 67.43
Elapsed 254.52 254.14 254.06 254.24 254.33
This is an increase in system CPU overhead that needs to be watched.
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v4 scanshared-v4 splitprivate-v4 accountload-v4
THP fault alloc 33297 34710 35229 34480 33510
THP collapse alloc 9 6 14 11 12
THP splits 3 3 3 4 1
THP fault fallback 0 0 0 0 0
THP collapse fail 0 0 0 0 0
Compaction stalls 0 0 0 0 0
Compaction success 0 0 0 0 0
Compaction failures 0 0 0 0 0
Page migrate success 1773768 1949772 1407218 4253043 4218882
Page migrate failure 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0
Compaction free scanned 0 0 0 0 0
Compaction cost 1841 2023 1460 4414 4379
NUMA PTE updates 17461135 18458997 14255329 15856615 16071944
NUMA hint faults 85873 172654 80923 91043 90465
NUMA hint local faults 27145 119972 32219 36020 34847
NUMA hint local percent 31 69 39 39 38
NUMA pages migrated 1773768 1949772 1407218 4253043 4218882
AutoNUMA cost 585 1029 531 647 644
It's interesting to note how much scanning shared pages affects the
percentage of local NUMA hinting faults. There is a lot more work to do
there. There are fewer PTE scan updates but there are a much larger number
of pages being migrated that will need examination. Due to the overall
performance the focus will still be on false THP sharing.
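The "NUMA hint local percent" row is just the ratio of the two rows above
it. A minimal sketch, assuming the percentage is truncated rather than
rounded (that is what the table values suggest, e.g. 595478 of 755901 is
reported as 78 in the autonumabench results); the function name is made up
for illustration.

```python
def local_percent(local_faults, hint_faults):
    """Share of NUMA hinting faults that were node-local, as a truncated
    integer percentage. Returns 0 when no hinting faults were recorded."""
    if hint_faults == 0:
        return 0
    return 100 * local_faults // hint_faults

vanilla = local_percent(27145, 85873)
favorpref = local_percent(119972, 172654)
```

This gives 31 for vanilla and 69 for favorpref-v4, matching the table.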
Next is the autonuma benchmark results. These were only run once so I have no
idea what the variance is. Obviously they could be run multiple times but with
this number of kernels we would die of old age waiting on the results.
autonumabench
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v4 scanshared-v4 splitprivate-v4 accountload-v4
User NUMA01 52623.86 ( 0.00%) 49514.41 ( 5.91%) 53783.60 ( -2.20%) 51205.78 ( 2.69%) 53501.03 ( -1.67%)
User NUMA01_THEADLOCAL 17595.48 ( 0.00%) 17620.51 ( -0.14%) 19734.74 (-12.16%) 16966.63 ( 3.57%) 17113.31 ( 2.74%)
User NUMA02 2043.84 ( 0.00%) 1993.04 ( 2.49%) 2051.29 ( -0.36%) 1901.96 ( 6.94%) 2035.80 ( 0.39%)
User NUMA02_SMT 1057.11 ( 0.00%) 1005.61 ( 4.87%) 980.19 ( 7.28%) 977.65 ( 7.52%) 972.60 ( 7.99%)
System NUMA01 414.17 ( 0.00%) 222.86 ( 46.19%) 145.79 ( 64.80%) 321.93 ( 22.27%) 344.93 ( 16.72%)
System NUMA01_THEADLOCAL 105.17 ( 0.00%) 102.35 ( 2.68%) 117.22 (-11.46%) 105.35 ( -0.17%) 102.54 ( 2.50%)
System NUMA02 9.36 ( 0.00%) 9.96 ( -6.41%) 13.02 (-39.10%) 9.53 ( -1.82%) 6.73 ( 28.10%)
System NUMA02_SMT 3.54 ( 0.00%) 3.53 ( 0.28%) 3.46 ( 2.26%) 5.85 (-65.25%) 4.49 (-26.84%)
Elapsed NUMA01 1201.52 ( 0.00%) 1143.59 ( 4.82%) 1244.61 ( -3.59%) 1182.92 ( 1.55%) 1208.74 ( -0.60%)
Elapsed NUMA01_THEADLOCAL 393.91 ( 0.00%) 392.49 ( 0.36%) 442.04 (-12.22%) 385.61 ( 2.11%) 386.43 ( 1.90%)
Elapsed NUMA02 50.30 ( 0.00%) 50.36 ( -0.12%) 49.53 ( 1.53%) 48.91 ( 2.76%) 49.23 ( 2.13%)
Elapsed NUMA02_SMT 58.48 ( 0.00%) 47.79 ( 18.28%) 51.56 ( 11.83%) 55.98 ( 4.27%) 56.34 ( 3.66%)
CPU NUMA01 4414.00 ( 0.00%) 4349.00 ( 1.47%) 4333.00 ( 1.84%) 4355.00 ( 1.34%) 4454.00 ( -0.91%)
CPU NUMA01_THEADLOCAL 4493.00 ( 0.00%) 4515.00 ( -0.49%) 4490.00 ( 0.07%) 4427.00 ( 1.47%) 4455.00 ( 0.85%)
CPU NUMA02 4081.00 ( 0.00%) 3977.00 ( 2.55%) 4167.00 ( -2.11%) 3908.00 ( 4.24%) 4148.00 ( -1.64%)
CPU NUMA02_SMT 1813.00 ( 0.00%) 2111.00 (-16.44%) 1907.00 ( -5.18%) 1756.00 ( 3.14%) 1734.00 ( 4.36%)
numa01 saw no major performance benefit with a mix of gains and losses
throughout the series for its system CPU usage. It is an adverse workload
for this machine so right now I'm not overly concerned with improving its
performance.
numa01_threadlocal saw a very small performance gain overall although
it is interesting to note that scanning shared pages hurt it badly. Again
I predict that better shared page detection will help here.
numa02 showed a small improvement but it should also be already running
close to as quickly as possible.
numa02_smt also shows a small improvement although again scanning shared
pages hurt and would benefit from improved handling there.
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v4 scanshared-v4 splitprivate-v4 accountload-v4
THP fault alloc 14325 11724 14906 13553 14403
THP collapse alloc 6 3 7 13 10
THP splits 4 1 4 2 2
THP fault fallback 0 0 0 0 0
THP collapse fail 0 0 0 0 0
Compaction stalls 0 0 0 0 0
Compaction success 0 0 0 0 0
Compaction failures 0 0 0 0 0
Page migrate success 9020528 9708110 6677767 6773951 6170746
Page migrate failure 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0
Compaction free scanned 0 0 0 0 0
Compaction cost 9363 10077 6931 7031 6405
NUMA PTE updates 119292401 114641446 85954812 74337906 75911999
NUMA hint faults 755901 499186 287825 237095 232126
NUMA hint local faults 595478 333483 152899 122210 128762
NUMA hint local percent 78 66 53 51 55
NUMA pages migrated 9020528 9708110 6677767 6773951 6170746
AutoNUMA cost 4785 3482 2167 1834 1809
As all the tests are mashed together it is not possible to make specific
conclusions on each testcase. However, in general the series is doing a lot
less work with PTE updates, faults and so on. The percentage of local faults
suffers but a large part of this seems to be around where shared pages are
getting scanned.
I also ran SpecJBB with THP enabled and one JVM running per
NUMA node in the system.
specjbb
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v4 scanshared-v4 splitprivate-v4 accountload-v4
Mean 1 30640.75 ( 0.00%) 31222.25 ( 1.90%) 31275.50 ( 2.07%) 30554.00 ( -0.28%) 30348.75 ( -0.95%)
Mean 10 136983.25 ( 0.00%) 133072.00 ( -2.86%) 140022.00 ( 2.22%) 119168.25 (-13.01%) 140998.00 ( 2.93%)
Mean 19 124005.25 ( 0.00%) 121016.25 ( -2.41%) 122189.00 ( -1.46%) 111813.75 ( -9.83%) 129100.75 ( 4.11%)
Mean 28 114672.00 ( 0.00%) 111643.00 ( -2.64%) 109175.75 ( -4.79%) 101199.50 (-11.75%) 116026.50 ( 1.18%)
Mean 37 110916.50 ( 0.00%) 105791.75 ( -4.62%) 103103.75 ( -7.04%) 100187.00 ( -9.67%) 108801.00 ( -1.91%)
Mean 46 110139.25 ( 0.00%) 105383.25 ( -4.32%) 99454.75 ( -9.70%) 99762.00 ( -9.42%) 104239.25 ( -5.36%)
Stddev 1 1002.06 ( 0.00%) 1125.30 (-12.30%) 959.60 ( 4.24%) 960.28 ( 4.17%) 1014.89 ( -1.28%)
Stddev 10 4656.47 ( 0.00%) 6679.25 (-43.44%) 5946.78 (-27.71%) 10427.37(-123.93%) 4039.93 ( 13.24%)
Stddev 19 2578.12 ( 0.00%) 5261.94 (-104.10%) 3414.66 (-32.45%) 5070.00 (-96.65%) 1849.10 ( 28.28%)
Stddev 28 4123.69 ( 0.00%) 4156.17 ( -0.79%) 6666.32 (-61.66%) 3899.89 ( 5.43%) 3081.40 ( 25.28%)
Stddev 37 2301.94 ( 0.00%) 5225.48 (-127.00%) 5444.18(-136.50%) 3490.87 (-51.65%) 1795.72 ( 21.99%)
Stddev 46 8317.91 ( 0.00%) 6759.04 ( 18.74%) 6587.32 ( 20.81%) 4458.49 ( 46.40%) 7387.32 ( 11.19%)
TPut 1 122563.00 ( 0.00%) 124889.00 ( 1.90%) 125102.00 ( 2.07%) 122216.00 ( -0.28%) 121395.00 ( -0.95%)
TPut 10 547933.00 ( 0.00%) 532288.00 ( -2.86%) 560088.00 ( 2.22%) 476673.00 (-13.01%) 563992.00 ( 2.93%)
TPut 19 496021.00 ( 0.00%) 484065.00 ( -2.41%) 488756.00 ( -1.46%) 447255.00 ( -9.83%) 516403.00 ( 4.11%)
TPut 28 458688.00 ( 0.00%) 446572.00 ( -2.64%) 436703.00 ( -4.79%) 404798.00 (-11.75%) 464106.00 ( 1.18%)
TPut 37 443666.00 ( 0.00%) 423167.00 ( -4.62%) 412415.00 ( -7.04%) 400748.00 ( -9.67%) 435204.00 ( -1.91%)
TPut 46 440557.00 ( 0.00%) 421533.00 ( -4.32%) 397819.00 ( -9.70%) 399048.00 ( -9.42%) 416957.00 ( -5.36%)
Performance here is more or less flat although it's interesting to
note how much scanning shared pages affects the differences between JVM
performance. Overall the series performance is more or less unchanged with
some improvements in variability. This should also benefit from false
sharing detection but it would also benefit if there was proper detection
of related tasks that share pages.
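The Mean and TPut rows in the multi-JVM table are related: every TPut value
is exactly four times the corresponding Mean, which fits four JVMs (one per
node) with TPut being their combined throughput. The count of four is
inferred from the numbers rather than stated for this run, so treat the
sketch below, including its function name, as an assumption.

```python
def total_tput(per_jvm_mean, nr_jvms=4):
    """Combined throughput across JVMs given the per-JVM mean.

    nr_jvms=4 is inferred from the table, not stated for this run.
    """
    return per_jvm_mean * nr_jvms

# Vanilla column, 1 and 10 warehouses
tput_1 = total_tput(30640.75)
tput_10 = total_tput(136983.25)
```

These give 122563.0 and 547933.0, the TPut 1 and TPut 10 values for vanilla.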
specjbb Peaks
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v4 scanshared-v4 splitprivate-v4 accountload-v4
Actual Warehouse 11.00 ( 0.00%) 11.00 ( 0.00%) 11.00 ( 0.00%) 11.00 ( 0.00%) 11.00 ( 0.00%)
Actual Peak Bops 547933.00 ( 0.00%) 532288.00 ( -2.86%) 560088.00 ( 2.22%) 476673.00 (-13.01%) 563992.00 ( 2.93%)
Accounting for load recovers the loss from splitting private/shared. Again,
proper false shared detection is required.
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v4 scanshared-v4 splitprivate-v4 accountload-v4
User 52899.04 53106.74 53245.67 52828.25 53162.02
System 250.42 254.20 203.97 222.28 230.85
Elapsed 1199.72 1208.35 1206.14 1197.28 1207.10
Small reduction in system CPU overhead.
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v4 scanshared-v4 splitprivate-v4 accountload-v4
THP fault alloc 65188 66217 68158 63283 65531
THP collapse alloc 97 172 91 108 135
THP splits 38 37 36 34 41
THP fault fallback 0 0 0 0 0
THP collapse fail 0 0 0 0 0
Compaction stalls 0 0 0 0 0
Compaction success 0 0 0 0 0
Compaction failures 0 0 0 0 0
Page migrate success 14583860 14559261 7770770 10131560 10932731
Page migrate failure 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0
Compaction free scanned 0 0 0 0 0
Compaction cost 15138 15112 8066 10516 11348
NUMA PTE updates 128327468 129131539 74033679 72954561 72832728
NUMA hint faults 2103190 1712971 1488709 1362365 1292772
NUMA hint local faults 734136 640363 405816 471928 403028
NUMA hint local percent 34 37 27 34 31
NUMA pages migrated 14583860 14559261 7770770 10131560 10932731
AutoNUMA cost 11691 9745 8109 7515 7181
Fewer PTE updates but the percentage of local hinting faults clearly
needs improvement.
Overall the series performs well even though the gaps are still evident.
This is likely to be my last update to this series for a while but I'd
like to see this treated as a standalone with a separate series focusing on
false sharing detection and reduction, shared accesses used for selecting
preferred nodes, shared accesses used for load balancing and reintroducing
Peter's patch that balances compute nodes relative to each other. This is
to keep each series a manageable size for review even if it's obvious that
more work is required.
Documentation/sysctl/kernel.txt | 68 ++++++++
include/linux/migrate.h | 7 +-
include/linux/mm.h | 69 +++++---
include/linux/mm_types.h | 7 +-
include/linux/page-flags-layout.h | 28 ++--
include/linux/sched.h | 23 ++-
include/linux/sched/sysctl.h | 1 -
kernel/sched/core.c | 26 ++-
kernel/sched/fair.c | 321 +++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 12 ++
kernel/sysctl.c | 14 +-
mm/huge_memory.c | 26 ++-
mm/memory.c | 27 ++--
mm/mempolicy.c | 8 +-
mm/migrate.c | 21 +--
mm/mm_init.c | 18 +--
mm/mmzone.c | 12 +-
mm/mprotect.c | 28 ++--
mm/page_alloc.c | 4 +-
19 files changed, 568 insertions(+), 152 deletions(-)
--
1.8.1.4