Re: SD_LOAD_BALANCE
From: Julia Lawall
Date: Sat Oct 10 2020 - 19:15:31 EST
Hello,
Previously, I was wondering why, starting in Linux v5.8, my unblocking
threads were moving to different sockets more frequently than in previous
releases. Now, I think that I have found the reason.
The first issue is the change from runnable load average to load average
in computing wake_affine_weight:
---
commit 11f10e5420f6cecac7d4823638bff040c257aba9
Author: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Date: Fri Oct 18 15:26:36 2019 +0200
sched/fair: Use load instead of runnable load in wakeup path
Runnable load was originally introduced to take into account the case where
blocked load biases the wake up path, which may end up selecting an overloaded
CPU with a large number of runnable tasks instead of an underutilized
CPU with a huge blocked load.
The wake up path now starts looking for idle CPUs before comparing
runnable load, and it's worth aligning the wake up path with the
load_balance() logic.
---
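To make the difference concrete, here is a toy sketch of the comparison that
wake_affine_weight() now makes (not the kernel code: the scaling by CPU
capacity and imbalance_pct is omitted, and the function name is mine). The
point is just that the load averages being compared include blocked load:

#include <stdio.h>

/*
 * Toy sketch of the wake_affine_weight() comparison after the commit
 * above.  "load" stands for the CPU's load average, which includes
 * blocked load; before the commit the runnable load average was used.
 * The real function also scales by CPU capacity and imbalance_pct,
 * which is omitted here.
 */
static int wake_affine_weight_sketch(long this_load, long prev_load,
				     long task_load)
{
	long this_eff_load = this_load + task_load; /* waker's CPU would gain the task */
	long prev_eff_load = prev_load - task_load; /* prev CPU no longer carries it */

	/* pull the task to the waker's CPU when its effective load is lower */
	return this_eff_load < prev_eff_load;
}

int main(void)
{
	/*
	 * Numbers in the spirit of the bprint in the trace below
	 * (614400 < 2981888): prev is idle but still carries blocked
	 * load, so the waker's CPU wins.
	 */
	printf("pull to this_cpu: %d\n",
	       wake_affine_weight_sketch(614400, 2981888, 0));
	return 0;
}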
The unfortunate case is illustrated by the following trace (*** on the
important lines):
trace-cmd-8006 [118] 451.444751: sched_migrate_task: comm=containerd pid=2481 prio=120 orig_cpu=114 dest_cpu=118
ua.B.x-8007 [105] 451.444752: sched_switch: ua.B.x:8007 [120] S ==> swapper/105:0 [120]
trace-cmd-8006 [118] 451.444769: sched_switch: ua.B.x:8006 [120] S ==> containerd:2481 [120] ***
containerd-2481 [118] 451.444859: sched_switch: containerd:2481 [120] S ==> swapper/118:0 [120] ***
ua.B.x-8148 [016] 451.444910: sched_switch: ua.B.x:8148 [120] S ==> swapper/16:0 [120]
ua.B.x-8154 [127] 451.445186: sched_switch: ua.B.x:8154 [120] S ==> swapper/127:0 [120]
ua.B.x-8145 [047] 451.445199: sched_switch: ua.B.x:8145 [120] S ==> swapper/47:0 [120]
ua.B.x-8138 [147] 451.445200: sched_switch: ua.B.x:8138 [120] S ==> swapper/147:0 [120]
ua.B.x-8152 [032] 451.445210: sched_switch: ua.B.x:8152 [120] S ==> swapper/32:0 [120]
ua.B.x-8144 [067] 451.445215: sched_switch: ua.B.x:8144 [120] S ==> swapper/67:0 [120]
ua.B.x-8137 [000] 451.445219: sched_switch: ua.B.x:8137 [120] S ==> swapper/0:0 [120]
ua.B.x-8140 [075] 451.445225: sched_switch: ua.B.x:8140 [120] S ==> swapper/75:0 [120]
ua.B.x-8155 [084] 451.445229: sched_switch: ua.B.x:8155 [120] S ==> swapper/84:0 [120]
ua.B.x-8161 [155] 451.445232: sched_switch: ua.B.x:8161 [120] S ==> swapper/155:0 [120]
ua.B.x-8156 [095] 451.445261: sched_switch: ua.B.x:8156 [120] S ==> swapper/95:0 [120]
ua.B.x-8153 [134] 451.445268: sched_switch: ua.B.x:8153 [120] S ==> swapper/134:0 [120]
ua.B.x-8151 [154] 451.445268: sched_switch: ua.B.x:8151 [120] S ==> swapper/154:0 [120]
ua.B.x-8141 [107] 451.445273: sched_switch: ua.B.x:8141 [120] S ==> swapper/107:0 [120]
ua.B.x-8146 [131] 451.445275: sched_switch: ua.B.x:8146 [120] S ==> swapper/131:0 [120]
ua.B.x-8160 [035] 451.445286: sched_switch: ua.B.x:8160 [120] S ==> swapper/35:0 [120]
ua.B.x-8136 [088] 451.445286: sched_switch: ua.B.x:8136 [120] S ==> swapper/88:0 [120]
ua.B.x-8159 [056] 451.445290: sched_switch: ua.B.x:8159 [120] S ==> swapper/56:0 [120]
ua.B.x-8147 [036] 451.445294: sched_switch: ua.B.x:8147 [120] S ==> swapper/36:0 [120]
ua.B.x-8150 [150] 451.445298: sched_switch: ua.B.x:8150 [120] S ==> swapper/150:0 [120]
ua.B.x-8142 [159] 451.445300: sched_switch: ua.B.x:8142 [120] S ==> swapper/159:0 [120]
ua.B.x-8157 [058] 451.445309: sched_switch: ua.B.x:8157 [120] S ==> swapper/58:0 [120]
ua.B.x-8149 [123] 451.445311: sched_switch: ua.B.x:8149 [120] S ==> swapper/123:0 [120]
ua.B.x-8162 [156] 451.445313: sched_switch: ua.B.x:8162 [120] S ==> swapper/156:0 [120]
ua.B.x-8164 [019] 451.445317: sched_switch: ua.B.x:8164 [120] S ==> swapper/19:0 [120]
ua.B.x-8139 [068] 451.445319: sched_switch: ua.B.x:8139 [120] S ==> swapper/68:0 [120]
ua.B.x-8143 [126] 451.445335: sched_switch: ua.B.x:8143 [120] S ==> swapper/126:0 [120]
ua.B.x-8163 [062] 451.445361: sched_switch: ua.B.x:8163 [120] S ==> swapper/62:0 [120]
ua.B.x-8158 [129] 451.445371: sched_switch: ua.B.x:8158 [120] S ==> swapper/129:0 [120]
ua.B.x-8040 [043] 451.445413: sched_wake_idle_without_ipi: cpu=0
ua.B.x-8165 [098] 451.445451: sched_switch: ua.B.x:8165 [120] S ==> swapper/98:0 [120]
ua.B.x-8069 [009] 451.445622: sched_waking: comm=ua.B.x pid=8007 prio=120 target_cpu=105
ua.B.x-8069 [009] 451.445635: sched_wake_idle_without_ipi: cpu=105
ua.B.x-8069 [009] 451.445638: sched_wakeup: ua.B.x:8007 [120] success=1 CPU:105
ua.B.x-8069 [009] 451.445639: sched_waking: comm=ua.B.x pid=8006 prio=120 target_cpu=118
<idle>-0 [105] 451.445641: sched_switch: swapper/105:0 [120] R ==> ua.B.x:8007 [120]
ua.B.x-8069 [009] 451.445645: bprint: select_task_rq_fair: wake_affine_weight2 returning this_cpu: 614400 < 2981888
ua.B.x-8069 [009] 451.445650: sched_migrate_task: comm=ua.B.x pid=8006 prio=120 orig_cpu=118 dest_cpu=129 ***
Basically, core 118 has run both a thread of the NAS UA benchmark and
containerd, and so it appears to have a higher load average than this_cpu
(i.e., core 9) when the thread (ua.B.x 8006) wakes up. The commit says "The
wake up path now starts looking for idle CPUs", but this is not always the
case. Here, prev and the target are not on the same socket; in that case,
cpus_share_cache(prev, target) fails, and there is no check as to whether
prev is idle. The result is that an idle core is left idle and the thread
is migrated to another socket, perhaps hurting locality.
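To illustrate the path I mean, here is a toy model of the gate in
select_idle_sibling() (not the kernel code; cpus_share_cache() and
available_idle_cpu() are stand-ins for the real helpers in
kernel/sched/fair.c, and the 80-CPUs-per-LLC boundary and "CPU 9 is busy
running the waker" are assumptions made up for the example):

#include <stdbool.h>
#include <stdio.h>

#define NO_CPU (-1)

/* Stand-ins for the real helpers; 80 CPUs per LLC and "CPU 9 is busy
 * running the waker" are assumptions made up for this example. */
static bool cpus_share_cache(int a, int b)  { return a / 80 == b / 80; }
static bool available_idle_cpu(int cpu)     { return cpu != 9; }

/* Toy model of the part of select_idle_sibling() discussed above. */
static int select_idle_sibling_sketch(int prev, int target)
{
	if (available_idle_cpu(target))
		return target;

	/* prev is only considered when it shares a cache with target ... */
	if (prev != target && cpus_share_cache(prev, target) &&
	    available_idle_cpu(prev))
		return prev;

	/* ... otherwise fall through to scanning target's LLC for an idle
	 * CPU, which is where the cross-socket migration comes from. */
	return NO_CPU;
}

int main(void)
{
	/* prev = 118 (idle, other socket), target = this_cpu = 9 (busy waker):
	 * the idle prev is never consulted. */
	printf("chosen: %d\n", select_idle_sibling_sketch(118, 9));
	return 0;
}

This prints "chosen: -1", i.e. the idle prev (118) is never consulted and
the search continues in target's LLC.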
Prior to v5.8, this was a rare event on my machine, because there were not
many of these background processes. But in v5.8, the default frequency
scaling driver for Intel machines without the HWP feature changed from
intel_pstate to intel_cpufreq (intel_pstate in passive mode). The use of
intel_cpufreq triggers very frequent kworkers on all cores. This makes it
much more likely that a core that is currently idle, and overall not at all
overloaded, will still have a higher load average than the core managing
the wakeup of the threads, even with the waking thread's contribution
deducted.
Would it be useful to always check whether prev is idle, perhaps in
wake_affine_idle or perhaps in select_idle_sibling?
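For concreteness, in terms of the toy model above (reusing its stand-in
helpers), what I have in mind is roughly the following, entirely untested:

/* Hypothetical variant of the sketch above, reusing its stand-in
 * helpers: fall back to an idle prev even when it does not share a
 * cache with target, rather than waking up yet another idle CPU on
 * the waker's socket.  Just to illustrate the question. */
static int select_idle_sibling_sketch_prev_idle(int prev, int target)
{
	if (available_idle_cpu(target))
		return target;

	/* no cpus_share_cache() requirement on prev here */
	if (prev != target && available_idle_cpu(prev))
		return prev;

	return NO_CPU;	/* fall back to scanning target's LLC */
}

With prev = 118 and target = 9 as above, this would return 118 instead of
falling through to a cross-socket migration.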
Traces from various versions are available at
https://pages.lip6.fr/Julia.Lawall/uas.pdf. 5.8 and 5.9-rc7 are toward the
end of the file. In these versions, all the threads systematically move
around at synchronization points in the program.
thanks,
julia