Re: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues

From: Chen Yu
Date: Wed Aug 30 2023 - 02:18:31 EST


Hi David,

On 2023-08-09 at 17:12:18 -0500, David Vernet wrote:
> The SHARED_RUNQ scheduler feature creates a FIFO queue per LLC that
> tasks are put into on enqueue, and pulled from when a core in that LLC
> would otherwise go idle. For CPUs with large LLCs, this can sometimes
> cause significant contention, as illustrated in [0].
>
> [0]: https://lore.kernel.org/all/c8419d9b-2b31-2190-3058-3625bdbcb13d@xxxxxxxx/
>
> So as to try and mitigate this contention, we can instead shard the
> per-LLC runqueue into multiple per-LLC shards.
>
> While this doesn't outright prevent all contention, it does somewhat mitigate it.
>

Thanks for this proposal to make idle load balance more efficient. As we
dicussed previously, I launched hackbench on Intel Sapphire Rapids
and I have some findings.

This platform has 2 sockets, each socket has 56C/112T. To avoid the
run-run variance, only 1 socket is online, the cpufreq govenor is set to
performance, the turbo is disabled, and C-states deeper than C1 are disabled.

hackbench
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 1.09) +0.55 ( 0.20)
process-pipe 2-groups 1.00 ( 0.60) +3.57 ( 0.28)
process-pipe 4-groups 1.00 ( 0.30) +5.22 ( 0.26)
process-pipe 8-groups 1.00 ( 0.10) +43.96 ( 0.26)
process-sockets 1-groups 1.00 ( 0.18) -1.56 ( 0.34)
process-sockets 2-groups 1.00 ( 1.06) -12.37 ( 0.11)
process-sockets 4-groups 1.00 ( 0.29) +0.21 ( 0.19)
process-sockets 8-groups 1.00 ( 0.06) +3.59 ( 0.39)

The 8 groups pipe mode has an impressive improvement, while the 2 groups sockets
mode did see some regressions.

The possible reason to cause the regression is at the end of this reply, in
case you want to see the conclusion directly : )

To investigate the regression, I did slight hack on the hackbench, by renaming
the workload to sender and receiver.

When it is in 2 groups mode, there would be 2 groups of senders and receivers.
Each group has 14 senders and 14 receivers. So there are totally 56 tasks running
on 112 CPUs. In each group, sender_i sends package to receiver_j ( i, j belong to [1,14] )


1. Firstly use 'top' to monitor the CPU utilization:

When shared_runqueue is disabled, many CPUs are 100%, while the other
CPUs remain 0%.
When shared_runqueue is enabled, most CPUs are busy and the utilization is
in 40%~60%.

This means that shared_runqueue works as expected.

2. Then the bpf wakeup latency is monitored:

tracepoint:sched:sched_wakeup,
tracepoint:sched:sched_wakeup_new
{
if (args->comm == "sender") {
@qstime[args->pid] = nsecs;
}
if (args->comm == "receiver") {
@qrtime[args->pid] = nsecs;
}
}

tracepoint:sched:sched_switch
{
if (args->next_comm == "sender") {
$ns = @qstime[args->next_pid];
if ($ns) {
@sender_wakeup_lat = hist((nsecs - $ns) / 1000);
delete(@qstime[args->next_pid]);
}
}
if (args->next_comm == "receiver") {
$ns = @qrtime[args->next_pid];
if ($ns) {
@receiver_wakeup_lat = hist((nsecs - $ns) / 1000);
delete(@qstime[args->next_pid]);
}
}
}


It shows that, the wakeup latency of the receiver has been increased a little
bit. But consider that this symptom is the same when the hackbench is in pipe mode,
and there is no regression in pipe mode, the wakeup latency overhead might not be
the cause of the regression.

3. Then FlameGraph is used to compare the bottleneck.
There is still no obvious difference noticed. One obvious bottleneck is the atomic
write to a memory cgroup page count(and runqueue lock contention is not observed).
The backtrace:

obj_cgroup_charge_pages;obj_cgroup_charge;__kmem_cache_alloc_node;
__kmalloc_node_track_caller;kmalloc_reserve;__alloc_skb;
alloc_skb_with_frags;sock_alloc_send_pskb;unix_stream_sendmsg

However there is no obvious ratio difference between with/without shared runqueue
enabled. So this one might not be the cause.

4. Check the wakeup task migration count

Borrow the script from Aaron:
kretfunc:select_task_rq_fair
{
$p = (struct task_struct *)args->p;
if ($p->comm == "sender") {
if ($p->thread_info.cpu != retval) {
@wakeup_migrate_sender = count();
} else {
@wakeup_prev_sender = count();
}
}
if ($p->comm == "receiver") {
if ($p->thread_info.cpu != retval) {
@wakeup_migrate_receiver = count();
} else {
@wakeup_prev_receiver = count();
}
}
}

Without shared_runqueue enabled, the wakee task are mostly woken up on it
previous running CPUs.
With shared_runqueue disabled, the wakee task are mostly woken up on a
completely different idle CPUs.

This reminds me that, is it possible the regression was caused by the broken
cache locallity?


5. Check the L2 cache miss rate.
perf stat -e l2_rqsts.references,l2_request.miss sleep 10
The results show that the L2 cache miss rate is nearly the same with/without
shared_runqueue enabled.

I did not check the L3 miss rate, because:
3.1 there is only 1 socket of CPUs online
3.2 the working set the hackbench is 56 * 100 * 300000, which is nearly
the same as LLC cache size.

6. As mentioned in step 3, the bottleneck is a atomic write to a global
variable. Then use perf c2c to check if there is any false/true sharing.

According to the result, the total number and average cycles of local HITM
is low.So this might indicate that this is not a false sharing or true
sharing issue.


7. Then use perf topdown to dig into the detail. The methodology is at
https://perf.wiki.kernel.org/index.php/Top-Down_Analysis


When shared runqueue is disabled:

# 65.2 % tma_backend_bound
# 2.9 % tma_bad_speculation
# 13.1 % tma_frontend_bound
# 18.7 % tma_retiring



When shared runqueue is enabled:

# 52.4 % tma_backend_bound
# 3.3 % tma_bad_speculation
# 20.5 % tma_frontend_bound
# 23.8 % tma_retiring


We can see that, the ratio of frontend_bound has increased from 13.1% to
20.5%. As a comparison, this ratio does not increase when the hackbench
is in pipe mode.

Then further dig into the deeper level of frontend_bound:

When shared runqueue is disabled:
# 6.9 % tma_fetch_latency ----> # 7.3 % tma_ms_switches
|
----> # 7.1 % tma_dsb_switches


When shared runqueue is enabled:
# 11.6 % tma_fetch_latency ----> # 6.7 % tma_ms_switches
|
----> # 7.8 % tma_dsb_switches


1. The DSB(Decode Stream Buffer) switches count increases
from 13.1% * 6.9% * 7.1% to 20.5% * 11.6% * 7.8%

2. The MS(Microcode Sequencer) switches count increases
from 13.1% * 6.9% * 7.3% to 20.5% * 11.6% * 6.7%

DSB is the cached decoded uops, which is similar to L1 icache,
except that icache has the original instructions, while DSB has the
decoded one. DSB reflects the instruction footprint. The increase
of DSB switches mean that, the cached buffer has been thrashed a lot.

MS is to decode the complex instructions, the increase of MS switch counter
usually means that the workload is running some complex instruction.
that the workload is running complex instructions.

In summary:

So the scenario to cause this issue I'm thinking of is:
Task migration increases the DSB switches count. With shared_runqueue enabled,
the task could be migrated to different CPUs more offen. And it has to fill its
new uops into the DSB, but that DSB has already been filled by the old task's
uops. So DSB switches is triggered to decode the new macro ops. This is usually
not a problem if the workload runs some simple instructions. However if
this workload's instruction footprint increases, task migration might break
the DSB uops locality, which is similar to L1/L2 cache locality.

I wonder, if SHARED_RUNQ can consider that, if a task is a long duration one,
say, p->avg_runtime >= sysctl_migration_cost, maybe we should not put such task
on the per-LLC shared runqueue? In this way it will not be migrated too offen
so as to keep its locality(both in terms of L1/L2 cache and DSB).

thanks,
Chenyu