Re: [PATCH v8 00/25] timer: Move from a push remote at enqueue to a pull at expiry model

From: Lukasz Luba
Date: Fri Oct 13 2023 - 07:35:03 EST


Hi Anna-Maria

On 10/4/23 13:34, Anna-Maria Behnsen wrote:
Hi,


[snip]



Testing
~~~~~~~

Enqueue
^^^^^^^

The impact of wasting cycles during enqueue by using the heuristic, in
contrast to always queueing the timer on the local CPU, was measured with a
micro benchmark: a timer is enqueued and dequeued in a loop with 1000
repetitions on an isolated CPU, and the time the loop takes is measured. A
quarter of the remaining CPUs was kept busy. This measurement was repeated
several times. With the patch queue the average duration was reduced by
approximately 25%.

145ns plain v6
109ns v6 with patch queue
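
For illustration, a minimal sketch of what such an enqueue/dequeue loop
could look like, assuming a timer_list timer that is armed and canceled
1000 times and timed with ktime_get(). The function and timer names are
made up for this sketch and are not taken from the actual test:

/* Arm and immediately cancel a timer in a loop and measure the elapsed
 * time of the whole loop.
 */
#include <linux/timer.h>
#include <linux/ktime.h>
#include <linux/jiffies.h>
#include <linux/printk.h>

static void bench_timer_fn(struct timer_list *t)
{
	/* Never runs: the timer is always canceled before it expires. */
}

static DEFINE_TIMER(bench_timer, bench_timer_fn);

static void timer_enqueue_bench(void)
{
	ktime_t start, stop;
	int i;

	start = ktime_get();
	for (i = 0; i < 1000; i++) {
		/* Enqueue the timer far enough in the future ... */
		mod_timer(&bench_timer, jiffies + HZ);
		/* ... and dequeue it again right away. */
		del_timer(&bench_timer);
	}
	stop = ktime_get();

	pr_info("enqueue/dequeue loop took %lld ns\n",
		ktime_to_ns(ktime_sub(stop, start)));
}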


Furthermore the impact of residence in deep idle states of an idle system
was investigated. The patch queue doesn't downgrade this behavior.

dbench test
^^^^^^^^^^^

A dbench test starting X pairs of clients and servers is used to create
load on the system. The measured value is the throughput. The tests were
executed on a Zen3 machine. The baseline is the tip tree branch
timers/core, which is based on v6.6-rc1.

governor menu

X pairs     timers/core        pull-model       impact
------------------------------------------------------
      1     353.19 (0.19)      353.45 (0.30)     0.07%
      2     700.10 (0.96)      687.00 (0.20)    -1.87%
      4    1329.37 (0.63)     1282.91 (0.64)    -3.49%
      8    2561.16 (1.28)     2493.56 (1.76)    -2.64%
     16    4959.96 (0.80)     4914.59 (0.64)    -0.91%
     32    9741.92 (3.44)     8979.83 (1.13)    -7.82%
     64   16535.40 (2.84)    16388.47 (4.02)    -0.89%
    128   22136.83 (2.42)    23174.50 (1.43)     4.69%
    256   39256.77 (4.48)    38994.00 (0.39)    -0.67%
    512   36799.03 (1.83)    38091.10 (0.63)     3.51%
   1024   32903.03 (0.86)    35370.70 (0.89)     7.50%


governor teo

X pairs     timers/core        pull-model       impact
------------------------------------------------------
      1     350.83 (1.27)      352.45 (0.96)     0.46%
      2     699.52 (0.85)      690.10 (0.54)    -1.35%
      4    1339.53 (1.99)     1294.71 (2.71)    -3.35%
      8    2574.10 (0.76)     2495.46 (1.97)    -3.06%
     16    4898.50 (1.74)     4783.06 (1.64)    -2.36%
     32    9115.50 (4.63)     9037.83 (1.58)    -0.85%
     64   16663.90 (3.80)    16042.00 (1.72)    -3.73%
    128   25044.93 (1.11)    23250.03 (1.08)    -7.17%
    256   38059.53 (1.70)    39658.57 (2.98)     4.20%
    512   36369.30 (0.39)    38890.13 (0.36)     6.93%
   1024   33956.83 (1.14)    35514.83 (0.29)     4.59%



Ping Pong Observation
^^^^^^^^^^^^^^^^^^^^^

During testing on a mostly idle machine a ping pong game could be observed:
a process_timeout timer is expired remotely on a non-idle CPU. Then the CPU
where schedule_timeout() was executed to enqueue the timer comes out of
idle, restarts the timer via schedule_timeout() and goes back to idle
again. This is due to the fair scheduler, which tries to keep the task on
the CPU it previously executed on.
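
For reference, a minimal sketch of a sleeper that exhibits this pattern;
the kthread function below is made up for illustration and is not code
from the series. Each iteration arms a process_timeout timer via
schedule_timeout_uninterruptible(), the CPU running the thread goes idle,
and the timer itself may be expired remotely on a busy CPU, which then
wakes the task back up on its previous, now idle CPU:

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/jiffies.h>

static int sleeper_thread(void *data)
{
	while (!kthread_should_stop()) {
		/*
		 * schedule_timeout_uninterruptible() enqueues a
		 * process_timeout timer and puts the task to sleep. With
		 * the pull model the timer may be expired on another,
		 * non-idle CPU; that CPU wakes this task, the fair
		 * scheduler places it back on its previous CPU, which
		 * leaves idle, re-arms the timer here and goes idle again.
		 */
		schedule_timeout_uninterruptible(msecs_to_jiffies(10));
	}
	return 0;
}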



I have tested this on my two Arm boards, one running a mainline kernel and
one an almost-mainline kernel. On both platforms it looks stable.
The results w/ your patchset look better.

1. rockpi4b - mainline kernel (but no UI)

Limited the cpumask to only the 4 little CPUs and set the performance
governor for cpufreq and the menu governor for cpuidle.

1.1. perf bench sched pipe

w/o patchset vs. w/ patchset
avg [ops/sec]:
(more is better)
23012.33 vs. 23154.33 (+0.6%)

avg [usecs/op]:
(less is better)
43.453 vs. 43.187 (-0.6%)

1.2. perf bench sched messaging
(less is better)

w/o patchset vs. w/ patchset
avg total time [s]:
2.7855 vs. 2.7005 (-3.1%)

2. pixel6 (kernel v5.18 with backported patchset)

2.1 Speedometer 2.0 (JS test running in Chrome browser)

w/o patchset vs. w/ patchset
149 vs. 146 (-2%)

2.2 Geekbench 5
(more is better)

Single core
w/o patchset vs. w/ patchset
1025 vs. 1017 (-0.7%)

Multi core
w/o patchset vs. w/ patchset
2756 vs. 2813 (+2%)


The performance looks good. Only one test, 'Speedometer', shows a somewhat
lower score, which is interesting.

Feel free to add:

Tested-by: Lukasz Luba <lukasz.luba@xxxxxxx>

Regards,
Lukasz