Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

From: Gautham R. Shenoy
Date: Mon Jun 26 2023 - 02:05:15 EST


Hello Peter, David,

On Fri, Jun 23, 2023 at 03:20:15PM +0530, Gautham R. Shenoy wrote:
> On Thu, Jun 22, 2023 at 12:29:35PM +0200, Peter Zijlstra wrote:
> > On Thu, Jun 22, 2023 at 02:41:57PM +0530, Gautham R. Shenoy wrote:

>
> I will post more results later.

I was able to get some numbers for hackbench, schbench (old), and
tbench over the weekend on a 2-socket Zen3 box with 64 cores (128
threads) per socket, configured in NPS1 mode.

The legend is as follows:

tip : tip/sched/core with HEAD being commit e2a1f85bf9f5 ("sched/psi:
Avoid resetting the min update period when it is unnecessary")


david : This patchset

david-ego-1 : David's patchset + my modification to allow SIS to signal
that a task should be queued on the shared-wakequeue when SIS cannot
find an idle CPU to wake up the task.

david-ego-2 : David's patchset + david-ego-1 + my modification to
remove the first task from the shared-wakequeue whose
cpus_allowed contains this CPU. Currently we don't do
this check and always remove the first task.


david-ego-1 and david-ego-2 are attached to this mail.

hackbench (Measure: time taken to complete, in seconds)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Test:       tip                 david               david-ego-1         david-ego-2
1-groups:   3.92 (0.00 pct)     3.35 (14.54 pct)    3.53 (9.94 pct)     3.30 (15.81 pct)
2-groups:   4.58 (0.00 pct)     3.89 (15.06 pct)    3.95 (13.75 pct)    3.79 (17.24 pct)
4-groups:   4.99 (0.00 pct)     4.42 (11.42 pct)    4.76 (4.60 pct)     4.77 (4.40 pct)
8-groups:   5.67 (0.00 pct)     5.08 (10.40 pct)    6.16 (-8.64 pct)    6.33 (-11.64 pct)
16-groups:  7.88 (0.00 pct)     7.32 (7.10 pct)     8.57 (-8.75 pct)    9.77 (-23.98 pct)
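
For anyone re-deriving the pct figures: they look like the relative
change against tip, signed so that a positive number is an improvement,
and truncated (not rounded) to two decimals. A minimal sketch of that
arithmetic follows. The truncation is an assumption inferred from values
such as 9.94 pct for 3.53 vs 3.92, and pct_vs_baseline is a hypothetical
helper name, not the script that produced these tables.

```python
from decimal import Decimal, ROUND_DOWN


def pct_vs_baseline(baseline, value, lower_is_better):
    """Relative change vs. the baseline, positive meaning 'better'.

    Assumption: the result is truncated toward zero at two decimals,
    which is what the tabulated numbers suggest.
    """
    base = Decimal(str(baseline))
    val = Decimal(str(value))
    # For time/latency a smaller value is an improvement; for throughput
    # a larger one is.
    delta = (base - val) if lower_is_better else (val - base)
    pct = delta * 100 / base
    return pct.quantize(Decimal("0.01"), rounding=ROUND_DOWN)


# hackbench, 1-groups, david-ego-1 vs tip (time in seconds, lower is better)
print(pct_vs_baseline(3.92, 3.53, lower_is_better=True))
```

The same helper with lower_is_better=False covers throughput-style
results, where a larger value than tip is the improvement.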


Observation: David's patchset does very well across all the group
counts, improving on tip by 7 to 15 pct. Expanding the scope of the
shared-wakequeue with david-ego-1 doesn't give us much and in fact
hurts at higher utilization (8 and 16 groups). The same is true of
david-ego-2, which only pulls allowed tasks from the shared-wakequeue.
In david-ego-2 we see a greater amount of spinlock contention for 8
and 16 groups, as the code holds the spinlock and iterates through the
list members while checking cpu-affinity.

So, David's original patchset wins this one.
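
To make the contention point concrete, here is a toy userspace model of
the two dequeue policies. It is only a sketch: SharedWakequeue, Task,
pull_first and pull_first_allowed are hypothetical names, and a Python
lock stands in for the spinlock. It is not the code in the attached
patches.

```python
import threading
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    pid: int
    cpus_allowed: frozenset  # CPUs this task may run on


class SharedWakequeue:
    """Toy model of a shared wakequeue (hypothetical, not kernel code)."""

    def __init__(self):
        self.lock = threading.Lock()  # stands in for the queue spinlock
        self.tasks = deque()

    def push(self, task):
        with self.lock:
            self.tasks.append(task)

    def pull_first(self):
        """Always take the head, with no affinity check: O(1) under the lock."""
        with self.lock:
            return self.tasks.popleft() if self.tasks else None

    def pull_first_allowed(self, cpu):
        """Take the first task whose cpus_allowed contains `cpu`.

        Returns (task, entries_examined). The scan runs with the lock
        held, so the hold time grows with the number of entries examined.
        """
        with self.lock:
            for i, task in enumerate(self.tasks):
                if cpu in task.cpus_allowed:
                    del self.tasks[i]
                    return task, i + 1
            return None, len(self.tasks)
```

With many CPUs pulling from one queue, every extra entry walked in
pull_first_allowed() is time the other CPUs spend waiting on the lock,
which is consistent with the extra contention seen at 8 and 16 groups.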




schbench (Measure: 99th percentile latency, in us)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#workers: tip                     david                   david-ego-1             david-ego-2
1:        26.00 (0.00 pct)        21.00 (19.23 pct)       28.00 (-7.69 pct)       22.00 (15.38 pct)
2:        27.00 (0.00 pct)        29.00 (-7.40 pct)       28.00 (-3.70 pct)       30.00 (-11.11 pct)
4:        31.00 (0.00 pct)        31.00 (0.00 pct)        31.00 (0.00 pct)        28.00 (9.67 pct)
8:        36.00 (0.00 pct)        37.00 (-2.77 pct)       34.00 (5.55 pct)        39.00 (-8.33 pct)
16:       49.00 (0.00 pct)        49.00 (0.00 pct)        48.00 (2.04 pct)        50.00 (-2.04 pct)
32:       80.00 (0.00 pct)        80.00 (0.00 pct)        88.00 (-10.00 pct)      79.00 (1.25 pct)
64:       169.00 (0.00 pct)       180.00 (-6.50 pct)      174.00 (-2.95 pct)      168.00 (0.59 pct)
128:      343.00 (0.00 pct)       355.00 (-3.49 pct)      356.00 (-3.79 pct)      344.00 (-0.29 pct)
256:      42048.00 (0.00 pct)     46528.00 (-10.65 pct)   51904.00 (-23.43 pct)   48064.00 (-14.30 pct)
512:      95104.00 (0.00 pct)     95872.00 (-0.80 pct)    95360.00 (-0.26 pct)    97152.00 (-2.15 pct)


Observations: There are run-to-run variations with this benchmark. I
will try with the newer schbench later this week.

tbench (Measure: Throughput, records/s)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Clients:  tip                    sis-node               david                  david-ego-1            david-ego-2
1         452.49 (0.00 pct)      457.94 (1.20 pct)      448.52 (-0.87 pct)     447.11 (-1.18 pct)     458.45 (1.31 pct)
2         862.44 (0.00 pct)      879.99 (2.03 pct)      860.14 (-0.26 pct)     873.27 (1.25 pct)      891.72 (3.39 pct)
4         1604.27 (0.00 pct)     1618.87 (0.91 pct)     1610.95 (0.41 pct)     1628.45 (1.50 pct)     1657.26 (3.30 pct)
8         2966.77 (0.00 pct)     3040.90 (2.49 pct)     2991.07 (0.81 pct)     3063.31 (3.25 pct)     3106.50 (4.70 pct)
16        5176.70 (0.00 pct)     5292.29 (2.23 pct)     5478.32 (5.82 pct)     5462.05 (5.51 pct)     5537.15 (6.96 pct)
32        8205.24 (0.00 pct)     8949.12 (9.06 pct)     9039.63 (10.16 pct)    9466.07 (15.36 pct)    9365.06 (14.13 pct)
64        13956.71 (0.00 pct)    14461.42 (3.61 pct)    16337.65 (17.05 pct)   16941.63 (21.38 pct)   15697.47 (12.47 pct)
128       24005.50 (0.00 pct)    26052.75 (8.52 pct)    25605.24 (6.66 pct)    27243.19 (13.48 pct)   24854.60 (3.53 pct)
256       32457.61 (0.00 pct)    21999.41 (-32.22 pct)  36953.22 (13.85 pct)   32299.31 (-0.48 pct)   33037.03 (1.78 pct)
512       34345.24 (0.00 pct)    41166.39 (19.86 pct)   40845.23 (18.92 pct)   40797.97 (18.78 pct)   38150.17 (11.07 pct)
1024      33432.92 (0.00 pct)    40900.84 (22.33 pct)   39749.35 (18.89 pct)   41133.82 (23.03 pct)   38464.26 (15.04 pct)


Observations: tbench really likes all variants of shared-wakequeue. I
have also included sis-node numbers since we saw that tbench liked
sis-node.

Also, it can be noted that except for the 256 clients case (number of
clients == number of threads in the system), in all other cases, we
see a benefit with david-ego-1 which extends the usage of
shared-wakequeue to the waker's target when the waker's LLC is busy.

Will try to get the netperf, postgresql, SPECjbb and DeathStarBench
numbers this week.

--
Thanks and Regards
gautham.