Re: [PATCH v5 0/7] Add latency priority for CFS class

From: K Prateek Nayak
Date: Mon Oct 17 2022 - 02:48:14 EST


Hello Vincent,

Thank you for taking a look at the report.

On 10/13/2022 8:54 PM, Vincent Guittot wrote:
> Hi Prateek,
>
> Thanks for testing the patchset on AMD and the test report below.
>
> On Wed, 12 Oct 2022 at 16:54, K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
>>
>> [..snip..]
>>
>> - Socket (Process)
>>
>> Test:        Latency Nice: 0      Latency Nice: -20      Latency Nice: 19
>> 1-groups:    6.44 (0.00 pct)      5.50 (14.59 pct) ^     6.43 (0.15 pct)
>> 2-groups:    6.55 (0.00 pct)      5.56 (15.11 pct) ^     6.36 (2.90 pct)
>> 4-groups:    6.74 (0.00 pct)      6.19 (8.16 pct)  ^     6.69 (0.74 pct)
>> 8-groups:    8.03 (0.00 pct)      8.29 (-3.23 pct)       8.02 (0.12 pct)
>> 16-groups:  12.25 (0.00 pct)     14.11 (-15.18 pct)     12.41 (-1.30 pct)
>
> I don't see any improvement with LN:-20 but only for LN:19
>
> How many iterations do you run? Could it be that the results vary
> between iterations? For some configurations I have a stddev of 10-20%
> for LN:0 and LN:-20
>

Yes, I do see a lot of run-to-run variation in the above runs:

For 1-group:

LN:           :    0          -20          19
Min           :  6.26        4.97        6.28
Max           :  6.54        6.71        6.55
Median        :  6.45        5.28        6.43
AMean         :  6.44        5.50        6.43
GMean         :  6.44        5.47        6.43
HMean         :  6.44        5.44        6.43
AMean Stddev  :  0.08        0.60        0.08
AMean CoefVar :  1.18 pct   10.89 pct    1.28 pct

For 2-group:

LN:           :    0          -20          19
Min           :  5.80        5.38        5.28
Max           :  6.80        6.70        6.32
Median        :  6.66        6.53        5.48
AMean         :  6.55        6.36        5.56
GMean         :  6.55        6.35        5.55
HMean         :  6.54        6.33        5.54
AMean Stddev  :  0.29        0.41        0.33
AMean CoefVar :  4.38 pct    6.48 pct    5.99 pct
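
The AMean, Stddev and CoefVar figures above are the usual sample
statistics over the individual iteration runtimes (CoefVar being
Stddev / AMean). A minimal sketch of how they can be computed; the
sample values in it are made up for illustration and are not the
measured data:

  /*
   * Minimal sketch of the statistics above, computed over the individual
   * iteration runtimes. Sample values are made up for illustration.
   * Build with: gcc stats.c -lm
   */
  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
          /* hypothetical per-iteration runtimes (seconds) for one group count */
          double t[] = { 6.26, 6.54, 6.45, 6.43, 6.52 };
          int i, n = sizeof(t) / sizeof(t[0]);
          double sum = 0.0, var = 0.0, amean, stddev;

          for (i = 0; i < n; i++)
                  sum += t[i];
          amean = sum / n;

          for (i = 0; i < n; i++)
                  var += (t[i] - amean) * (t[i] - amean);
          stddev = sqrt(var / (n - 1));   /* sample standard deviation */

          printf("AMean: %.2f Stddev: %.2f CoefVar: %.2f pct\n",
                 amean, stddev, 100.0 * stddev / amean);
          return 0;
  }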

I've rerun this data point and the following are the results:

- Socket (Process) (Loop: 100000)

Test:         LN:0                LN:-20               LN:19
1-groups:     6.81 (0.00 pct)     6.62 (2.79 pct)      6.62 (2.79 pct)
2-groups:     6.76 (0.00 pct)     6.69 (1.03 pct)      6.65 (1.62 pct)
4-groups:     6.62 (0.00 pct)     6.65 (-0.45 pct)     6.63 (-0.15 pct)
8-groups:     7.84 (0.00 pct)     7.81 (0.38 pct)      7.78 (0.76 pct)
16-groups:   12.87 (0.00 pct)    12.40 (3.65 pct)     12.35 (4.04 pct)

Results are more stable in these runs, but the runs with LN: -20 still
show a comparatively larger Stddev than those with LN: 0 and LN: 19.
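
For reference, the latency nice value in these runs is a per-task
attribute. Below is a minimal sketch of how a task could request it
through the sched_setattr() interface proposed by this series; the flag
value and the sched_latency_nice field are assumptions based on the
patchset and may differ in the final version:

  /*
   * Minimal sketch: request a latency nice value for the current task
   * via sched_setattr(). The SCHED_FLAG_LATENCY_NICE value and the
   * sched_latency_nice field are assumed from this series. A real user
   * may also want SCHED_FLAG_KEEP_POLICY to preserve policy/nice.
   */
  #include <stdint.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/types.h>
  #include <sys/syscall.h>

  #ifndef SCHED_FLAG_LATENCY_NICE
  #define SCHED_FLAG_LATENCY_NICE 0x80    /* assumed value from the patchset */
  #endif

  /* local copy of sched_attr extended with the field added by this series */
  struct sched_attr_ln {
          uint32_t size;
          uint32_t sched_policy;
          uint64_t sched_flags;
          int32_t  sched_nice;
          uint32_t sched_priority;
          uint64_t sched_runtime;
          uint64_t sched_deadline;
          uint64_t sched_period;
          uint32_t sched_util_min;
          uint32_t sched_util_max;
          int32_t  sched_latency_nice;    /* -20 (latency sensitive) .. 19 */
  };

  static int set_latency_nice(pid_t pid, int ln)
  {
          struct sched_attr_ln attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
          attr.sched_latency_nice = ln;

          return syscall(SYS_sched_setattr, pid, &attr, 0);
  }

  int main(void)
  {
          /* make the calling task latency sensitive */
          return set_latency_nice(0, -20) ? 1 : 0;
  }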

>>
>>> Loops: 2160 (Same as in testing)
>>
>> - Pipe (Thread)
>>
>> Test:        Latency Nice: 0      Latency Nice: -20      Latency Nice: 19
>> 1-groups:    0.10 (0.00 pct)      0.12 (-20.00 pct)      0.10 (0.00 pct)
>> 2-groups:    0.12 (0.00 pct)      0.15 (-25.00 pct)      0.11 (8.33 pct)
>> 4-groups:    0.14 (0.00 pct)      0.18 (-28.57 pct)      0.15 (-7.14 pct)
>> 8-groups:    0.17 (0.00 pct)      0.24 (-41.17 pct)      0.17 (0.00 pct)
>> 16-groups:   0.26 (0.00 pct)      0.33 (-26.92 pct)      0.21 (19.23 pct)
>>
>> - Pipe (Process)
>>
>> Test:        Latency Nice: 0      Latency Nice: -20      Latency Nice: 19
>> 1-groups:    0.10 (0.00 pct)      0.12 (-20.00 pct)      0.10 (0.00 pct)
>> 2-groups:    0.12 (0.00 pct)      0.16 (-33.33 pct)      0.12 (0.00 pct)
>> 4-groups:    0.14 (0.00 pct)      0.17 (-21.42 pct)      0.13 (7.14 pct)
>> 8-groups:    0.16 (0.00 pct)      0.24 (-50.00 pct)      0.16 (0.00 pct)
>> 16-groups:   0.23 (0.00 pct)      0.33 (-43.47 pct)      0.19 (17.39 pct)
>>
>> - Socket (Thread)
>>
>> Test:        Latency Nice: 0      Latency Nice: -20      Latency Nice: 19
>> 1-groups:    0.19 (0.00 pct)      0.18 (5.26 pct)        0.18 (5.26 pct)
>> 2-groups:    0.21 (0.00 pct)      0.21 (0.00 pct)        0.20 (4.76 pct)
>> 4-groups:    0.22 (0.00 pct)      0.25 (-13.63 pct)      0.22 (0.00 pct)
>> 8-groups:    0.27 (0.00 pct)      0.36 (-33.33 pct)      0.27 (0.00 pct)
>> 16-groups:   0.42 (0.00 pct)      0.55 (-30.95 pct)      0.40 (4.76 pct)
>>
>> - Socket (Process)
>>
>> Test:        Latency Nice: 0      Latency Nice: -20      Latency Nice: 19
>> 1-groups:    0.17 (0.00 pct)      0.17 (0.00 pct)        0.17 (0.00 pct)
>> 2-groups:    0.19 (0.00 pct)      0.20 (-5.26 pct)       0.19 (0.00 pct)
>> 4-groups:    0.20 (0.00 pct)      0.22 (-10.00 pct)      0.20 (0.00 pct)
>> 8-groups:    0.25 (0.00 pct)      0.32 (-28.00 pct)      0.25 (0.00 pct)
>> 16-groups:   0.40 (0.00 pct)      0.51 (-27.50 pct)      0.39 (2.50 pct)
>>
>> o Hackbench and Cyclictest in NPS1 configuration
>>
>> perf bench sched messaging -p -t -l 100000 -g 16&
>> cyclictest --policy other -D 5 -q -n -H 20000
>>
>> -----------------------------------------------------------------------------------------------------------------
>> |Hackbench | Cyclictest LN = 19 | Cyclictest LN = 0 | Cyclictest LN = -20 |
>> |LN |--------------------------------|---------------------------------|-----------------------------|
>> |v | Min | Avg | Max | Min | Avg | Max | Min | Avg | Max |
>> |--------------|--------|---------|-------------|----------|---------|------------|----------|---------|--------|
>> |0 | 54.00 | 117.00 | 3021.67 | 53.67 | 65.33 | 133.00 | 53.67 | 65.00 | 201.33 | ^
>> |19 | 50.00 | 100.67 | 3099.33 | 41.00 | 64.33 | 1014.33 | 54.00 | 63.67 | 213.33 |
>> |-20 | 53.00 | 169.00 | 11661.67 | 53.67 | 217.33 | 14313.67 | 46.00 | 61.33 | 236.00 | ^
>> -----------------------------------------------------------------------------------------------------------------
>
> The latency results look good with Cyclictest LN:0 and hackbench LN:0.
> 133us max latency. This suggests that your system is not overloaded
> and cyclictest doesn't really compete with others to run.

I'll get data while running hackbench with a larger number of groups.
I'll look out for larger latencies in the LN: (0, 0) case to check for
CPU contention.
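
For context on what the cyclictest Min/Avg/Max numbers above capture:
cyclictest measures how late a thread wakes up relative to a programmed
timer expiry. A rough sketch of such a measurement loop (the idea only,
not cyclictest's actual implementation):

  /*
   * Rough sketch of a cyclictest-style measurement: sleep until an
   * absolute deadline and record how late the wakeup actually was.
   */
  #include <stdio.h>
  #include <time.h>

  #define NSEC_PER_SEC    1000000000L
  #define INTERVAL_NS     1000000L        /* 1 ms period */

  int main(void)
  {
          struct timespec next, now;
          long lat_ns, min = -1, max = 0;
          long long sum = 0;
          int i, loops = 10000;

          clock_gettime(CLOCK_MONOTONIC, &next);
          for (i = 0; i < loops; i++) {
                  /* advance the absolute deadline by one period */
                  next.tv_nsec += INTERVAL_NS;
                  if (next.tv_nsec >= NSEC_PER_SEC) {
                          next.tv_nsec -= NSEC_PER_SEC;
                          next.tv_sec++;
                  }

                  clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
                  clock_gettime(CLOCK_MONOTONIC, &now);

                  /* wakeup latency: how far past the deadline we actually ran */
                  lat_ns = (now.tv_sec - next.tv_sec) * NSEC_PER_SEC +
                           (now.tv_nsec - next.tv_nsec);
                  if (min < 0 || lat_ns < min)
                          min = lat_ns;
                  if (lat_ns > max)
                          max = lat_ns;
                  sum += lat_ns;
          }

          printf("Min: %ld us Avg: %lld us Max: %ld us\n",
                 min / 1000, sum / loops / 1000, max / 1000);
          return 0;
  }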

>
>>
>> o Hackbench and schbench in NPS1 configuration
>>
>> perf bench sched messaging -p -t -l 1000000 -g 16&
>> schbench -m 1 -t 64 -s 30s
>>
>> ------------------------------------------------------------------------------------------------------------
>> |Hackbench | schbench LN = 19 | schbench LN = 0 | schbench LN = -20 |
>> |LN |----------------------------|--------------------------------|-----------------------------|
>> |v | 90th | 95th | 99th | 90th | 95th | 99th | 90th | 95th | 99th |
>> |--------------|--------|--------|----------|---------|---------|------------|---------|----------|--------|
>> |0 | 4264 | 6744 | 15664 | 17952 | 32672 | 55488 | 15088 | 25312 | 50112 |
>> |19 | 288 | 613 | 2332 | 274 | 1015 | 3628 | 374 | 1394 | 4424 |
>> |-20 | 35904 | 47680 | 79744 | 87168 | 113536 | 176896 | 13008 | 21216 | 42560 | ^
>> ------------------------------------------------------------------------------------------------------------
>
> For the schbench, your test is 30 seconds long which is longer than
> the duration of perf bench sched messaging -p -t -l 1000000 -g 16&

With a loop size of 1 million, I see schbench complete before
hackbench in all the cases. I'll rerun this with a larger group
size too to get more data and to make sure hackbench runs longer
than schbench in all cases.

>
> The duration of the latter varies depending on the latency nice value,
> so schbench is disturbed for more time in some cases.
>>
>> o SpecJBB Multi-JVM
>>
>> ---------------------------------------------
>> | Latency Nice | 0 | 19 |
>> ---------------------------------------------
>> | max-jOPS | 100% | 109.92% |
>> | critical-jOPS | 100% | 153.70% |
>> ---------------------------------------------
>>
>> In most cases, latency nice delivers what it promises.
>> Some cases marked with "^" have shown anomalies or non-linear behavior
>> that is yet to be root caused. If you've seen something similar during
>> your testing, I would love to know what could lead to such a behavior.
>
> I haven't seen anything like the results that you tagged with ^. As a
> side note, the number of groups (g1 g4 g8 g1) that I used with
> hackbench was chosen according to my 8-core system. Your system is
> much larger and hackbench may not overload it with such a small number
> of groups. Maybe you could try with g32 g64 g128 g256?
>

I agree. I'll get the data for cyclictest and schbench with hackbench
running a larger number of groups alongside.

>
>>
>> If you would like more details on the benchmarks results reported above
>> or if there is any specific workload you would like me to test on the
>> Zen3 machine, please do let me know.
>>
>>>
>>> [..snip..]
>>>
--
Thanks and Regards,
Prateek