Re: [PATCH 0/4] Powerpc: Better preemption for shared processor

From: Waiman Long
Date: Wed Oct 28 2020 - 20:02:13 EST


On 10/28/20 8:35 AM, Srikar Dronamraju wrote:
Currently, vcpu_is_preempted() relies only on the yield_count on a
shared processor. On a PowerVM LPAR, Phyp schedules at SMT8 core boundary,
i.e. all CPUs belonging to a core are either group scheduled in or group
scheduled out. This can be used to better predict non-preempted CPUs on
PowerVM shared LPARs.
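
A minimal userspace sketch of the idea above, not the patch itself. It
assumes that an odd yield count marks a vCPU as yielded or preempted, and
that under PowerVM (not KVM) a CPU sharing a core with the calling CPU can
be treated as running, since the caller itself is running. The names
vcpu_is_preempted_sketch, core_first_cpu, SKETCH_NR_CPUS and the
yield_count array are hypothetical stand-ins for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define THREADS_PER_CORE 8   /* SMT8, as described in the cover letter */
#define SKETCH_NR_CPUS   16  /* hypothetical: two cores, for illustration */

/* Hypothetical stand-in for the per-vCPU yield count (assumed: odd = preempted). */
static uint32_t yield_count[SKETCH_NR_CPUS];

/* First hardware thread of the core that `cpu` belongs to. */
static int core_first_cpu(int cpu)
{
    return cpu - (cpu % THREADS_PER_CORE);
}

/*
 * Report whether `cpu` looks preempted, as seen from `this_cpu`.
 * `kvm_guest` selects the KVM case, where the core-level shortcut
 * is assumed not to apply.
 */
static bool vcpu_is_preempted_sketch(int this_cpu, int cpu, bool kvm_guest)
{
    assert(this_cpu >= 0 && this_cpu < SKETCH_NR_CPUS);
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);

    /*
     * PowerVM case: the caller is running, so its whole core is
     * scheduled in and any sibling thread is not preempted.
     */
    if (!kvm_guest && core_first_cpu(cpu) == core_first_cpu(this_cpu))
        return false;

    /* Otherwise fall back to the yield count parity. */
    return (yield_count[cpu] & 1) != 0;
}
```

With this shape, a query about a sibling thread never has to read that
sibling's yield count on PowerVM; only CPUs in other cores take the
fallback path.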

perf stat -r 5 -a perf bench sched pipe -l 10000000 (lower time is better)

powerpc/next
     35,107,951.20  msec cpu-clock           # 255.898 CPUs utilized ( +- 0.31% )
        23,655,348  context-switches         # 0.674 K/sec ( +- 3.72% )
            14,465  cpu-migrations           # 0.000 K/sec ( +- 5.37% )
            82,463  page-faults              # 0.002 K/sec ( +- 8.40% )
 1,127,182,328,206  cycles                   # 0.032 GHz ( +- 1.60% ) (66.67%)
    78,587,300,622  stalled-cycles-frontend  # 6.97% frontend cycles idle ( +- 0.08% ) (50.01%)
   654,124,218,432  stalled-cycles-backend   # 58.03% backend cycles idle ( +- 1.74% ) (50.01%)
   834,013,059,242  instructions             # 0.74 insn per cycle
                                             # 0.78 stalled cycles per insn ( +- 0.73% ) (66.67%)
   132,911,454,387  branches                 # 3.786 M/sec ( +- 0.59% ) (50.00%)
     2,890,882,143  branch-misses            # 2.18% of all branches ( +- 0.46% ) (50.00%)

137.195 +- 0.419 seconds time elapsed ( +- 0.31% )

powerpc/next + patchset
     29,981,702.64  msec cpu-clock           # 255.881 CPUs utilized ( +- 1.30% )
        40,162,456  context-switches         # 0.001 M/sec ( +- 0.01% )
             1,110  cpu-migrations           # 0.000 K/sec ( +- 5.20% )
            62,616  page-faults              # 0.002 K/sec ( +- 3.93% )
 1,430,030,626,037  cycles                   # 0.048 GHz ( +- 1.41% ) (66.67%)
    83,202,707,288  stalled-cycles-frontend  # 5.82% frontend cycles idle ( +- 0.75% ) (50.01%)
   744,556,088,520  stalled-cycles-backend   # 52.07% backend cycles idle ( +- 1.39% ) (50.01%)
   940,138,418,674  instructions             # 0.66 insn per cycle
                                             # 0.79 stalled cycles per insn ( +- 0.51% ) (66.67%)
   146,452,852,283  branches                 # 4.885 M/sec ( +- 0.80% ) (50.00%)
     3,237,743,996  branch-misses            # 2.21% of all branches ( +- 1.18% ) (50.01%)

117.17 +- 1.52 seconds time elapsed ( +- 1.30% )

This is around a 14.6% improvement in performance.
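
The figure can be checked from the two elapsed times reported above
(137.195 s and 117.17 s). A small sketch of that arithmetic follows; the
helper name time_reduction_pct is made up for illustration.

```c
#include <assert.h>

/* Percentage reduction in elapsed time going from `before` to `after`. */
static double time_reduction_pct(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}
```

Calling time_reduction_pct(137.195, 117.17) gives roughly 14.6, which
matches the number quoted.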

Cc: linuxppc-dev <linuxppc-dev@xxxxxxxxxxxxxxxx>
Cc: LKML <linux-kernel@xxxxxxxxxxxxxxx>
Cc: Michael Ellerman <mpe@xxxxxxxxxxxxxx>
Cc: Nicholas Piggin <npiggin@xxxxxxxxx>
Cc: Nathan Lynch <nathanl@xxxxxxxxxxxxx>
Cc: Gautham R Shenoy <ego@xxxxxxxxxxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Valentin Schneider <valentin.schneider@xxxxxxx>
Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
Cc: Waiman Long <longman@xxxxxxxxxx>
Cc: Phil Auld <pauld@xxxxxxxxxx>

Srikar Dronamraju (4):
  powerpc: Refactor is_kvm_guest declaration to new header
  powerpc: Rename is_kvm_guest to check_kvm_guest
  powerpc: Reintroduce is_kvm_guest
  powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted

 arch/powerpc/include/asm/firmware.h  |  6 ------
 arch/powerpc/include/asm/kvm_guest.h | 25 +++++++++++++++++++++++++
 arch/powerpc/include/asm/kvm_para.h  |  2 +-
 arch/powerpc/include/asm/paravirt.h  | 18 ++++++++++++++++++
 arch/powerpc/kernel/firmware.c       |  5 ++++-
 arch/powerpc/platforms/pseries/smp.c |  3 ++-
 6 files changed, 50 insertions(+), 9 deletions(-)
 create mode 100644 arch/powerpc/include/asm/kvm_guest.h

This patch series looks good to me and the performance is nice too.

Acked-by: Waiman Long <longman@xxxxxxxxxx>

Just curious, is the performance gain mainly from the use of static_branch (patches 1 - 3) or from reducing calls to yield_count_of()?

Cheers,
Longman