Re: [RFC PATCH 00/26] Runtime paravirt patching

From: Ankur Arora
Date: Fri Apr 10 2020 - 04:01:31 EST


So, first, thanks for the quick comments, even though some of my choices
drew straight NAKs (or maybe because of that!)

Second, I clearly did a bad job of motivating the series. Let me try
to address the motivation comments first and then I can address the
technical concerns separately.

[ I'm collating all the motivation comments below. ]


A KVM host (or another hypervisor) might advertise paravirtualized
features and optimization hints (ex KVM_HINTS_REALTIME) which might
become stale over the lifetime of the guest. For instance, the

Thomas> If your host changes his advertised behaviour then you want to
Thomas> fix the host setup or find a competent admin.

Juergen> Then this hint is wrong if it can't be guaranteed.

I agree, the hint behaviour is wrong and the host shouldn't be giving
hints it can only temporarily honor.
The host problem is hard to fix though: the behaviour change is due
either to a guest migration or, in the case of a hosted guest, to
cloud economics -- customers want to go to a 2:1 or worse vCPU:CPU
ratio at times of low load.

I had an offline discussion with Paolo Bonzini where he agreed that
it makes sense to make KVM_HINTS_REALTIME a dynamic hint rather than
static as it is now. (That was really the starting point for this
series.)

host might go from being undersubscribed to being oversubscribed
(or the other way round) and it would make sense for the guest
to switch pv-ops based on that.

Juergen> I think using pvops for such a feature change is just wrong.
Juergen> What comes next? Using pvops for being able to migrate a guest
Juergen> from an Intel to an AMD machine?

My statement about switching pv-ops was too broadly worded. What
I meant to say was that KVM guests choose pv_lock_ops to be native
or paravirt based on the undersubscribed/oversubscribed hint at boot,
and that this choice should be available at run-time as well.

KVM chooses between native/paravirt spinlocks at boot based on this
reasoning (from commit b2798ba0b8):
"Waiman Long mentioned that:
Generally speaking, unfair lock performs well for VMs with a small
number of vCPUs. Native qspinlock may perform better than pvqspinlock
if there is vCPU pinning and there is no vCPU over-commitment.
"

PeterZ> So what, the paravirt spinlock stuff works just fine when
PeterZ> you're not oversubscribed.

Yeah, the paravirt spinlocks work fine for both under- and
oversubscribed hosts, but they are more expensive, and that extra cost
provides no benefit when CPUs are pinned. For instance, the pv queued
spinlock unlock is a call plus a locked cmpxchg, as opposed to just a
movb $0, (%rdi).
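
[ For reference, a simplified sketch of the two unlock paths,
paraphrased from include/asm-generic/qspinlock.h and
kernel/locking/qspinlock_paravirt.h; details vary by kernel version. ]

/* Native unlock: a plain release store, i.e. movb $0, (%rdi) on x86. */
static __always_inline void queued_spin_unlock(struct qspinlock *lock)
{
	smp_store_release(&lock->locked, 0);
}

/* Paravirt unlock: reached via a call through pv_ops.lock. */
__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
{
	/* A locked cmpxchg even in the uncontended case... */
	u8 locked = cmpxchg_release(&lock->locked, _Q_LOCKED_VAL, 0);

	if (likely(locked == _Q_LOCKED_VAL))
		return;

	/* ...plus a slowpath that kicks any waiter that went to sleep. */
	__pv_queued_spin_unlock_slowpath(lock, locked);
}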

This difference shows up in kernbench runs on a KVM guest with native
and with paravirt spinlocks. I ran with 8-CPU and 64-CPU guests, with
CPUs pinned.

The native version performs the same or better.

8 CPU           Native (std-dev)    Paravirt (std-dev)   PV vs native
------          ----------------    ------------------   ------------
-j 4:  sys      151.89  ( 0.2462)   160.14  ( 4.8366)    +5.4%
-j 32: sys      162.715 (11.4129)   170.225 (11.1138)    +4.6%
-j 0:  sys      164.193 ( 9.4063)   170.843 ( 8.9651)    +4.0%


64 CPU          Native (std-dev)    Paravirt (std-dev)   PV vs native
------          ----------------    ------------------   ------------
-j 32:  sys     209.448 ( 0.37009)  210.976 ( 0.4245)    +0.7%
-j 256: sys     267.401 (61.0928)   285.73  (78.8021)    +6.8%
-j 0:   sys     286.313 (56.5978)   307.721 (70.9758)    +7.4%

In all cases the pv_kick and pv_wait counts were minimal, as expected.
The lock_slowpath counts were higher with PV, but AFAICS the native
and paravirt lock_slowpath paths are not directly comparable.

Detailed kernbench numbers attached.

Thanks
Ankur

8-cpu-pinned, native
====================

Average Half load -j 4 Run (std deviation):
Elapsed Time 303.686 (0.737652)
User Time 1032.24 (2.8133)
System Time 151.89 (0.246272)
Percent CPU 389.2 (0.447214)
Context Switches 19350.4 (82.1785)
Sleeps 125885 (148.338)

Average Optimal load -j 32 Run (std deviation):
Elapsed Time 187.068 (0.358427)
User Time 1130.33 (103.405)
System Time 162.715 (11.4129)
Percent CPU 569.1 (189.633)
Context Switches 143301 (130656)
Sleeps 126938 (1132.83)

Average Maximal load -j Run (std deviation):
Elapsed Time 189.098 (0.316812)
User Time 1166.59 (98.4454)
System Time 164.193 (9.4063)
Percent CPU 627.133 (174.169)
Context Switches 222270 (156005)
Sleeps 122562 (6470.93)

8-cpu-pinned, pv
================

Average Half load -j 4 Run (std deviation):
Elapsed Time 309.872 (5.882)
User Time 1045.8 (18.5295)
System Time 160.14 (4.83669)
Percent CPU 388.8 (0.447214)
Context Switches 41215.4 (679.522)
Sleeps 122369 (477.593)

Average Optimal load -j 32 Run (std deviation):
Elapsed Time 190.1 (0.377823)
User Time 1144 (104.248)
System Time 170.225 (11.1138)
Percent CPU 568.2 (189.107)

Average Maximal load -j Run (std deviation):
Elapsed Time 191.606 (0.108305)
User Time 1178.83 (97.908)
System Time 170.843 (8.9651)
Percent CPU 625.8 (173.49)
Context Switches 234878 (149479)
Sleeps 120542 (6073.79)

64-cpu-pinned, native
=====================

Average Half load -j 32 Run (std deviation):
Elapsed Time 54.306 (0.134833)
User Time 1072.75 (1.34598)
System Time 209.448 (0.370095)
Percent CPU 2360.4 (4.03733)
Context Switches 26999 (99.5414)
Sleeps 122408 (184.87)

Average Optimal load -j 256 Run (std deviation):
Elapsed Time 39.424 (0.150599)
User Time 1140.91 (71.8722)
System Time 267.401 (61.0928)
Percent CPU 3125.9 (806.96)
Context Switches 129662 (108217)
Sleeps 121767 (699.198)

Average Maximal load -j Run (std deviation):
Elapsed Time 41.562 (0.206083)
User Time 1174.68 (75.9342)
System Time 286.313 (56.5978)
Percent CPU 3339.87 (719.062)
Context Switches 203428 (138536)
Sleeps 119066 (3993.58)

64-cpu-pinned, pv
=================
Average Half load -j 32 Run (std deviation):
Elapsed Time 55.14 (0.0894427)
User Time 1071.99 (1.43335)
System Time 210.976 (0.424594)
Percent CPU 2326 (4.52769)
Context Switches 37544.8 (220.969)
Sleeps 115527 (94.7138)

Average Optimal load -j 256 Run (std deviation):
Elapsed Time 40.54 (0.246779)
User Time 1137.41 (68.9773)
System Time 285.73 (78.8021)
Percent CPU 3090.7 (806.218)
Context Switches 139059 (107006)
Sleeps 116962 (1518.56)

Average Maximal load -j Run (std deviation):
Elapsed Time 42.682 (0.170939)
User Time 1171.64 (74.6663)
System Time 307.721 (70.9758)
Percent CPU 3303.27 (717.418)
Context Switches 213430 (138616)
Sleeps 115143 (2930.03)