Re: [PATCH RFC v3 4/6] Documentation: Add three sysctls for smart idle poll

From: quan.xu04@xxxxxxxxx
Date: Tue Nov 14 2017 - 22:15:47 EST




On 2017/11/14 15:44, Ingo Molnar wrote:
* Quan Xu <quan.xu0@xxxxxxxxx> wrote:


On 2017/11/13 23:08, Ingo Molnar wrote:
* Quan Xu <quan.xu04@xxxxxxxxx> wrote:

From: Quan Xu <quan.xu0@xxxxxxxxx>

To reduce the cost of polling, we introduce three sysctls to control the
poll time when running as a virtual machine with paravirt.

Signed-off-by: Yang Zhang <yang.zhang.wz@xxxxxxxxx>
Signed-off-by: Quan Xu <quan.xu0@xxxxxxxxx>
---
Documentation/sysctl/kernel.txt | 35 +++++++++++++++++++++++++++++++++++
arch/x86/kernel/paravirt.c | 4 ++++
include/linux/kernel.h | 6 ++++++
kernel/sysctl.c | 34 ++++++++++++++++++++++++++++++++++
4 files changed, 79 insertions(+), 0 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 694968c..30c25fb 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -714,6 +714,41 @@ kernel tries to allocate a number starting from this one.
==============================================================
+paravirt_poll_grow: (X86 only)
+
+Multiplier applied to increase the poll time. This is expected to take
+effect only when running as a virtual machine with CONFIG_PARAVIRT
+enabled. It brings no benefit on bare metal, even with
+CONFIG_PARAVIRT enabled.
+
+By default this value is 2. Possible values to set are in range {2..16}.
+
+==============================================================
+
+paravirt_poll_shrink: (X86 only)
+
+Divisor applied to reduce the poll time. This is expected to take effect
+only when running as a virtual machine with CONFIG_PARAVIRT enabled.
+It brings no benefit on bare metal, even with CONFIG_PARAVIRT
+enabled.
+
+By default this value is 2. Possible values to set are in range {2..16}.
+
+==============================================================
+
+paravirt_poll_threshold_ns: (X86 only)
+
+Controls the maximum poll time before entering the real idle path. This is
+expected to take effect only when running as a virtual machine with
+CONFIG_PARAVIRT enabled. It brings no benefit on bare metal,
+even with CONFIG_PARAVIRT enabled.
+
+By default this value is 0, which means no polling. Possible values
+are in the range {0..500000}. Change the value to non-zero if running
+latency-bound workloads in a virtual machine.

I absolutely hate it how this hybrid idle loop polling mechanism is not
self-tuning!

Ingo, actually it is self-tuning..

Then why the hell does it touch the syscall ABI?


Just to gather more data about performance and CPU utilization with
different maximum poll times.

There are 3 parameters, paravirt_poll_{grow|shrink|threshold_ns}..
We haven't touched paravirt_poll_{grow|shrink} since we sent out v1.

We tested it with the contextswitch and netperf benchmarks, using different
paravirt_poll_threshold_ns values.

Here is the data we get when running the contextswitch benchmark to measure
the latency (lower is better):
      halt_poll_threshold=0      -- 3402.9 ns/ctxsw -- 199.8 %CPU
      halt_poll_threshold=10000  -- 1151.4 ns/ctxsw -- 200.1 %CPU
      halt_poll_threshold=20000  -- 1149.7 ns/ctxsw -- 199.9 %CPU
      halt_poll_threshold=30000  -- 1151.0 ns/ctxsw -- 199.9 %CPU
      halt_poll_threshold=40000  -- 1155.4 ns/ctxsw -- 199.3 %CPU
      halt_poll_threshold=50000  -- 1161.0 ns/ctxsw -- 200.0 %CPU
      halt_poll_threshold=100000 -- 1163.8 ns/ctxsw -- 200.4 %CPU
      halt_poll_threshold=200000 -- 1163.8 ns/ctxsw -- 201.4 %CPU
      halt_poll_threshold=300000 -- 1159.4 ns/ctxsw -- 201.9 %CPU
      halt_poll_threshold=500000 -- 1163.5 ns/ctxsw -- 205.5 %CPU


Here is the data we get when running the netperf benchmark:
      halt_poll_threshold=0      -- 29031.6 bit/s -- 76.1  %CPU
      halt_poll_threshold=10000  -- 29021.7 bit/s -- 105.1 %CPU
      halt_poll_threshold=20000  -- 33463.5 bit/s -- 128.2 %CPU
      halt_poll_threshold=30000  -- 34436.4 bit/s -- 127.8 %CPU
      halt_poll_threshold=40000  -- 35563.3 bit/s -- 129.6 %CPU
      halt_poll_threshold=50000  -- 35787.7 bit/s -- 129.4 %CPU
      halt_poll_threshold=100000 -- 35477.7 bit/s -- 130.0 %CPU
      halt_poll_threshold=200000 -- 35877.7 bit/s -- 131.0 %CPU
      halt_poll_threshold=300000 -- 35730.0 bit/s -- 132.4 %CPU
      halt_poll_threshold=500000 -- 34978.4 bit/s -- 134.2 %CPU


Considering also the default value (200000, for x86) of KVM dynamic polling,
I'll set the threshold to the same value as KVM dynamic polling.

I also tested an idle VM with different halt_poll_threshold values, and this
did not make CPU utilization fluctuate..


Could I leave only the paravirt_poll_threshold_ns parameter (the maximum poll time),
which is similar to the "adaptive halt-polling" Wanpeng mentioned.. then the user can
turn it off, or find an appropriate threshold for some odd scenario..
That way lies utter madness. Maybe add it as a debugfs knob, but exposing it to
userspace: NAK.

.. so, in the next version (v4) I will fix these 3 parameters at their default values:
     paravirt_poll_threshold_ns = 200000
     paravirt_poll_shrink = 2
     paravirt_poll_grow = 2

and neither touch the syscall ABI nor expose them to userspace again.


Quan