[RFC PATCH 0/6] x86/jump_label: Bound IPIs sent when updating a static key

From: Daniel Bristot de Oliveira
Date: Mon Oct 08 2018 - 08:53:25 EST


While tuning a system with CPUs isolated as much as possible, we
noticed that the isolated CPUs were periodically receiving bursts of
12 IPIs. Tracing the functions that emit IPIs, we saw chronyd - an
unprivileged process - generating the IPIs when changing a static key
to enable network timestamping on sockets.

For instance, the trace showed:

# trace-cmd record --func-stack -p function -l smp_call_function_single -e irq_vectors -f 'vector == 251'...
# trace-cmd report...
[...]
chronyd-858 [000] 433.286406: function: smp_call_function_single
chronyd-858 [000] 433.286408: kernel_stack: <stack trace>
=> smp_call_function_many (ffffffffbc10ab22)
=> smp_call_function (ffffffffbc10abaa)
=> on_each_cpu (ffffffffbc10ac0b)
=> text_poke_bp (ffffffffbc02436a)
=> arch_jump_label_transform (ffffffffbc020ec8)
=> __jump_label_update (ffffffffbc1b300f)
=> jump_label_update (ffffffffbc1b30ed)
=> static_key_slow_inc (ffffffffbc1b334d)
=> net_enable_timestamp (ffffffffbc625c44)
=> sock_enable_timestamp (ffffffffbc6127d5)
=> sock_setsockopt (ffffffffbc612d2f)
=> SyS_setsockopt (ffffffffbc60d5e6)
=> tracesys (ffffffffbc7681c5)
<idle>-0 [001] 433.286416: call_function_single_entry: vector=251
<idle>-0 [001] 433.286419: call_function_single_exit: vector=251

[... The IPI takes place 12 times]

The static key in this case was netstamp_needed_key. A static key
change from enabled->disabled or disabled->enabled causes the code to
be patched, and this is done in three steps:
-- Pseudo-code #1 - Current implementation ---
For each jump entry to be updated:
    1) add an int3 trap to the address that will be patched
       sync cores (send IPI to all other CPUs)
    2) update all but the first byte of the patched range
       sync cores (send IPI to all other CPUs)
    3) replace the first byte (int3) with the first byte of the
       replacing opcode
       sync cores (send IPI to all other CPUs)
-- Pseudo-code #1 ---
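
To make the source of the IPIs concrete, here is a simplified C
sketch of this flow (not the verbatim kernel code): text_poke(),
do_sync_core() and on_each_cpu() are the existing helpers, while the
wrapper name is illustrative.

-------------------------- %< ----------------------------
/*
 * Simplified sketch of the current int3-based patching done for a
 * single jump entry. Each call pays three rounds of core
 * synchronization via on_each_cpu(do_sync_core), and the jump label
 * code calls it once per entry.
 */
static void poke_one_entry(void *addr, const void *opcode, size_t len)
{
	unsigned char int3 = 0xcc;

	/* 1) plant an int3 trap on the first byte */
	text_poke(addr, &int3, 1);
	on_each_cpu(do_sync_core, NULL, 1);		/* IPI round #1 */

	/* 2) write all but the first byte of the new instruction */
	if (len > 1) {
		text_poke(addr + 1, opcode + 1, len - 1);
		on_each_cpu(do_sync_core, NULL, 1);	/* IPI round #2 */
	}

	/* 3) replace the int3 with the first byte of the new opcode */
	text_poke(addr, opcode, 1);
	on_each_cpu(do_sync_core, NULL, 1);		/* IPI round #3 */
}
-------------------------- >% ----------------------------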

As the static key netstamp_needed_key has four entries in our kernel
(it is used in four places in the code), 3 IPIs were generated for
each entry, resulting in 12 IPIs. The number of IPIs is therefore
linear with regard to the number 'n' of entries of a key: O(3*n),
which is O(n).

This algorithm works fine for the update of a single key, but we think
it is possible to optimize the case in which a static key has more
than one entry. For instance, the sched_schedstats jump label has 56
entries in my (updated) Fedora kernel, resulting in 168 IPIs sent to
each CPU on which the thread doing the enabling is _not_ running.

In this patch set, rather than applying each update immediately, it is
possible to queue all updates first and then apply them all at once,
rewriting pseudo-code #1 in this way:


-- Pseudo-code #2 - This patch ---
1) for each entry in the queue:
       add an int3 trap to the address that will be patched
   sync cores (send IPI to all other CPUs)

2) for each entry in the queue:
       update all but the first byte of the patched range
   sync cores (send IPI to all other CPUs)

3) for each entry in the queue:
       replace the first byte (int3) with the first byte of the
       replacing opcode
   sync cores (send IPI to all other CPUs)
-- Pseudo-code #2 - This patch ---
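
A minimal sketch of the batched apply, assuming a hypothetical queue
of patch sites (struct text_patch_entry and the function name below
are illustrative, not the code of the actual patches):

-------------------------- %< ----------------------------
/* hypothetical descriptor for one queued patch site */
struct text_patch_entry {
	void		*addr;
	unsigned char	opcode[JUMP_LABEL_NOP_SIZE];
	size_t		len;
};

/*
 * Sketch of the batched apply: each of the three patching steps is
 * done for every queued entry before the cores are synchronized, so
 * the whole batch costs three IPI rounds instead of 3 * n.
 */
static void poke_batch(struct text_patch_entry *entries, int n)
{
	unsigned char int3 = 0xcc;
	int i;

	/* 1) plant int3 on every entry, then one core sync */
	for (i = 0; i < n; i++)
		text_poke(entries[i].addr, &int3, 1);
	on_each_cpu(do_sync_core, NULL, 1);

	/* 2) write the tail bytes of every entry, then one core sync */
	for (i = 0; i < n; i++)
		text_poke(entries[i].addr + 1, entries[i].opcode + 1,
			  entries[i].len - 1);
	on_each_cpu(do_sync_core, NULL, 1);

	/* 3) replace every int3 with the final first byte, then one sync */
	for (i = 0; i < n; i++)
		text_poke(entries[i].addr, entries[i].opcode, 1);
	on_each_cpu(do_sync_core, NULL, 1);
}
-------------------------- >% ----------------------------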

Doing the update in this way, the number of IPIs becomes O(3) with
regard to the number of entries, which is O(1).

Currently, the jump label of a static key is transformed via the
arch-specific function:

void arch_jump_label_transform(struct jump_entry *entry,
                               enum jump_label_type type)

The new approach (batch mode) uses two arch functions. The first one
has the same arguments as arch_jump_label_transform():

void arch_jump_label_transform_queue(struct jump_entry *entry,
                                     enum jump_label_type type)

Rather than transforming the code immediately, it adds the jump_entry
to a queue of entries to be updated.

After all the jump_entries have been queued, the function:

void arch_jump_label_transform_apply(void)

applies the changes in the queue.
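
For illustration, this is roughly how the generic jump_label code can
drive the pair of functions when updating all the entries of a key.
Only the two arch_* functions above come from this series; the loop
shape mirrors the existing __jump_label_update(), and the surrounding
names are illustrative:

-------------------------- %< ----------------------------
/*
 * Sketch: queue the transformation of every entry of the key, then
 * patch them all with a single apply call.
 */
static void __jump_label_update_batch(struct static_key *key,
				      struct jump_entry *entry,
				      struct jump_entry *stop)
{
	for (; entry < stop && jump_entry_key(entry) == key; entry++)
		arch_jump_label_transform_queue(entry, jump_label_type(entry));

	/* one int3/tail/first-byte pass (3 IPI rounds) for the whole key */
	arch_jump_label_transform_apply();
}
-------------------------- >% ----------------------------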

One easy way to see the benefit of this patch set is to toggle
schedstats on and off. For instance:

-------------------------- %< ----------------------------
#!/bin/bash

while true; do
	sysctl -w kernel.sched_schedstats=1
	sleep 2
	sysctl -w kernel.sched_schedstats=0
	sleep 2
done
-------------------------- >% ----------------------------

while watching the IPI count:

-------------------------- %< ----------------------------
# watch -n1 "cat /proc/interrupts | grep Function"
-------------------------- >% ----------------------------

With the current implementation, it is possible to see around 168 IPIs
every 2 seconds, while with this patch set the number of IPIs drops to
3 every 2 seconds.

Although the motivation for this patch set is to reduce the noise on
threads that are *not* causing the enabling/disabling of the static
key, counter-intuitively, it also improves the performance of the
(slow) enable/disable path for the thread that is actually doing the
change. The reason is that the cost of allocating memory, manipulating
the list, and freeing memory is smaller than the cost of sending the
IPIs. For example, on a system with 24 CPUs, the current cost of
enabling the schedstats key is around 170000 us, while with this patch
set it decreases to around 2200 us.

This is an RFC, so comments and criticism about things I am missing
are more than welcome.

The batch of operations was suggested by Scott Wood <swood@xxxxxxxxxx>.

Daniel Bristot de Oliveira (6):
jump_label: Add for_each_label_entry helper
jump_label: Add the jump_label_can_update_check() helper
x86/jump_label: Move checking code away from __jump_label_transform()
x86/jump_label: Add __jump_label_set_jump_code() helper
x86/alternative: Split text_poke_bp() into three steps
x86/jump_label,x86/alternatives: Batch jump label transformations

arch/x86/include/asm/jump_label.h | 2 +
arch/x86/include/asm/text-patching.h | 9 ++
arch/x86/kernel/alternative.c | 115 ++++++++++++++++---
arch/x86/kernel/jump_label.c | 161 ++++++++++++++++++++-------
include/linux/jump_label.h | 8 ++
kernel/jump_label.c | 46 ++++++--
6 files changed, 273 insertions(+), 68 deletions(-)

--
2.17.1