Re: [PATCH v2] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation
From: Peter Zijlstra
Date: Mon Jun 27 2022 - 03:50:30 EST
On Mon, Jun 27, 2022 at 10:13:50AM +0800, Guo Hui wrote:
> On x86 bare-metal machines, the instructions that set up the argument
> for vcpu_is_preempted() are redundant; they cost Unixbench about 4% to
> 5% of multi-core performance.
> The C prototype is:
> static bool vcpu_is_preempted(long vcpu);
>
> The argument 'vcpu' comes from node_cpu(), which osq_lock() calls when
> checking vcpu_is_preempted(). The relevant C code in node_cpu() is:
>
> cpu = node->cpu - 1;
>
> The instructions corresponding to the C code are:
> mov 0x14(%rax),%edi
> sub $0x1,%edi
>
> The above instructions are unnecessary when running natively on x86;
> they cause extra cache misses and degrade performance.
The above basically says that the argument setup is not patched out and
causes significant pain due to a cache-miss.
> Signed-off-by: Guo Hui <guohui@xxxxxxxxxxxxx>
> ---
> arch/x86/kernel/paravirt-spinlocks.c | 4 ++++
> kernel/locking/osq_lock.c | 9 ++++++++-
> 2 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
> index 9e1ea99ad..7a55f8407 100644
> --- a/arch/x86/kernel/paravirt-spinlocks.c
> +++ b/arch/x86/kernel/paravirt-spinlocks.c
> @@ -33,6 +33,8 @@ bool pv_is_native_vcpu_is_preempted(void)
> __raw_callee_save___native_vcpu_is_preempted;
> }
>
> +DECLARE_STATIC_KEY_FALSE(preempted_key);
> +
> void __init paravirt_set_cap(void)
> {
> if (!pv_is_native_spin_unlock())
> @@ -40,4 +42,6 @@ void __init paravirt_set_cap(void)
>
> if (!pv_is_native_vcpu_is_preempted())
> setup_force_cpu_cap(X86_FEATURE_VCPUPREEMPT);
> + else
> + static_branch_enable(&preempted_key);
> }
At least for x86 it makes sense to have the static_key default the other
way around. That is, enable it along with vcpu_is_preempted().
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d5610ad52..a8798e701 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -22,9 +22,16 @@ static inline int encode_cpu(int cpu_nr)
> return cpu_nr + 1;
> }
>
> +DEFINE_STATIC_KEY_FALSE(preempted_key);
> +
> static inline int node_cpu(struct optimistic_spin_node *node)
> {
> - return node->cpu - 1;
> + int cpu = 0;
> +
> + if (!static_branch_unlikely(&preempted_key))
> + cpu = node->cpu - 1;
> +
> + return cpu;
> }
Would not something like:
static inline bool
vcpu_is_preempted_node(struct optimistic_spin_node *node)
{
if (!static_branch_unlikely(&vcpu_has_preemption))
return false;
return vcpu_is_preempted(node_cpu(node->prev));
}
And then use that like:
if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
vcpu_is_preempted_node(node)))
Not generate better code still?