[PATCH] locking/osq: Drop the overload of osq lock

From: Pan Xinhui
Date: Sat Jun 25 2016 - 09:42:51 EST

Next message: Linus Torvalds: "Re: [GIT pull] x86 fixes for 4.7"
Previous message: Chris Mason: "[GIT PULL 2/2] Btrfs"
Next in thread: Peter Zijlstra: "Re: [PATCH] locking/osq: Drop the overload of osq lock"

An over-committed guest with more vCPUs than pCPUs spends an excessive
amount of time spinning in osq_lock().

This happens when vCPU A holds the osq lock and is scheduled out, while
vCPU B spins waiting for its per-cpu node->locked to be set. IOW, vCPU B
is waiting for vCPU A to run again and unlock the osq lock. The existing
need_resched() check does not help in this scenario.

To fix this, add a spin threshold to one of the while-loops in osq_lock().
The threshold value is chosen to be roughly equal to SPIN_THRESHOLD.
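
For readers who want to see the idea in isolation, here is a minimal
userspace sketch of a bounded spin-wait, assuming C11 atomics. It is not
the kernel code: struct spin_node, SPIN_LIMIT and bounded_spin_wait() are
hypothetical names made up for illustration, and the false return only
stands in for the "goto unqueue" path.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the patch's OSQ_SPIN_THRESHOLD. */
#define SPIN_LIMIT (1 << 15)

/* Hypothetical, simplified stand-in for a queue node's ->locked flag. */
struct spin_node {
	atomic_int locked;
};

/*
 * Spin until node->locked becomes nonzero, but give up once the loop
 * counter reaches SPIN_LIMIT. Returns true if the flag was observed
 * set, false on bail-out (where the kernel code would unqueue).
 */
static bool bounded_spin_wait(struct spin_node *node)
{
	int loops = 0;	/* must start at zero for the bound to mean anything */

	while (!atomic_load_explicit(&node->locked, memory_order_acquire)) {
		if (loops++ == SPIN_LIMIT)
			return false;
		/* the kernel relaxes the CPU here; omitted in this sketch */
	}
	return true;
}
```

A waiter whose predecessor never gets to run leaves the loop after a
bounded number of iterations instead of burning its whole time slice.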

perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09% sched-messaging [kernel.vmlinux] [k] osq_lock
12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock
3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task
3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is
2.49% sched-messaging [kernel.vmlinux] [k] system_call

after patch:
7.62% sched-messaging [kernel.kallsyms] [k] wait_consider_task
7.30% sched-messaging [kernel.kallsyms] [k] _raw_write_lock_irq
5.93% sched-messaging [kernel.kallsyms] [k] mutex_unlock
5.74% sched-messaging [unknown] [H] 0xc000000000077590
4.37% sched-messaging [kernel.kallsyms] [k] __copy_tofrom_user_powe
2.58% sched-messaging [kernel.kallsyms] [k] system_call

Signed-off-by: Pan Xinhui <xinhui.pan@xxxxxxxxxxxxxxxxxx>
---
kernel/locking/osq_lock.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..922fe5d 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -81,12 +81,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
return next;
}

+/* Spinning up to the threshold should take nearly 0.5ms on most archs */
+#define OSQ_SPIN_THRESHOLD (1 << 15)
+
bool osq_lock(struct optimistic_spin_queue *lock)
{
struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
struct optimistic_spin_node *prev, *next;
int curr = encode_cpu(smp_processor_id());
int old;
+ int loops = 0;

node->locked = 0;
node->next = NULL;
@@ -118,8 +122,14 @@ bool osq_lock(struct optimistic_spin_queue *lock)
while (!READ_ONCE(node->locked)) {
/*
* If we need to reschedule bail... so we can block.
+ * An over-committed guest with more vCPUs than pCPUs
+ * might get stuck in this loop and cause a huge overhead.
+ * This is because vCPU A (prev) holds the osq lock and is
+ * scheduled out, while vCPU B (node) waits for ->locked to be set,
+ * IOW, waits until vCPU A runs again and unlocks the osq lock.
+ * NOTE that vCPU A and vCPU B might run on the same physical cpu.
*/
- if (need_resched())
+ if (need_resched() || loops++ == OSQ_SPIN_THRESHOLD)
goto unqueue;

cpu_relax_lowlatency();
--
2.4.11
