On Fri, Dec 21, 2012 at 06:56:13PM -0500, Rik van Riel wrote:
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 4e44840..e44c56f 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,19 +113,62 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
static bool smp_no_nmi_ipi = false;
/*
- * Wait on a congested ticket spinlock.
+ * Wait on a congested ticket spinlock. Many spinlocks are embedded in
+ * data structures; having many CPUs pounce on the cache line with the
+ * spinlock simultaneously can slow down the lock holder, and the system
+ * as a whole.
+ *
+ * To prevent total performance collapse in case of bad spinlock contention,
+ * perform proportional backoff. The per-cpu value of delay is automatically
+ * tuned to limit the number of times spinning CPUs poll the lock before
+ * obtaining it. This limits the amount of cross-CPU traffic required to obtain
+ * a spinlock, and keeps system performance from dropping off a cliff.
+ *
+ * There is a tradeoff. If we poll too often, the whole system is slowed
+ * down. If we sleep too long, the lock will go unused for a period of
+ * time. Adjusting "delay" to poll, on average, 2.7 times before the
+ * lock is obtained seems to result in low bus traffic. The combination
+ * of aiming for a non-integer amount of average polls, and scaling the
+ * sleep period proportionally to how many CPUs are ahead of us in the
+ * queue for this ticket lock seems to reduce the amount of time spent
+ * "oversleeping" the release of the lock.
*/
+#define MIN_SPINLOCK_DELAY 1
+#define MAX_SPINLOCK_DELAY 1000
+DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
+ /*
+ * Use the raw per-cpu pointer; preemption is disabled in the
+ * spinlock code. This avoids put_cpu_var once we have the lock.
+ */
+ int *delay_ptr = &per_cpu(spinlock_delay, smp_processor_id());
+ int delay = *delay_ptr;
I'm confused by the above comment. Why not just:
int delay = this_cpu_read(spinlock_delay);
?
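FWIW, untested, and the tuning logic below is just a placeholder (the rest of
the loop got cut off in the quote above), but the whole thing could then look
something like:

#define MIN_SPINLOCK_DELAY 1
#define MAX_SPINLOCK_DELAY 1000
DEFINE_PER_CPU(int, spinlock_delay) = MIN_SPINLOCK_DELAY;

void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
	/* Preemption is already disabled here, so this_cpu_*() is fine. */
	int delay = this_cpu_read(spinlock_delay);

	for (;;) {
		/* Scale the wait by how far back in the queue we are. */
		int loops = delay * (__ticket_t)(inc.tail - inc.head);

		while (loops--)
			cpu_relax();

		inc.head = ACCESS_ONCE(lock->tickets.head);
		if (inc.head == inc.tail)
			break;

		/* Placeholder: grow the delay while we keep missing. */
		if (delay < MAX_SPINLOCK_DELAY)
			delay++;
	}

	/* Placeholder: remember what we learned for next time. */
	this_cpu_write(spinlock_delay, delay);
}

That gets rid of the per_cpu()/smp_processor_id() dance, and the comment
explaining it, entirely.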
Too bad you posted this just before break. I currently have access to a
40-core box, and I would have loved to test this. But right now I have
it testing other things, and hopefully I'll still have access to it
after the break.