Re: [PATCH 3/5] x86,smp: auto tune spinlock backoff delay factor

From: Michel Lespinasse
Date: Thu Jan 10 2013 - 07:49:37 EST


On Tue, Jan 8, 2013 at 2:30 PM, Rik van Riel <riel@xxxxxxxxxx> wrote:
> v3: use fixed-point math for the delay calculations, suggested by Michel Lespinasse
>
> -	if (head == ticket)
> +	if (head == ticket) {
> +		/*
> +		 * We overslept, and do not know by how much.
> +		 * Exponentially decay the value of delay,
> +		 * to get it back to a good value quickly.
> +		 */
> +		if (delay >= 2 * DELAY_FIXED_1)
> +			delay -= max(delay/32, DELAY_FIXED_1);
>  		break;
> +	}

I think delay -= (delay - MIN_SPINLOCK_DELAY) / 32 would work too, and
avoid the conditions here.

It won't make a difference in practice, though, because delay > 32 *
DELAY_FIXED_1 on all the machines I tested (even for the shortest
possible do-nothing spinlocked regions).


Also - I think we could make MIN_SPINLOCK_DELAY a tunable. There is a
limit to how fast the lock cacheline can bounce from one waiter to the
next, and we know that this limit isn't DELAY_FIXED_1. Using
DELAY_FIXED_1 is fine when we don't know a better lower bound for a
given system, but it would be nice if people could set the correct
value when they do know it.
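
Just to sketch what I mean (names and the boot-parameter approach are
hypothetical, not from your series - a sysfs knob would work just as
well), something along these lines would do:

```c
/* Hypothetical sketch: allow MIN_SPINLOCK_DELAY to be overridden at
 * boot. The parameter name and variable are illustrative only. */
static unsigned int min_spinlock_delay = DELAY_FIXED_1;

static int __init set_min_spinlock_delay(char *str)
{
	unsigned int val;

	/* Reject unparseable or too-small values. */
	if (kstrtouint(str, 0, &val) || val < DELAY_FIXED_1)
		return -EINVAL;
	min_spinlock_delay = val;
	return 0;
}
early_param("min_spinlock_delay", set_min_spinlock_delay);
```

The decay code would then clamp against min_spinlock_delay instead of
DELAY_FIXED_1.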



Looking at the bigger picture:

The tests I've done haven't shown any regressions over baseline. Your
v1 proposal had a tendency to oversleep, but this has been fixed in v2
and v3. The only risk I can foresee - and I think this hasn't been
tested enough - is that the algorithm might oversleep when it
encounters a mix of short and long held spinlocks. The proper behavior
in that case is to be conservative and use the shortest encountered
hold time as our delay. There is a strong theoretical case that
additive increase/exponential decay should achieve that, but I'd like
to see it tested in practice.

Trying to lay down the argument for why additive increase /
exponential decay should work here:
- When we oversleep, we spin (waiters_ahead * delay / DELAY_FIXED_1)
iterations (with waiters_ahead < NR_CPUS) and update new_delay = delay
* (31 / 32).
- If we keep oversleeping, delay converges towards 0, so the sum of
all overslept delay values is bounded by original_delay *
sum(i=0...infinity, (31/32)^i), which is 32 * original_delay. The
total number of oversleeping iterations is therefore bounded by
original_delay * (NR_CPUS * 32 / DELAY_FIXED_1).
- Each increase of delay by DELAY_FIXED_1 / 7 can thus be seen as
having a potential worst-case future cost of (NR_CPUS * 32 / 7)
oversleeping iterations.
I like that this last value is at least bounded, but it's still high
enough that I think actual behavior in the face of varying hold
durations should be investigated.



As for test results:
running 32 instances of netperf -t UDP_STREAM -H <server> -- -m 128 on
a 32-hypercore machine (2 Sandy Bridge sockets) with an htb root qdisc:
I am seeing ~650 to ~680 Mbps total with baseline 3.7, and ~860 to
~890 Mbps total with patches 1-3 applied.
(Patch 4 makes the results a bit more consistent, in the ~880 to ~890
Mbps range.)



Reviewed-by: Michel Lespinasse <walken@xxxxxxxxxx>

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.