[PATCH 0/6] x86-64: Micro-optimize vclock_gettime

From: Andy Lutomirski
Date: Mon Mar 28 2011 - 11:12:12 EST


This series speeds up vclock_gettime(CLOCK_MONOTONIC) by almost 30%
(tested on Sandy Bridge). The patches are listed in roughly decreasing
order of improvement.

These are meant for 2.6.40, but if anyone wants to take some of them
for 2.6.39 I won't object.

The changes and timings (fastest of 20 trials of 100M iters on Sandy
Bridge) are:

Unpatched:

CLOCK_MONOTONIC: 22.09ns
CLOCK_REALTIME_COARSE: 4.23ns
CLOCK_MONOTONIC_COARSE: 5.65ns

x86-64: Optimize vread_tsc's barriers

This replaces lfence;rdtsc;lfence with a faster sequence with similar
ordering guarantees.

CLOCK_MONOTONIC: 18.28ns
CLOCK_REALTIME_COARSE: 4.23ns
CLOCK_MONOTONIC_COARSE: 5.98ns

x86-64: Don't generate cmov in vread_tsc

GCC likes to generate a cmov on a branch that's almost completely
predictable. Force it to generate a real branch instead.

CLOCK_MONOTONIC: 16.30ns
CLOCK_REALTIME_COARSE: 4.23ns
CLOCK_MONOTONIC_COARSE: 5.95ns
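
As a rough illustration of the cmov issue above, here is a minimal
sketch of one way to nudge GCC toward a real branch. The helper name
clamp_to_last and its arguments are hypothetical and are not taken from
vread_tsc; whether a branch is actually emitted depends on the compiler
version and flags, so check the disassembly.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's vread_tsc): return the later of
 * a fresh counter reading and the last value the timekeeper recorded.
 */
static inline uint64_t clamp_to_last(uint64_t now, uint64_t last)
{
	/* Mark the common case as likely so it becomes the fall-through. */
	if (__builtin_expect(now >= last, 1))
		return now;

	/*
	 * An empty asm statement is opaque to the optimizer. Placing one
	 * on the rare path discourages the compiler from merging both
	 * paths into a single conditional move (assumes a GCC-compatible
	 * compiler).
	 */
	__asm__ volatile ("");
	return last;
}
```

The result is the same as max(now, last); only the generated code is
meant to differ.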

x86-64: Put vsyscall_gtod_data at a fixed virtual address

Because vsyscall_gtod_data's address isn't known until load time, the
code contains unnecessary address calculations. Hardcode it. This is
a nice speedup for the _COARSE variants as well.

CLOCK_MONOTONIC: 16.12ns
CLOCK_REALTIME_COARSE: 3.70ns
CLOCK_MONOTONIC_COARSE: 5.31ns

x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0

vset_normalize_timespec was more general than necessary. Open-code
the appropriate normalization loops. This is a big win for
CLOCK_MONOTONIC_COARSE.

CLOCK_MONOTONIC: 16.09ns
CLOCK_REALTIME_COARSE: 3.70ns
CLOCK_MONOTONIC_COARSE: 4.49ns
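
For illustration, an open-coded normalization that relies on nsec never
being negative could look like the sketch below. The struct and
function names are made up for this example and are not the ones used
in vclock_gettime.c.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical stand-in for a timespec; not the kernel's type. */
struct ts_sketch {
	int64_t sec;
	int64_t nsec;
};

/*
 * Normalize under the assumption that nsec >= 0 on entry, so only the
 * overflow direction needs handling. No loop for nsec < 0 is required.
 */
static void normalize_nonneg(struct ts_sketch *ts)
{
	while (ts->nsec >= NSEC_PER_SEC) {
		ts->nsec -= NSEC_PER_SEC;
		ts->sec++;
	}
}
```

When nsec exceeds one second by only a small multiple, the loop runs a
handful of iterations and avoids a division.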

x86-64: Omit frame pointers on vread_tsc

This is a bit silly and needs work for gcc < 4.4 (if we even care),
but, rather surprisingly, it's 0.3ns faster. I guess that the CPU's
stack frame optimizations aren't quite as good as I thought.

CLOCK_MONOTONIC: 15.79ns
CLOCK_REALTIME_COARSE: 3.70ns
CLOCK_MONOTONIC_COARSE: 4.50ns

x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO

We're building the vDSO with flags meant for kernel code that disable
some optimizations. Override them, except for -fno-omit-frame-pointer,
which we keep because omitting frame pointers might make userspace
debugging harder.

CLOCK_MONOTONIC: 15.66ns
CLOCK_REALTIME_COARSE: 3.44ns
CLOCK_MONOTONIC_COARSE: 4.23ns
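
The "fastest of N trials" timings above can be approximated from
userspace with a harness along these lines. This is only a sketch, not
the program that produced the numbers in this mail; the function names
are hypothetical, and it goes through the libc clock_gettime() wrapper
rather than calling the vDSO directly.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed from *a to *b. */
static int64_t ts_diff_ns(const struct timespec *a, const struct timespec *b)
{
	return (int64_t)(b->tv_sec - a->tv_sec) * 1000000000LL
		+ (int64_t)(b->tv_nsec - a->tv_nsec);
}

/*
 * Time `iters` back-to-back clock_gettime(clk) calls, repeat that for
 * `trials` runs, and return the fastest average cost per call in ns.
 */
static double fastest_ns_per_call(clockid_t clk, int trials, long iters)
{
	double best = -1.0;

	for (int t = 0; t < trials; t++) {
		struct timespec start, end, scratch;

		clock_gettime(CLOCK_MONOTONIC, &start);
		for (long i = 0; i < iters; i++)
			clock_gettime(clk, &scratch);
		clock_gettime(CLOCK_MONOTONIC, &end);

		double per_call = (double)ts_diff_ns(&start, &end) / (double)iters;
		if (best < 0.0 || per_call < best)
			best = per_call;
	}
	return best;
}
```

Calling fastest_ns_per_call(CLOCK_MONOTONIC, 20, 100000000L) mirrors
the trial and iteration counts quoted above. Taking the minimum rather
than the mean filters out trials disturbed by interrupts or scheduling.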


Andy Lutomirski (6):
x86-64: Optimize vread_tsc's barriers
x86-64: Don't generate cmov in vread_tsc
x86-64: Put vsyscall_gtod_data at a fixed virtual address
x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0
x86-64: Omit frame pointers on vread_tsc
x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO

arch/x86/kernel/tsc.c | 48 ++++++++++++++++++++++++++++++++-------
arch/x86/kernel/vmlinux.lds.S | 13 +++++-----
arch/x86/vdso/Makefile | 15 +++++++++++-
arch/x86/vdso/vclock_gettime.c | 40 ++++++++++++++++++---------------
arch/x86/vdso/vextern.h | 9 ++++++-
5 files changed, 90 insertions(+), 35 deletions(-)

--
1.7.4
