Re: clang asm-goto support (Was Re: [PATCH v2] x86/retpoline: Add clang support)

From: Ingo Molnar
Date: Wed Feb 14 2018 - 18:07:28 EST



* Ingo Molnar <mingo@xxxxxxxxxx> wrote:

> To quantify it: I just performed a test build of a Linux distro kernel config
> (Fedora x86-64), and counted the number of callsites that use 'asm goto'
> functionality with the v4.15 kernel (including drivers).
>
> The results:
>
>                                                Linux distro | !CONFIG_TRACING
> -----------------------------------------------------------------------------
> total # of functions                         :      191,567 |         184,443
> total # of instructions                      :   14,251,355 |      13,526,112
> -----------------------------------------------------------------------------
> total # of spin_lock*() calls                :       25,246 |          25,177
> total # of mutex_lock*() calls               :       13,062 |          12,861
> total # of kmalloc*() calls                  :        5,148 |           5,118
> -----------------------------------------------------------------------------
> total # of 'asm goto' usage sites            :       34,851 |          31,059
> total # of 'asm goto' using functions        :       18,209 |          16,089
> -----------------------------------------------------------------------------
> percent of kernel functions using 'asm goto' :         9.5% |            8.7%
> -----------------------------------------------------------------------------

Here are the size stats of kernel/sched/built-in.o for the same distro config:

                             optimized | no asm goto
-----------------------------------------------------------------------------
total # of functions    :          765 |         764
total # of instructions :       46,830 |      47,051

I.e. asm goto support reduces the scheduler's instruction count by ~0.5% (221
fewer instructions), which is a major reduction in generated code size.

This doesn't count the performance advantages of live branch patching: many of
those asm goto usage sites are in hot paths, so the performance impact is much
larger than the size difference suggests, easily a couple of percentage points
in scheduler-intensive benchmarks, as Peter mentioned.
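
For readers unfamiliar with the construct, here is a minimal sketch of what an
'asm goto' site looks like, assuming GCC or a clang that supports asm goto.
feature_enabled_sketch() is a hypothetical name, not a kernel API, and the
empty asm template only stands in for the patchable NOP that the kernel's jump
label code would emit there; nothing in this sketch is ever patched:

```c
#include <stdbool.h>

/*
 * Hypothetical illustration of an 'asm goto' site, not kernel code.
 *
 * The asm statement lists l_enabled as a possible jump target, so the
 * compiler must keep both paths. The template is empty here, so no
 * instruction is emitted and control always falls through. In the kernel,
 * a NOP would sit at this spot and could later be rewritten into a jump
 * to the label, which is what makes the disabled case nearly free.
 */
static bool feature_enabled_sketch(void)
{
	__asm__ goto("" : : : : l_enabled);
	return false;	/* fall-through path: feature disabled */
l_enabled:
	return true;	/* reached only if the asm jumps to the label */
}
```

Without asm goto the same check has to be written as a load, a compare and a
conditional branch, which is where the extra instructions in the 'no asm goto'
column come from.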

For example, here's a thread context switch benchmark comparison on a modern
x86 system running a v4.15 kernel:

$ perf stat --repeat 20 --sync --null perf bench sched messaging -t -g 25

  no asm goto:         0.136778505 seconds time elapsed  ( stddev: +- 0.55% )
  asm goto optimized:  0.133773904 seconds time elapsed  ( stddev: +- 0.51% )

The asm goto enabled kernel is ~2.25% faster in this benchmark, and the
performance penalty of not having asm goto support will only increase in the
future.
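
The percentage above can be reproduced from the two elapsed times. A small
sketch, assuming "faster" is measured relative to the asm goto run's elapsed
time (the function name is made up for illustration):

```c
/*
 * Percentage by which the faster run beats the slower one, expressed
 * relative to the faster run's elapsed time:
 *
 *     (slow / fast - 1) * 100
 *
 * Measuring relative to the slower run instead would give a slightly
 * smaller number.
 */
static double speedup_percent(double slow_seconds, double fast_seconds)
{
	return (slow_seconds / fast_seconds - 1.0) * 100.0;
}
```

speedup_percent(0.136778505, 0.133773904) evaluates to roughly 2.25, which is
well outside the ~0.5% stddev reported for both runs.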

I.e. it very much makes sense to implement asm goto support not just for
compatibility reasons, but for performance reasons as well.

Thanks,

Ingo