Re: [GIT pull] perf/urgent for 5.7-rc2

From: Ingo Molnar
Date: Mon Apr 20 2020 - 03:48:52 EST



* Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:

> On Sun, Apr 19, 2020 at 11:56:51AM -0700, Linus Torvalds wrote:
>
> > So I'm wondering if there any way that objtool could be run at
> > link-time (and archive time) rather than force a re-build of all the
> > object files from source?
>
> We've actually been making progress in that direction. Peter added
> partial vmlinux.o support, for Thomas' noinstr validation. The problem
> is, linking is single-threaded so it ends up making the kernel build
> slower overall.
>
> So right now, we still do most things per compilation unit, and only do
> the noinstr validation at vmlinux.o link time. Eventually, especially
> with LTO, we'll probably end up moving everything over to link time.

Fortunately, I believe much of what objtool does against vmlinux.o can be
parallelized in a rather straightforward fashion, if we build with
-ffunction-sections.
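
(Just to illustrate what -ffunction-sections buys us: with it, GCC emits
every function into its own .text.<function> section, so sections become
natural per-function work units. E.g. on a throwaway test file:

  $ printf 'int foo(void) { return 1; }\nint bar(void) { return 2; }\n' > test.c
  $ gcc -ffunction-sections -c test.c
  $ readelf -S test.o

readelf then lists separate .text.foo and .text.bar sections instead of a
single .text section.)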

Here's the main "objtool check" processing steps:

int check(const char *_objname, bool orc)
{
        ...
        /* Read the ELF file and decode all instructions: */
        ret = decode_sections(&file);
        ...

        /* Per-function stack and control-flow validation: */
        ret = validate_functions(&file);
        ...
        ret = validate_unwind_hints(&file);
        ...
        ret = validate_reachable_instructions(&file);
        ...
        /* ORC unwind table generation: */
        ret = create_orc(&file);
        ...
        ret = create_orc_sections(&file);
}

The 'decode_sections()' step takes about 92% of the runtime against
vmlinux.o:

$ taskset 1 perf stat --repeat 3 --sync --null tools/objtool/objtool check vmlinux.o

Performance counter stats for 'tools/objtool/objtool check vmlinux.o' (3 runs):

3.05757 +- 0.00247 seconds time elapsed ( +- 0.08% )

$ taskset 1 perf stat --repeat 3 --sync --null tools/objtool/objtool check --exit-after-decode vmlinux.o

Performance counter stats for 'tools/objtool/objtool check --exit-after-decode vmlinux.o' (3 runs):

2.83132 +- 0.00272 seconds time elapsed ( +- 0.10% )

(--exit-after-decode is a local hack I added to objtool, which makes it
exit right after decode_sections().)
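
For reference, the hack itself is trivial - something along these lines
in check(), gated by a bool wired up to the (purely local, made up for
this measurement) --exit-after-decode option:

        ret = decode_sections(&file);
        ...
        /* Profiling hack: bail out right after the decode pass: */
        if (exit_after_decode)
                return ret;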

Within decode_sections(), the main overhead is in decode_instructions()
(~75% of the total objtool overhead):

2.31325 +- 0.00609 seconds time elapsed ( +- 0.26% )

This goes through every executable section, to decode the instructions:

static int decode_instructions(struct objtool_file *file)
{
        ...
        for_each_sec(file, sec) {

                if (!(sec->sh.sh_flags & SHF_EXECINSTR))
                        continue;

The size distribution of function sections is strongly biased towards
sections of 100 bytes or less: over 95% of all instructions in vmlinux.o
are in such sections.

In fact, over 99% of all decoded instructions are in a section of 500
bytes or smaller, so a threaded decoder where each thread batch-decodes a
handful of sections in a single processing step and then batch-inserts
them into the (global) instructions hash should do the trick.

The batching size could be driven by section byte size, i.e. we could say
that the unit of batching is for a decoding thread to grab ~10k bytes
worth of sections from the list, build a local list of decoded
instructions, and then insert them into the global hash in a single go.
This would scale very well IMO, with the defconfig already having almost
3 million instructions, and a distro build or allmodconfig build a lot
more.
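
To make that a bit more concrete, here is a rough pthreads sketch of the
batching scheme. The section/instruction types and the section_size(),
decode_insns_in_section() and insert_insns_into_global_hash() helpers are
made-up placeholders for the corresponding objtool internals, not
existing functions - this is only meant to show the shape of the
parallelism:

#include <pthread.h>

/* Illustration only: opaque stand-ins for objtool's real types: */
struct section;
struct instruction;

/* Hypothetical helpers standing in for the real decode/insert paths: */
unsigned long section_size(struct section *sec);
int decode_insns_in_section(struct section *sec, struct instruction **list);
void insert_insns_into_global_hash(struct instruction *list);

#define BATCH_BYTES     (10 * 1024)     /* ~10k bytes of sections per batch */

struct decode_work {
        struct section          **secs;         /* all SHF_EXECINSTR sections */
        unsigned long           nr_secs;
        unsigned long           next;           /* next unclaimed section */
        pthread_mutex_t         claim_lock;     /* protects ->next */
        pthread_mutex_t         hash_lock;      /* protects the global hash */
};

static void *decode_worker(void *arg)
{
        struct decode_work *work = arg;

        for (;;) {
                struct instruction *local = NULL;
                unsigned long i, first, last, bytes = 0;

                /* Claim ~BATCH_BYTES worth of sections from the list: */
                pthread_mutex_lock(&work->claim_lock);
                first = work->next;
                while (work->next < work->nr_secs && bytes < BATCH_BYTES)
                        bytes += section_size(work->secs[work->next++]);
                last = work->next;
                pthread_mutex_unlock(&work->claim_lock);

                if (first == last)
                        return NULL;            /* all sections claimed */

                /* Decode the whole batch into a thread-local list: */
                for (i = first; i < last; i++)
                        decode_insns_in_section(work->secs[i], &local);

                /* One global hash insertion per batch: */
                pthread_mutex_lock(&work->hash_lock);
                insert_insns_into_global_hash(local);
                pthread_mutex_unlock(&work->hash_lock);
        }
}

/* The caller initializes 'work' (including both mutexes) before this: */
static int decode_instructions_threaded(struct decode_work *work, int nr_threads)
{
        pthread_t threads[nr_threads];
        int i;

        for (i = 0; i < nr_threads; i++)
                pthread_create(&threads[i], NULL, decode_worker, work);
        for (i = 0; i < nr_threads; i++)
                pthread_join(threads[i], NULL);

        return 0;
}

The only serialized steps left are claiming the next batch and the single
hash insertion per batch, both of which are cheap compared to the
decoding itself, so this ought to scale pretty close to linearly with the
number of worker threads.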

I believe the ~3.0 seconds total objtool runtime above could be reduced
to below 1.0 second on typical contemporary development systems - which
would IMHO make it a feasible model to run objtool only against the whole
kernel binary.

Is there any code generation disadvantage or other quirk of
-ffunction-sections, or any other complication I missed, that would make
this difficult?

Thanks,

Ingo