"struct perf_sample_data" alignment

From: Linus Torvalds
Date: Thu Mar 04 2021 - 22:46:05 EST


Sp there's a note about new warnings in 5.12-rc1 that I looked at, and
many of the warnings made me go "Whaaa?". They were all of the type

warning: 'perf_event_aux_event' uses dynamic stack allocation

and then when I go look, I see nothing that looks like a dynamic stack
allocation at all.

But you know what? The warning is kind of misleading, but at the same
time it's true in a sense. The problem is that the function (and a lot
of other functions) has a local variable like this:

struct perf_sample_data sample;

and the definition of that "struct perf_sample_data" has
____cacheline_aligned at the end.

And guess what? That means that now the compiler has actually to play
games with manually aligning the frame of that function, since the
natural stack alignment is *not* a full cacheline aligned thing. So
it's kind of true: the frame allocation is mnot a simple static thing,
it's a nasty complex thing that wastes memory and time.

That ____cacheline_aligned goes back many years, this is not new, it
seems to come from back in 2014: commit 2565711fb7d7 ("perf: Improve
the perf_sample_data struct layout").

But it really seems entirely and utterly bogus. That cacheline
alignment makes things *worse*, when the variables are on the local
stack. The local stack is already going to be dirty and in the cache,
and aligning those things isn't going to - and I quote from the code
in that commend in that commit - "minimize the cachelines touched".

Quite the reverse. It's just going to make the stack frame use *more*
memory, and make any cacheline usage _worse_.

Maybe there are static (non-automatic) cases of that "struct
perf_sample_data" where the alignment makes sense, but it does not
seem to make sense on the _type_. It should be on those individual
variables, not on every perf_sample_data.

I didn't make any real effort to analyze this, but from a few quick
greps, it looks like every single case is an automatic variable on the
stack, and that the forced alignment is actually a bad thing every
single time. It never ever generates better cache use, but it _does_
generate worse code, and bigger stack frames.

Hmm? What am I missing?

Linus