Re: ipv4/tcp.c:4234:1: error: the frame size of 1152 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]

From: Linus Torvalds
Date: Tue Sep 07 2021 - 19:19:55 EST


[ Added maintainers for various bits and pieces, since I spent the
time trying to look at why those bits and pieces wasted stack-space
and caused problems ]

On Tue, Sep 7, 2021 at 3:16 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> None of these seem to be new.

That said, all but one of them seem to be (a) very much worth fixing
and (b) easy to fix.

The do_tcp_getsockopt() one in tpc.c is a classic case of "lots of
different case statements, many of them with their own struct
allocations on stack, and all of them disjoint".

So the fix for that would be the traditional one of just making helper
functions for the cases that aren't entirely trivial. We've done that
before, and it not only fixes stack usage problems, it results in code
that is easier to read, and generally actually performs better too
(exactly because it avoids sparse stacks and extra D$ use)

The pe_test_uints() one is the same horrendous nasty Kunit pattern
that I fixed in commit 4b93c544e90e ("thunderbolt: test: split up test
cases in tb_test_credit_alloc_all") that had an even worse case.

The KUNIT macros create all these individually reasonably small
initialized structures on stack, and when you have more than a small
handful of them the KUNIT infrastructure just makes the stack space
explode. Sometimes the compiler will be able to re-use the stack
slots, but it seems to be an iffy proposition to depend on it - it
seems to be a combination of luck and various config options.

I detest code that exists for debugging or for testing, and that
violates fundamental rules and causes more problems in the process.

The mac802.11 one seems to be due to 'struct ieee802_11_elems' being
big, and allocated on the stack. I think it's probably made worse
there with inlining, ie

- ieee80211_sta_rx_queued_mgmt() has one copy

- ieee80211_rx_mgmt_beacon() is possibly inlined, and has its own copy

but even if it isn't due to that kind of duplication due to inlining,
that code is dangerous. Exactly because it has two nested stack frames
with that big structure, and they are active at the same time in the
callchain whether inlined or not.

And it's *pointlessly* dangerous, because the 'struct ieee802_11_elems
elems' in ieee80211_sta_rx_queued_mgmt() is only used for the
IEEE80211_STYPE_ACTION case, so it is entirely disjoint from the
IEEE80211_STYPE_BEACON case, and those stack allocations simply should
not nest like that in the first place.

Making the IEEE80211_STYPE_ACTION case be its own function - like the
other cases - and moving the struct there should fix it. Possibly a
"noinline" or two necessary to make sure that the compiler doesn't
then undo the "these two cases are disjoint" thing.

The qede_config_rx_mode() has a fairly big 'struct qed_filter_params'
structure on stack (it's mainly just a union of two other structures).
That one is a bit silly, because the very same function *alsu* does a
temporary allocation for the 'uc_macs[]' array, and I think it could
have literally made that allocation just do both the params and the
uc_macs[] array together.

But that "a bit silly" is actually *doubly* silly, because that big
structure allocated for the stack that is actually a union, uses the
QED_FILTER_TYPE_RX_MODE type of the union. Which in turn is literally
*one*single*enum*field*.

So the qede_config_rx_mode() case uses that chunk of kernel stack for
a big union for no good reason. It really only wants two words, but
the way the code is written, it uses a lot, because the union also has
a 'struct qed_filter_mcast_params' member that has an array of
[64][ETH_ALEN] bytes in it.

So that's about 400 bytes of stack space entirely wasted if I read the
code correctly.

The xhci_reserve_bandwidth() one is because it has an array of 31
'struct xhci_bw_info' on the stack. It's not a huge structure (six
32-bit words), but when you have 31 of those in an array, it's about
750 bytes right there. It should likely just be dynamically allocated
- it doesn't seem to be some super-important critical thing where an
allocation cannot be done.

The do_sys_poll() thing is a bit sad. The code has been tweaked to
basically use 1kB of stack space in one configuration. It overflows it
in a lot of other configs. Using stack space for those kinds of
top-level functions that are guaranteed to have an empty stack is
pretty much the best possible situation, but it's one where we don't
really have a good way to try to have some kind of dynamic feedback
from the compiler for how much other stack space it is using.

So that do_sys_poll() case is the only one I see where the stack usage
is actually fine and explicitly expected. We *aim* for 1kB of stack,
and then in some - probably quite a few - situations we go over.

There are many more of these cases. I've seen Hyper-V allocate 'struct
cpumask' on the stack, which is once again an absolute no-no that
people have apparently just ignored the warning for. When you have
NR_CPUS being the maximum of 8k, those bits add up, and a single
cpumask is 1kB in size. Which is why you should never do that on
stack, and instead use '

cpumask_var_t mask;
alloc_cpumask_var(&mask,..)

which will do a much more reasonable job. But the reason I call out
hyperv is that as far as I know, hyperv itself doesn't actually
support 8192 CPU's. So all that apic noise with 'struct cpumask' that
uses 1kB of data when NR_CPUS is set to 8192 is just wasted. Maybe I'm
wrong. Adding hyperv people to the cc too.

A lot of the stack frame size warnings are hidden by the fact that our
default value for warning about stack usage is 2kB for 64-bit builds.

Probably exactly because people did things like that cpumask thing,
and have these arrays of structures that are often even bigger in the
64-bit world.

Linus