gcc-10: kernel stack is corrupted and fails to boot

From: Kalle Valo
Date: Wed May 13 2020 - 02:50:28 EST


(trimming CC, changing title)

Kalle Valo <kvalo@xxxxxxxxxxxxxx> writes:

> Kalle Valo <kvalo@xxxxxxxxxxxxxx> writes:
>
>> Arnd Bergmann <arnd@xxxxxxxx> writes:
>>
>>> gcc-10 correctly points out a bug with a zero-length array in
>>> struct ath10k_pci:
>>>
>>> drivers/net/wireless/ath/ath10k/ahb.c: In function 'ath10k_ahb_remove':
>>> drivers/net/wireless/ath/ath10k/ahb.c:30:9: error: array subscript 0
>>> is outside the bounds of an interior zero-length array 'struct
>>> ath10k_ahb[0]' [-Werror=zero-length-bounds]
>>> 30 | return &((struct ath10k_pci *)ar->drv_priv)->ahb[0];
>>> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> In file included from drivers/net/wireless/ath/ath10k/ahb.c:13:
>>> drivers/net/wireless/ath/ath10k/pci.h:185:20: note: while referencing 'ahb'
>>> 185 | struct ath10k_ahb ahb[0];
>>> | ^~~
>>>
>>> The last addition to the struct ignored the comments and added
>>> new members behind the array that must remain last.
>>>
>>> Change it to a flexible-array member and move it last again to
>>> make it work correctly, prevent the same thing from happening
>>> again (all compilers warn about flexible-array members in the
>>> middle of a struct) and get it to build without warnings.
>>
>> Very good find, thanks! This bug would cause all sort of strange memory
>> corruption issues.
>
> This motivated me to switch to using GCC 10.x and I noticed that you had
> already upgraded crosstool so it was a trivial thing to do, awesome :)
>
> https://mirrors.edge.kernel.org/pub/tools/crosstool/

And now I have a problem :) I first noticed that my x86 testbox is not
booting when I compile the kernel with GCC 10.1.0 from crosstool. I
didn't get any error messages so I just downgraded the compiler and the
kernel was booting fine again. Next I decided to try GCC 10.1 with my
x86 laptop and it also failed to boot, but this time I got kernel logs
and saw this:

Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secodary+0x178/0x180

Call Trace:
dump_stack
panic
? _raw_spin_unlock_irqrestore
? start_secondary
__stack_chk_fail
start_secondary
secondary_startup

(I wrote the above messages manually from a picture so expect typos)

Then also on my x86 laptop I downgraded the compiler to GCC 8.1.0 (from
crosstool), rebuilt the exactly same kernel version and the kernel
booted without issues.

I'm using 5.7.0-rc4-wt-ath+ which is basically v5.7-rc4 plus latest
wireless patches, and I doubt the wireless patches are making any
difference this early in the boot. All compilers I use are prebuilt
binaries from kernel.org crosstool repo[1] with addition of ccache
v3.4.1 to speed up my builds.

Any ideas? How should I debug this further?

[1] https://mirrors.edge.kernel.org/pub/tools/crosstool/

--
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches