Re: [PATCH] Enable '-Werror' by default for all kernel builds
From: Arnd Bergmann
Date: Tue Sep 07 2021 - 04:57:02 EST
On Tue, Sep 7, 2021 at 1:51 AM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
> On Mon, Sep 06, 2021 at 04:06:04PM -0700, Linus Torvalds wrote:
> > [ Adding some subsystem maintainers ]
> >
> > On Mon, Sep 6, 2021 at 10:06 AM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
> > >
> > > > But hopefully most cases are just "people haven't cared enough" and
> > > > easily fixed.
> > >
> > > We'll see. For my testbed I disabled the new configuration flag
> > > for the time being because its primary focus is boot tests, and
> > > there won't be any boot tests if images fail to build.
> >
> > Sure, reasonable.
> >
> > I've checked a few of the build errors by doing the appropriate cross
> > compiles, and it doesn't seem bad - but it does seem like we have a
> > number of really pointless long-standing warnings that should have
> > been fixed long ago.
I have a tree with fixes for anything that has hit on arm, arm64 or x86.
There are many reasons why some patch never made it in, but usually
it's because I was not persistent about resending the fix when the first
version didn't make it. In other cases I wasn't sure about my own fix.
> > For example, looking at sparc64, there are several build errors due to
> > those warnings now being fatal:
> >
> > - drivers/gpu/drm/ttm/ttm_pool.c:386
> >
> > This is a type mismatch error. It looks like __fls() on sparc64
> > returns 'int'. And the ttm_pool.c code assumes it returns 'unsigned
> > long'.
> > Oddly enough, the very line after that line does "min_t(unsigned
> > int" to get the types in line.
> > So the immediate reason is "sparc64 is different".
arc is the same as sparc here, but everything else uses unsigned long.
We've come a long way in making all those helper functions consistent
in their types, but there are still a number of exceptions.
> > But the deeper
> > reason seems to be that ttm_pool.c has odd type assumptions. But that
> > warning should have been fixed long ago, either way.
> >
> > Christian/Huang? I get the feeling that both lines in that file
> > should use the min_t(). Hmm?
> >
> > - drivers/input/joystick/analog.c:160
> >
> > #warning Precise timer not defined for this architecture.
> >
> > Unfortunate. I suspect that warning just has to be removed. It has
> > never caused anything to be fixed, it's old to the point of predating
> > the git history. Dmitry?
> >
> My solution would be to just remove the old code (that isn't using ktime)
> including the module parameter that disables it. Sure, we want to be
> backward compatible, but that code is 15+ years old and should really be
> retired.
Agreed. I added a couple of architectures to the #ifdef check over time,
but realistically this driver is only ever used on x86-32 anyway, and
we don't even care about the others here.
If we remove the #else path here, I'd make it "depends on ISA ||
COMPILE_TEST".
> > - at least a couple of stringop-overread errors. Attached is a
> > possible for for one of them.
> >
> > The stringop overread is odd, because another one of them is
> >
> > fs/qnx4/dir.c: In function ‘qnx4_readdir’:
> > fs/qnx4/dir.c:51:32: error: ‘strnlen’ specified bound 48 exceeds
> > source size 16 [-Werror=stringop-overread]
> > 51 | size = strnlen(de->di_fname, size);
> > | ^~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > but I'm not seeing why that one happens on sparc64, but not on arm64
> > or x86-64. There doesn't seem to be anything architecture-specific
> > anywhere in that area.
> >
> > Funky.
> >
> Not really. That is because de->di_fname is always 16 bytes but size
> can be 48 if the node is really a link. The use of de is overloaded
> in that case; de is struct qnx4_inode_entry (where di_fname is 16 bytes)
> but the actual data is struct qnx4_link_info where the name is 48 bytes
> long. A possible fix (compile tested only) is below.
>
> I think the warning/error is only reported with gcc 11.x. Do you possibly
> use an older compiler for x86/arm64 ?
>
> Anyway, below is a partial list of build errors I have seen. Some of
> them are easy to fix (such as the ones due to unused functions),
> but others seem to be tricky.
This one is worse, I think this is the same warning as the one I
reported as a false-positive in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99673
and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578
I submitted a patch in
https://lore.kernel.org/all/20210322160253.4032422-6-arnd@xxxxxxxxxx/
To summarize the problems:
- gcc and clang have different approaches to this type of warning: clang
tries to only produce diagnostics that are 100% reproducible regardless
of compiler internals, while gcc tries to use as much information as it has
to warn about things that may go wrong, including things that it only knows
because of inlining. Making this warning reliable is a variation of the
halting problem, just like the -Wmaybe-uninitialized warnings. The diagnostic
is definitely helpful and I found real bugs because of it, but you can never
be sure that you have found all instances.
- Some of the -Wstringop-overread warnings (and related ones) from gcc are
actually wrong, because the object size is just a heuristic. If you
have multiple
overlapping fixed-length fields in a union, gcc-11 picks one of the union
members to determine the size of the string buffer within the structure, even
when the string operation uses a different union member as the output, and
that member has the correct size.
This is also a common problem: when a new warning option gets introduced
first, there are false positives that get fixed in subsequent
compiler versions.
> alpha.log:arch/alpha/kernel/setup.c:493:13: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
I sent a couple of fixes for these: this is another false-postive bug
in gcc that made it
into the release, and it triggers on every architecture that accesses
the boot parameters
at a fixed pointer. The problem is that gcc treats '(void *)0x12345'
the same as 'NULL +
0x12345', and decides that this is an invalid NULL pointer access, so
the array has
zero readable bytes.
> alpha.log:drivers/atm/ambassador.c:1747:58: error: passing argument 1 of 'virt_to_bus' discards 'volatile' qualifier from pointer target type [-Werror=discarded-qualifiers]
Surely an alpha specific mistake, though we could fix all those
drivers to drop the
'volatile'.
> alpha.log:drivers/net/ethernet/amd/ni65.c:751:37: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
Nobody tests ISA drivers on 64-bit architectures...
> alpha.log:drivers/net/hamradio/6pack.c:71:41: error: unsigned conversion from 'int' to 'unsigned char' changes value from '256' to '0' [-Werror=overflow]
This driver is apparently broken for any HZ >= 1024
> ppc.log:drivers/net/ethernet/cirrus/cs89x0.c:897:41: error: implicit declaration of function 'isa_virt_to_bus' [-Werror=implicit-function-declaration]
My fix is in the network tree.
> riscv32.log:drivers/gpu/drm/rockchip/cdn-dp-core.c:1126:12: error: 'cdn_dp_resume' defined but not used [-Werror=unused-function]
> riscv.log:drivers/gpu/drm/rockchip/cdn-dp-core.c:1126:12: error: 'cdn_dp_resume' defined but not used [-Werror=unused-function]
A fix was submitted today, we get at least a dozen of those for each
kernel release, and there
is a plan for avoiding them altogether, but it's a giant treewide
change that nobody has managed
to tackle.
> s390.log:arch/s390/kernel/syscall.c:168:1: error: '__do_syscall' uses dynamic stack allocation [-Werror]
This is add_random_kstack_offset(). No idea why it doesn't trigger on
x86, but that
warning should probably get shut up inside of the macro.
> sparc64.log:arch/sparc/kernel/mdesc.c:647:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
> sparc64.log:arch/sparc/kernel/mdesc.c:692:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
> sparc64.log:arch/sparc/kernel/mdesc.c:719:21: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
Same as on alpha
Arnd