Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11

From: Linus Torvalds
Date: Thu Nov 09 2017 - 15:04:29 EST


On Thu, Nov 9, 2017 at 11:51 AM, Patrick McLean <chutzpah@xxxxxxxxxx> wrote:
>
> We do have CONFIG_GCC_PLUGIN_STRUCTLEAK and
> CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL enabled on these boxes as well as
> CONFIG_GCC_PLUGIN_RANDSTRUCT as you pointed out before.

It might be worth just verifying without RANDSTRUCT in particular.

That case has probably not gotten a huge amount of testing. As Al
points out, it can cause absolutely horrendous cache access pattern
changes, but it might also be triggering some corruption in case
there's a problem with the plugin, or with some piece of kernel code
that gets confused by it.

And most obviously: if there is some module or part of the kernel that
got compiled with a different seed for the randstruct hashing, that
will break in nasty nasty ways. Your out-of-kernel module is the
obvious suspect for something like that, but honestly, it could be
some missing build dependency, or simply a missing special case in the
plugin itself, a missing __no_randomize_layout, or any number of things.
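
To illustrate the mismatched-seed case, here is a minimal userspace
sketch. It is not the plugin's actual behaviour: the struct names, the
fields, and the read_owner_at() helper are all hypothetical, and the two
hand-written field orders merely stand in for what two different seeds
might produce for the same struct.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical struct as one build (say, the kernel proper) lays it out. */
struct conn_kernel_view {
	void *owner;
	unsigned long flags;
	int refcount;
};

/*
 * The same fields in a different order, standing in for what a build
 * with a different randomization seed (say, a module) might see.
 */
struct conn_module_view {
	int refcount;
	void *owner;
	unsigned long flags;
};

/*
 * Read a pointer-sized field at a given byte offset inside an object.
 * memcpy() is used so the sketch avoids aliasing problems in userspace.
 */
static void *read_owner_at(const void *obj, size_t offset)
{
	void *p;

	memcpy(&p, (const char *)obj + offset, sizeof(p));
	return p;
}
```

With an object built as struct conn_kernel_view, reading "owner" at
offsetof(struct conn_module_view, owner) returns whatever bytes happen to
sit at that offset (here, the flags field) rather than the pointer that
was stored. That is the kind of silent breakage a seed mismatch would
cause.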

We've hit gcc bugs many times before - and the plugins are just new
opportunities to hit cases that have gotten a lot less testing than
the "normal" code flow has.

The structleak plugin is much less likely to be a problem (simply
because it's a much simpler plugin), but hey, something being NULL
when it shouldn't possibly be might be a stray "leak initialization".

So since you seem to be able to reproduce this _reasonably_ easily,
it's definitely worth checking that it still reproduces even without
the gcc plugins.

Just to narrow it down a bit.
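
As a sketch of that test build, a .config fragment along these lines
would turn off the three plugin options named above. It assumes you edit
.config by hand and then run "make olddefconfig" so Kconfig can resolve
the dependencies. Any out-of-tree module would also need rebuilding
against the resulting kernel.

```
# CONFIG_GCC_PLUGIN_RANDSTRUCT is not set
# CONFIG_GCC_PLUGIN_STRUCTLEAK is not set
# CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL is not set
```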

Linus