Re: Purgatory compile flag changes apparently causing Kexec relocation overflows

From: Nick Desaulniers
Date: Wed Aug 28 2019 - 18:07:16 EST


On Wed, Aug 28, 2019 at 2:51 PM Nick Desaulniers
<ndesaulniers@xxxxxxxxxx> wrote:
>
> On Wed, Aug 28, 2019 at 12:42 PM Steve Wahl <steve.wahl@xxxxxxx> wrote:
> >
> > Please CC me on responses to this.
> >
> > I normally would do more diligence on this, but the timing is such
> > that I think it's better to get this out sooner.
> >
> > With the tip of the tree from https://github.com/torvalds/linux.git (a
> > few days old, most recent commit fetched is
> > bb7ba8069de933d69cb45dd0a5806b61033796a3), I'm seeing "kexec: Overflow
> > in relocation type 11 value 0x11fffd000" when I try to load a crash
> > kernel with kdump. This seems to be caused by commit
> > 059f801a937d164e03b33c1848bb3dca67c0b04, which changed the compiler

Is this the correct SHA from mainline? I assume you meant
commit b059f801a937 ("x86/purgatory: Use CFLAGS_REMOVE rather than
reset KBUILD_CFLAGS")

> > flags used to compile purgatory.ro, apparently creating 32 bit
> > relocations for things that aren't necessarily reachable with a 32 bit
> > reference. My guess is this only occurs when the crash kernel is
> > located outside 32-bit addressable physical space.
> >
> > I have so far verified that the problem occurs with that commit, and
> > does not occur with the previous commit. For this commit, Thomas
> > Gleixner mentioned a few of the changed flags should have been looked
> > at twice. I have not gone so far as to figure out which flags cause
> > the problem.
> >
> > The hardware in use is an HPE Superdome Flex with 48 * 32GiB DIMMs
> > (total 1536 GiB).
> >
> > One example of the exact error messages seen:
> >
> > 2019-08-28T13:42:39.308110-05:00 uv4test14 kernel: [ 45.137743] kexec: Overflow in relocation type 11 value 0x17f7affd000
> > 2019-08-28T13:42:39.308123-05:00 uv4test14 kernel: [ 45.137749] kexec-bzImage64: Loading purgatory failed
>
> Thanks for the report and sorry for the breakage. Can you please send
> me more information for how to precisely reproduce the issue? I'm
> happy to look into fixing it.
>
> Let me go dig up the different listed flags. Steve, it may be fastest
> for you to test re-adding them in your setup to see which one is
> important.

https://lkml.org/lkml/2019/7/26/198 was the list. The retpoline
flags were added in the final version of the patch that landed. It's
not immediately clear to me which of those 4 changed flags would
result in the error that you're observing, but if you could test them
quickly to see which restores working behavior, we could triple check
it on our end and submit it.
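
For reference, relocation type 11 on x86-64 is R_X86_64_32S, a 32-bit
relocation whose stored value is sign-extended to 64 bits. Below is a
minimal sketch of the kind of range check that would reject the values
in the report. The helper names (fits_u32, fits_s32, reloc_fits) are
hypothetical and this is not the kernel's actual code; it also assumes
a two's-complement target, which holds on x86-64.

```c
#include <stdbool.h>
#include <stdint.h>

/* ELF relocation type numbers from the x86-64 psABI. */
#define R_X86_64_32   10  /* 32-bit field, zero-extended to 64 bits */
#define R_X86_64_32S  11  /* 32-bit field, sign-extended to 64 bits */

/* Hypothetical helper: does value fit a zero-extended 32-bit field? */
static bool fits_u32(uint64_t value)
{
    return value <= UINT32_MAX;
}

/* Hypothetical helper: does value fit a sign-extended 32-bit field? */
static bool fits_s32(uint64_t value)
{
    int64_t v = (int64_t)value;  /* reinterpret as signed (two's complement) */
    return v >= INT32_MIN && v <= INT32_MAX;
}

/* Hypothetical dispatcher: only the two 32-bit types are modeled here. */
static bool reloc_fits(unsigned int type, uint64_t value)
{
    switch (type) {
    case R_X86_64_32:
        return fits_u32(value);
    case R_X86_64_32S:
        return fits_s32(value);
    default:
        return true;  /* other relocation types are not modeled in this sketch */
    }
}
```

With the values from the report, reloc_fits(R_X86_64_32S, 0x11fffd000)
and reloc_fits(R_X86_64_32S, 0x17f7affd000) are both false, since each
value needs more than 32 bits. That is consistent with Steve's guess
about a crash kernel placed above 4 GiB, though it does not by itself
show which flag is responsible.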

>
> Tglx, if you want to revert the above patches, I'm ok with that. It's
> important that we fix the issue eventually that my patches were meant
> to address, but precisely *when* it's solved isn't critical; worst
> case, our kernels can carry out-of-tree patches until the issue is
> completely resolved.

--
Thanks,
~Nick Desaulniers