Re: 4.14.9 doesn't boot (regression)

From: Linus Torvalds
Date: Fri Dec 29 2017 - 20:00:33 EST


On Fri, Dec 29, 2017 at 4:10 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>
> Double faults use IST, so a double fault that double faults will effectively just start over rather than eventually running out of stack and triple faulting.
>
> But check out the registers. We have RSP = ...28fd8 and CR2 = ...27f08.
> IOW the double fault stack is ...28000 - ...28fff and we're somehow getting
> a failed page fault a couple hundred bytes below the bottom of the IST stack.
> IOW, I think we're just stuck in a neverending loop of stack overflows.

Ahh, good catch. This feels like it might finally be explaining things.
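
To make the arithmetic on those registers concrete, here is a small Python sketch. It uses only the low hex digits quoted above (the upper address bits are elided in the report and are not reconstructed here), and the constant and function names are made up for illustration.

```python
# Low hex digits only, exactly as quoted in the report.
# The upper address bits are elided there and deliberately left out here.
STACK_LOW = 0x28000   # lowest address of the double fault (IST) stack
STACK_HIGH = 0x28FFF  # highest address of the double fault (IST) stack
RSP = 0x28FD8         # stack pointer reported in the oops
CR2 = 0x27F08         # faulting address reported in the oops


def bytes_below_stack(addr, stack_low=STACK_LOW):
    """Return how far addr lies below the stack's lowest address (0 if not below)."""
    return max(0, stack_low - addr)


rsp_in_stack = STACK_LOW <= RSP <= STACK_HIGH
cr2_gap = bytes_below_stack(CR2)

print(f"RSP inside IST stack: {rsp_in_stack}")
print(f"CR2 is {cr2_gap} bytes (0x{cr2_gap:x}) below the stack")
```

So the faulting access is 248 bytes under the lowest valid address of the IST stack, while RSP itself is still inside it.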

> (Also, Josh, the oops code should have printed the contents of the struct pt_regs at the top of the DF stack. Any idea why it didn't?)
>
> Toralf, can you send the complete output of:
>
> objdump -dr arch/x86/kernel/traps.o
>
> From the build tree of a nonworking kernel?

Alexander made one of his failing kernels available earlier:

https://www.dropbox.com/s/yesupqgig3uxf73/linux-4.15-rc5%2B.tar.xz?dl=0

and yes, there's something seriously wrong there. Doing a disassembly
on "do_double_fault()" shows:

ffffffff8101bda0 <do_double_fault>:
ffffffff8101bda0: 41 54 push %r12
ffffffff8101bda2: 55 push %rbp
ffffffff8101bda3: 53 push %rbx
ffffffff8101bda4: 48 81 ec 20 10 00 00 sub $0x1020,%rsp
ffffffff8101bdab: 48 83 0c 24 00 orq $0x0,(%rsp)
ffffffff8101bdb0: 48 81 c4 20 10 00 00 add $0x1020,%rsp

WTF? That's bogus crap, and not ok in the kernel. Doing a stack probe
below the stack by subtracting 4128 bytes from the stack pointer and then
oring it, and then resetting the stack pointer again is just crazy.
And it's definitely not ever going to work for the kernel that has a
limited stack.
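
A rough Python sketch of why that probe cannot work on a one-page stack such as the double fault stack quoted above. The addresses are the low hex digits from the report only, the helper name is hypothetical, and the "implied RSP" line assumes the reported CR2 came from this exact probe, which the report does not confirm.

```python
PAGE_SIZE = 0x1000     # 4096 bytes: the IST stack spans ...28000 - ...28fff
PROBE_OFFSET = 0x1020  # from "sub $0x1020,%rsp" in the disassembly


def probe_address(rsp, offset=PROBE_OFFSET):
    """Address touched by 'sub $offset,%rsp; orq $0x0,(%rsp)'."""
    return rsp - offset


stack_low = 0x28000                     # low digits only, upper bits elided
stack_high = stack_low + PAGE_SIZE - 1

# Try every possible stack pointer value inside the one-page stack.
all_below = all(
    probe_address(rsp) < stack_low
    for rsp in range(stack_low, stack_high + 1)
)

# Assumption: if CR2 = ...27f08 was caused by this probe, RSP just
# before the "sub" would have been CR2 + PROBE_OFFSET.
implied_rsp = 0x27F08 + PROBE_OFFSET

print(f"probe offset: {PROBE_OFFSET} bytes")
print(f"every probe lands below the stack: {all_below}")
print(f"implied RSP before the probe: ...{implied_rsp:x}")
```

Since the offset (4128) is larger than the whole stack (4096), the probe touches memory below the stack no matter where in the stack RSP starts.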

So yes, it's a terminally broken compiler from hell. I assume gentoo
has applied some completely broken security patch to their compiler,
turning said compiler into complete garbage.

Doing some trivial grepping on the disassembly in that vmlinux file,
there's tons of those "let's probe more than a page below the stack"
issues. The biggest offset I found was 0x1400.

That one happened to be in do_sys_poll().
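
One way such a search could be scripted is sketched below in Python. This is not the grep that was actually run; the regular expressions, the function name, and the one-page threshold are assumptions, and it only recognizes the exact "sub $N,%rsp" followed directly by "or $0x0,(%rsp)" pattern shown in the disassembly above.

```python
import re

# Patterns for objdump -d style text (AT&T syntax), as in the listing above.
FUNC = re.compile(r"^[0-9a-f]+ <([^>]+)>:")
SUB_RSP = re.compile(r"\bsub\s+\$0x([0-9a-f]+),%rsp")
PROBE = re.compile(r"\borq?\s+\$0x0,\(%rsp\)")


def find_stack_probes(lines, min_offset=0x1000):
    """Yield (function, offset) for each 'sub $N,%rsp' that is immediately
    followed by 'or $0x0,(%rsp)' and has N greater than min_offset."""
    func = None
    pending = None  # offset of a 'sub $N,%rsp' seen on the previous line
    for line in lines:
        m = FUNC.match(line)
        if m:
            func = m.group(1)
            pending = None
            continue
        if pending is not None and PROBE.search(line) and pending > min_offset:
            yield func, pending
        m = SUB_RSP.search(line)
        pending = int(m.group(1), 16) if m else None


sample = """\
ffffffff8101bda0 <do_double_fault>:
ffffffff8101bda0: 41 54 push %r12
ffffffff8101bda2: 55 push %rbp
ffffffff8101bda3: 53 push %rbx
ffffffff8101bda4: 48 81 ec 20 10 00 00 sub $0x1020,%rsp
ffffffff8101bdab: 48 83 0c 24 00 orq $0x0,(%rsp)
ffffffff8101bdb0: 48 81 c4 20 10 00 00 add $0x1020,%rsp
""".splitlines()

for name, offset in find_stack_probes(sample):
    print(f"{name}: probe at rsp - 0x{offset:x}")
```

To run it over a whole kernel image, the output of "objdump -d vmlinux" could be saved to a file and its lines passed to find_stack_probes() in place of the sample.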

> Also, you wouldn't happen to be using Gentoo perchance?

Yes, several people involved are using gentoo. Maybe everybody.

> I already have two reports of a Gentoo system miscompiling the vDSO
> due to Gentoo enabling -fstack-check and GCC generating stack check
> code that is highly suboptimal, actively incorrect, and doesn't even
> manage to check the stack in a particularly helpful way.

Yes. Good. I think you root-caused it.

Good. I was not feeling so happy about this bug report, but now I can
firmly just blame the gentoo compiler for having some shit-for-brains
"feature".

Linus