Re: Random panic in load_balance() with 3.16-rc

From: Alexei Starovoitov
Date: Thu Jul 24 2014 - 23:55:47 EST


On Fri, Jul 25, 2014 at 10:25:03AM +0900, Michel Dänzer wrote:
> [ Adding the Debian kernel and gcc teams to Cc ]
>
> > movq $load_balance_mask, -136(%rbp) #, %sfp
> > subq $184, %rsp #,
> >
> > Anyway, this is not a kernel bug. This is your compiler creating
> > completely broken code. We may need to add a warning to make sure
> > nobody compiles with gcc-4.9.0, and the Debian people should probably
> > downgrate their shiny new compiler.
>
> Attached is fair.s from Debian gcc 4.8.3-5. Does that look better? I'm
> going to try reproducing the problem with a kernel built by that now.

4.8 and 4.7 don't hit the problem on this test.
4.9 with -O2 compiles this file ok. 4.9 with -Os triggers it.

-mno-red-zone only affected prologue emition in gcc. This part didn't
change between the releases. So the bug is quite deep.
What seems to be happening is that 2nd pass of instruction scheduler
(after emit prologue and reg alloc) is ignoring barrier properties
of 'subq $184, %rsp' and moving 'movq $.., -136(%rbp)' instruction
ahead of it. afaik rtl sched was never aware of 'red-zone'.
As an ex-compiler guy, I'm worried that this bug exists in earlier
releases. rtl backend guys need to take a serious look at it.
imo this is very serious bug, since broken red-zone is extremelly
hard to debug.
There are two weak test in gcc testsuite related to -mno-red-zone,
but not a single test that actually check that it is doing
the right thing. It is scary. I hope I'm wrong with this analysis.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/