Re: [PATCH 0/7] ARM: hacks for link-time optimization

From: Peter Zijlstra
Date: Tue Dec 18 2018 - 04:18:40 EST

On Mon, Dec 17, 2018 at 04:08:00PM -0800, Andi Kleen wrote:
> On Mon, Dec 17, 2018 at 11:50:20PM +0100, Peter Zijlstra wrote:
> > On Tue, Feb 20, 2018 at 10:59:47PM +0100, Arnd Bergmann wrote:
> > > Hi Nico, all,
> > >
> > > I was playing with ARM link-time optimization handling earlier this
> > > month, and eventually got it to build cleanly with randconfig kernels,
> > > but ended up with a lot of ugly hacks to actually pull it off.
> >
> > How are we dealing with the fact that LTO can break RCU in very subtle
> > and scary ways?
> >
> > Do we have a compiler guy on board that has given us a compiler switch
> > that kills that optimization (and thereby guarantees that behaviour for
> > future compilers etc..) ?
> Can you actually define what optimization you are worried about?
> If there are optimizations that cause problems they likely happen
> even without LTO inside files. The only difference with LTO is that it
> does them between files too.

In particular turning an address-dependency into a control-dependency,
which is something allowed by the C language, since it doesn't recognise
these concepts as such.

The 'optimization' is allowed currently, but LTO will make it much more
likely since it will have a much wider view of things. Esp. when combined
with PGO.

Specifically; if you have something like:

	int idx;
	struct object objs[2];

the statement:

	val = objs[idx & 1].ponies;

which you 'need' to be translated like:

	struct object *obj = objs;
	obj += (idx & 1);
	val = obj->ponies;

Such that the load of obj->ponies depends on the load of idx. However
our dear compiler is allowed to make it:

	if (idx & 1)
		obj = &objs[1];
	else
		obj = &objs[0];

	val = obj->ponies;

Because C doesn't recognise this as being different. However this is
utterly broken, because in this translation we can speculate the load
of obj->ponies such that it no longer depends on the load of idx, which
breaks RCU.

Note that further 'optimization' is possible and the compiler could even
make it:

	if (idx & 1)
		val = objs[1].ponies;
	else
		val = objs[0].ponies;

Now, granted, this is a fairly artificial example, but it does
illustrate the exact problem.
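
For anyone who wants to poke at this, here is a rough userspace sketch of
the three forms side by side. The struct layout, the initial values and
the function names are made up for illustration. All three return the same
value for every idx, which is exactly why the C abstract machine lets the
compiler pick any of them; the difference only shows up in how the
hardware is allowed to order the loads when another CPU is writing.

```c
#include <assert.h>

/* Hypothetical stand-in for 'struct object'; only .ponies matters here. */
struct object {
	int ponies;
};

static struct object objs[2] = { { .ponies = 10 }, { .ponies = 20 } };

/* Address dependency: the address of the load is computed from idx. */
static int load_addr_dep(int idx)
{
	struct object *obj = objs;

	obj += (idx & 1);
	return obj->ponies;
}

/* Control dependency: idx only picks a branch, both addresses are constants. */
static int load_ctrl_dep(int idx)
{
	struct object *obj;

	if (idx & 1)
		obj = &objs[1];
	else
		obj = &objs[0];

	return obj->ponies;
}

/* Same thing with the pointer folded away entirely. */
static int load_folded(int idx)
{
	if (idx & 1)
		return objs[1].ponies;
	else
		return objs[0].ponies;
}

/* Single-threaded, the three forms are indistinguishable. */
static void check_same_result(void)
{
	for (int idx = 0; idx < 4; idx++) {
		assert(load_addr_dep(idx) == load_ctrl_dep(idx));
		assert(load_addr_dep(idx) == load_folded(idx));
	}
}
```

Note that nothing in this sketch can demonstrate the breakage itself; that
needs a concurrent writer and a weakly ordered machine. It only shows why
the compiler considers the forms interchangeable.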

The more the compiler can see of the complete program, the more likely
it can make inferences like this, esp. when coupled with PGO.

Now, we're (usually) very careful to wrap things in READ_ONCE() and
rcu_dereference() and the like, which makes it harder on the compiler
(because 'volatile' is special), but nothing really stops it from doing it.
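
To make that concrete, a minimal sketch, assuming a userspace
approximation of READ_ONCE() (the macro name below is made up, it is not
the kernel's definition, and it relies on the GNU __typeof__ extension).
The volatile access pins down the load of idx itself, but the dependent
load that follows is ordinary C and carries no annotation that its address
must be derived from the loaded value:

```c
#include <assert.h>

/* Assumed stand-in for the kernel's READ_ONCE(); illustration only. */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical object type and data, as in the example above. */
struct object {
	int ponies;
};

static struct object objs[2] = { { .ponies = 10 }, { .ponies = 20 } };
static int idx;

static int read_ponies(void)
{
	/* The volatile access makes the compiler emit one real load of idx. */
	int i = READ_ONCE_SKETCH(idx);

	/*
	 * This is a plain array access. The language does not say the
	 * address has to be computed from i rather than selected by a
	 * branch on i, so the dependency is preserved only by convention.
	 */
	return objs[i & 1].ponies;
}
```

Setting idx to 0 or 1 and calling read_ponies() returns 10 or 20
respectively, whichever translation the compiler chose.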

Paul has been trying to beat clue into the language people, but given
he's been at it for 10 years now, and there's no resolution, I figure we
ought to get compiler implementations to give us a knob.