Re: [RFC][PATCH 0/5] arch: atomic rework

From: Linus Torvalds
Date: Tue Feb 18 2014 - 17:15:02 EST

Next message: Rafael J. Wysocki: "Re: pci-3.14 resource alloc"
Previous message: riel: "[PATCH -mm 1/3] sched,numa: add cond_resched to task_numa_work"
In reply to: Peter Zijlstra: "Re: [RFC][PATCH 0/5] arch: atomic rework"
Next in thread: Torvald Riegel: "Re: [RFC][PATCH 0/5] arch: atomic rework"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, Feb 18, 2014 at 1:21 PM, Torvald Riegel <triegel@xxxxxxxxxx> wrote:
>>
>> So imagine that you have some clever global optimizer that sees that
>> the program never ever actually sets the dirty bit at all in any
>> thread, and then uses that kind of non-local knowledge to make
>> optimization decisions. THAT WOULD BE BAD.
>>
>> Do you see what I'm aiming for?
>
> Yes, I do. But that seems to be "volatile" territory. It crosses the
> boundaries of the abstract machine, and thus is input/output. Which
> fraction of your atomic accesses can read values produced by hardware?
> I would still suppose that lots of synchronization is not affected by
> this.

The "hardware can change things" case is indeed pretty rare.

But quite frankly, even when it isn't hardware, as far as the compiler
is concerned you have the exact same issue - you have TLB faults
happening on other CPU's that do the same thing asynchronously using
software TLB fault handlers. So *semantically*, it really doesn't make
any difference what-so-ever if it's a software TLB handler on another
CPU, a microcoded TLB fault, or an actual hardware path.

So if the answer for all of the above is "use volatile", then I think
that means that the C11 atomics are badly designed.

The whole *point* of atomic accesses is that stuff like above should
"JustWork(tm)"

> Do you perhaps want a weaker form of volatile? That is, one that, for
> example, allows combining of two adjacent loads of the dirty bits, but
> will make sure that this is treated as if there is some imaginary
> external thread that it cannot analyze and that may write?

Yes, that's basically what I would want. And it is what I would expect
an atomic to be. Right now we tend to use "ACCESS_ONCE()", which is a
bit of a misnomer, because technically we really generally want
"ACCESS_AT_MOST_ONCE()" (but "once" is what we get, because we use
volatile, and is a hell of a lot simpler to write ;^).

So we obviously use "volatile" for this currently, and generally the
semantics we really want are:

- the load or store is done as a single access ("atomic")

- the compiler must not try to re-materialize the value by reloading
it from memory (this is the "at most once" part)

and quite frankly, "volatile" is a big hammer for this. In practice it
tends to work pretty well, though, because in _most_ cases, there
really is just the single access, so there isn't anything that it
could be combined with, and the biggest issue is often just the
correctness of not re-materializing the value.

And I agree - memory ordering is a totally separate issue, and in fact
we largely tend to consider it entirely separate. For cases where we
have ordering constraints, we either handle those with special
accessors (ie "atomic-modify-and-test" helpers tend to have some
serialization guarantees built in), or we add explicit fencing.

But semantically, C11 atomic accessors *should* generally have the
correct behavior for our uses.

If we have to add "volatile", that makes atomics basically useless. We
already *have* the volatile semantics, if atomics need it, that just
means that atomics have zero upside for us.

>> But *local* optimizations are fine, as long as they follow the obvious
>> rule of not actually making changes that are semantically visible.
>
> If we assume that there is this imaginary thread called hardware that
> can write/read to/from such weak-volatile atomics, I believe this should
> restrict optimizations sufficiently even in the model as specified in
> the standard.

Well, what about *real* threads that do this, but that aren't
analyzable by the C compiler because they are written in another
language entirely (inline asm, asm, perl, INTERCA:. microcode,
PAL-code, whatever?)

I really don't think that "hardware" is necessary for this to happen.
What is done by hardware on x86, for example, is done by PAL-code
(loaded at boot-time) on alpha, and done by hand-tuned assembler fault
handlers on Sparc. The *effect* is the same: it's not visible to the
compiler. There is no way in hell that the compiler can understand the
hand-tuned Sparc TLB fault handler, even if it parsed it.

>> IOW, I would *revel* in the fact that different machines are
>> different, and basically just describe the "stupid" code generation.
>> You'd get the guaranteed semantic baseline, but you'd *also* be able
>> to know that whatever architecture guarantees you have would remain.
>> Without having to describe those architecture issues.
>
> Would you be okay with this preventing lots of optimizations a compiler
> otherwise could do? Because AFAICT, this spreads into non-synchronizing
> code via the dependency-tracking, for example.

Actually, it probably wouldn't really hurt code generation.

The thing is, you really have three cases:

- architectures that have weak memory ordering and fairly stupid
hardware (power, some day maybe ARM)

- architectures that expose a fairly strong memory ordering, but
reorders aggressively in hardware by tracking cacheline lifetimes
until instruction retirement (x86)

- slow hardware (current ARM, crap Atom hardware etc embedded devices
that don't really do either)

and let's just agree that anybody who does high-performance work
doesn't really care about the third case, ok?

The compiler really wouldn't be that constrained in ordering on a
weakly ordered machine: sure, you'll have to add "lwsync" on powerpc
pretty much every time you have an ordered atomic read, but then you
could optimize them away for certain cases (like "oh, it was a
'consume' read, and the only consumer is already data-dependent, so
now we can remove the lwsync"), so you'd end up with pretty much the
code you looked for anyway.

And the HPC people who use the "relaxed" model wouldn't see the extra
memory barriers anyway.

And on a non-weakly ordered machine like x86, there are going to be
hardly any barriers in the first place, and the hardware will do the
right thing wrt ordering, so it's not like the compiler has gotten
prevented from any big optimization wins.

So I really think a simpler model that tied to the hardware directly -
which you need to do *anyway* since the memory ordering constraints
really are hw-dependent - would have been preferable. The people who
don't want to take advantage of hardware-specific guarantees would
never know the difference (they'll rely purely on pairing
acquire/release and fences properly - the way they *already* have to
in the current C11 model), and the people who _do_ care about
particular hw guarantees would get them by definition.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Rafael J. Wysocki: "Re: pci-3.14 resource alloc"
Previous message: riel: "[PATCH -mm 1/3] sched,numa: add cond_resched to task_numa_work"
In reply to: Peter Zijlstra: "Re: [RFC][PATCH 0/5] arch: atomic rework"
Next in thread: Torvald Riegel: "Re: [RFC][PATCH 0/5] arch: atomic rework"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]