Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

From: Jonas Oberhauser
Date: Wed Mar 01 2023 - 05:53:00 EST




On 2/28/2023 4:40 PM, Paul E. McKenney wrote:
On Tue, Feb 28, 2023 at 09:49:07AM +0100, Jonas Oberhauser wrote:

On 2/27/2023 11:21 PM, Paul E. McKenney wrote:
On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
On 2/27/2023 8:40 PM, Andrea Parri wrote:
The LKMM doesn't believe that a control or data dependency orders a
plain write after a marked read. Hence in this test it thinks that P1's
store to u0 can happen before the load of x1. I don't remember why we
did it this way -- probably we just wanted to minimize the restrictions
on when plain accesses can execute. (I do remember the reason for
making address dependencies induce order; it was so RCU would work.)

The patch below will change what the LKMM believes. It eliminates the
positive outcome of the litmus test and the data race. Should it be
adopted into the memory model?
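For concreteness, here is a minimal litmus-test sketch of that kind of pattern. This is not the test referenced above; the shape and names are made up purely for illustration, with P1's plain write being control-dependent on a marked read:

C dep-plain-write-sketch

(*
 * Illustrative sketch only.  P1's plain write to y is control-dependent on
 * a marked read of x.  If the model does not let that dependency order the
 * plain write, the write to y is not ordered against P0's WRITE_ONCE() to y,
 * even though P1 observed P0's release store to x.
 *)

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*y, 1);
	smp_store_release(x, 1);
}

P1(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*x);
	if (r1)
		*y = 2;
}

exists (1:r1=1 /\ y=1)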
(Unpopular opinion, I know,) it should drop dependency ordering, not
add/promote it.

Andrea
Maybe not as unpopular as you think... :)
But either way IMHO it should be consistent: either take all the
dependencies that really hold and add them, or drop them all.
In the latter case, RCU should change to an acquire barrier. (Also, one
would then have to deal with OOTA in some other way.)
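To make the "RCU should change to an acquire barrier" point concrete, here is a rough kernel-C sketch of the two reader styles. The names (struct foo, gp) are made up for illustration; this is not code from any patch:

#include <linux/compiler.h>	/* READ_ONCE() */
#include <linux/rcupdate.h>	/* rcu_read_lock(), rcu_dereference() */
#include <asm/barrier.h>	/* smp_load_acquire() */

struct foo {
	int a;
};

/* Hypothetical pointer, published elsewhere with rcu_assign_pointer(). */
static struct foo *gp;

/* Today: the address dependency from rcu_dereference() orders the p->a load. */
static int read_with_dependency_ordering(void)
{
	struct foo *p;
	int val = -1;

	rcu_read_lock();
	p = rcu_dereference(gp);
	if (p)
		val = READ_ONCE(p->a);
	rcu_read_unlock();
	return val;
}

/* Without dependency ordering: an acquire load, which costs a fence on e.g. PowerPC. */
static int read_with_acquire(void)
{
	struct foo *p;
	int val = -1;

	rcu_read_lock();
	p = smp_load_acquire(&gp);
	if (p)
		val = READ_ONCE(p->a);
	rcu_read_unlock();
	return val;
}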

Generally my position is that unless there's a real-world benchmark with
proven performance benefits from relying on dependency ordering, one should
use an acquire barrier. I haven't yet come across such a case, but maybe one
of you has...
https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).

Though this is admittedly for ARMv7 and PowerPC.

Thanks for the link.

It's true that on architectures that don't have an acquire load (and have to
use a fence), the penalty will be bigger.

But the more obvious discussion would be what constitutes a real-world
benchmark : )
In my experience you can get a lot of performance benefits out of optimizing
barriers in code if all you execute is that code.
But once you embed that into a real-world application, often 90%-99% of the
time will be spent in the business logic, not in the data structure.

And then the benefits suddenly disappear.
Note also that many barriers are much cheaper when there's no contention.

Because of that, making optimization decisions based on microbenchmarks can
sometimes lead to a very poor "time invested" vs "total product improvement"
ratio.
All true, though that 2x and 4x should be worth something.

The most egregious case we had (not barrier-related, but cache-related) was something like ~100x in specific benchmarks and then somewhere around 1% system-wide. I think the issue was that in the larger system we couldn't keep the cache hot, so our greatly improved data locality was being diluted.
But of course it always depends on how much that component actually contributes to overall system performance.
That, IMHO, is one of the measurements to take before investing heavily in optimizations.

Surprisingly, many people don't want to do that. I think it's something about the misleading calculus of "2 months implementing the optimization + 2 weeks building robust benchmarks & profiling infrastructure > 2 months implementing the optimization", when the real comparison is "2 weeks building robust benchmarks & profiling infrastructure + 0 months implementing a useless optimization < 2 months implementing the optimization", which seems to be the more common case.


The real-world examples I know of involved garbage collectors, and the
improvement was said to be a few percent system-wide. But that was a
verbal exchange, so I don't have a citation for you.

Thanks for the example; it sounds very reasonable (at least for platforms like PowerPC).
GC has all the hallmarks of a good optimization target: measurable impact on system-wide throughput and latency, and pointer chasing (= dependency ordering) being a hot part of the algorithm.
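As a (made-up) illustration of why pointer chasing is where dependency ordering pays off: in a mark loop like the sketch below, every hop is a dependent load, so dependency ordering comes essentially for free on most architectures, while acquire semantics would add a fence per hop on PowerPC or ARMv7. The node layout is hypothetical:

#include <linux/compiler.h>	/* READ_ONCE(), WRITE_ONCE() */

struct node {
	struct node *next;
	int marked;
};

static void mark_list(struct node *head)
{
	struct node *n;

	/* Each n->next load carries an address dependency into the next hop. */
	for (n = head; n; n = READ_ONCE(n->next))
		WRITE_ONCE(n->marked, 1);
}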

Best wishes,
jonas