Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

From: Paul E. McKenney
Date: Wed Mar 01 2023 - 20:43:17 EST


On Wed, Mar 01, 2023 at 11:52:09AM +0100, Jonas Oberhauser wrote:
>
>
> On 2/28/2023 4:40 PM, Paul E. McKenney wrote:
> > On Tue, Feb 28, 2023 at 09:49:07AM +0100, Jonas Oberhauser wrote:
> > >
> > > On 2/27/2023 11:21 PM, Paul E. McKenney wrote:
> > > > On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
> > > > > On 2/27/2023 8:40 PM, Andrea Parri wrote:
> > > > > > > The LKMM doesn't believe that a control or data dependency orders a
> > > > > > > plain write after a marked read. Hence in this test it thinks that P1's
> > > > > > > store to u0 can happen before the load of x1. I don't remember why we
> > > > > > > did it this way -- probably we just wanted to minimize the restrictions
> > > > > > > on when plain accesses can execute. (I do remember the reason for
> > > > > > > making address dependencies induce order; it was so RCU would work.)
> > > > > > >
> > > > > > > The patch below will change what the LKMM believes. It eliminates the
> > > > > > > positive outcome of the litmus test and the data race. Should it be
> > > > > > > adopted into the memory model?
> > > > > > (Unpopular opinion I know,) it should drop dependencies ordering, not
> > > > > > add/promote it.
> > > > > >
> > > > > > Andrea
> > > > > Maybe not as unpopular as you think... :)
> > > > > But either way IMHO it should be consistent; either take all the
> > > > > dependencies that are true and add them, or drop them all.
> > > > > In the latter case, RCU should change to an acquire barrier. (also, one
> > > > > would have to deal with OOTA in some yet different way).
> > > > >
> > > > > Generally my position is that unless there's a real-world benchmark with
> > > > > proven performance benefits of relying on dependency ordering, one should
> > > > > use an acquire barrier. I haven't yet met such a case, but maybe one of you
> > > > > has...
> > > > https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).
> > > >
> > > > Though this is admittedly for ARMv7 and PowerPC.
> > > >
> > > Thanks for the link.
> > >
> > > It's true that on architectures that don't have an acquire load (and have to
> > > use a fence), the penalty will be bigger.
> > >
> > > But the more obvious discussion would be what constitutes a real-world
> > > benchmark : )
> > > In my experience you can get a lot of performance benefits out of optimizing
> > > barriers in code if all you execute is that code.
> > > But once you embed that into a real-world application, often 90%-99% of time
> > > spent will be in the business logic, not in the data structure.
> > >
> > > And then the benefits suddenly disappear.
> > > Note that a lot of barriers are a lot cheaper as well when there's no
> > > contention.
> > >
> > > Because of that, making optimization decisions based on microbenchmarks can
> > > sometimes lead to a very poor "time invested" vs "total product improvement"
> > > ratio.
> > All true, though that 2x and 4x should be worth something.
>
> I think the most egregious case we had (not barrier related, but cache
> related) was something like ~100x in specific benchmarks and then I think
> somewhere around 1% system-wide. I think the issue was that in the larger
> system, we couldn't keep the cache hot, so our greatly improved data
> locality was being diluted.
> But of course it always depends on how much that component actually
> contributes to the overall system performance.
> Which IMHO should be one of the measurements taken before starting to invest
> heavily into optimizations.

Agreed, it is all too easy for profound local optimizations to have
near-zero global effect.

> Surprisingly, many people don't want to do that. I think it's something
> about the misleading calculus of "2 months implementing the optimization + 2
> weeks building robust benchmarks & profiling infrastructure > 2 months
> implementing the optimization" instead of "2 weeks building robust
> benchmarks & profiling infrastructure + 0 months implementing a useless
> optimization < 2 months implementing the optimization", which seems to be
> the more common case.

Or, for that matter, having the investigation locate a five-minute change
that produces large benefits.

> > The real-world examples I know of involved garbage collectors, and the
> > improvement was said to be a few percent system-wide. But that was a
> > verbal exchange, so I don't have a citation for you.
>
> Thanks for the example, it sounds very reasonable (at least for platforms
> like PowerPC).
> GC has all the hallmarks of a good optimization target: measurable impact on
> system wide throughput and latency, and pointer chasing (=dependency
> ordering) being a hot part inside the algorithm.

I believe that the target was ARM. It was definitely not PowerPC.

But yes, GCs are indeed intense exercise for pointer chasing.

Thanx, Paul