Re: [PATCH v2] make buffer_locked provide an acquire semantics

From: Paul E. McKenney
Date: Sun Jul 31 2022 - 13:32:40 EST


On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> [ Will and Paul, question for you below ]
>
> On Sun, Jul 31, 2022 at 8:08 AM Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> >
> > Also, there is this pattern present several times:
> >     wait_on_buffer(bh);
> >     if (!buffer_uptodate(bh))
> >             err = -EIO;
> > It may be possible that buffer_uptodate is executed before wait_on_buffer
> > and it may return spurious error.
>
> I'm not convinced that's actually valid.
>
> They are testing the same memory location, and I don't think our
> memory ordering model allows for _that_ to be out-of-order. Memory
> barriers are for accesses to different memory locations.

Yes, aligned same-sized marked accesses to a given location are always
executed so as to be consistent with some global order. Please note the
"marked accesses". The compiler can rearrange unmarked reads and in
some cases can discard unmarked writes. For more information, please
see tools/memory-model/Documentation/recipes.txt's "Single CPU or single
memory location" section.

> Even alpha is specified to be locally ordered wrt *one* memory
> location, including for reads (See table 5-1: "Processor issue order",
> and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> new value, a subsequent read is not allowed to see an older one - even
> without a memory barrier.
>
> Will, Paul? Maybe that's only for overlapping loads/stores, not for
> loads/loads. Because maybe alpha for once isn't the weakest possible
> ordering.

The "bad boy" in this case is Itanium, which can do some VLIW reordering
of accesses. Or could; I am not sure whether newer Itanium hardware
still does this. But this is why Itanium compilers made volatile loads
use the ld.acq instruction.

Which means that aligned same-sized marked accesses to a single location
really do execute consistently with some global ordering, even on Itanium.

That said, I confess that I am having a hard time finding the
buffer_locked() definition. So if buffer_locked() uses normal C-language
accesses to sample the BH_Lock bit of the ->b_state field, then yes,
there could be a problem. The compiler would then be free to reorder
calls to buffer_locked() because it could then assume that no other
thread was touching that ->b_state field.

> I didn't find this actually in our documentation, so maybe other
> architectures allow it? Our docs talk about "overlapping loads and
> stores", and maybe that was meant to imply that "load overlaps with
> load" case, but it really reads like load-vs-store, not load-vs-load.

I placed the relevant text from recipes.txt at the end of this email.

> But the patch looks fine, though I agree that the ordering in
> __wait_on_buffer should probably be moved into
> wait_on_bit/wait_on_bit_io.
>
> And would we perhaps want the bitops to have the different ordering
> versions? Like "set_bit_release()" and "test_bit_acquire()"? That
> would seem to be (a) cleaner and (b) possibly generate better code for
> architectures where that makes a difference?

As always, I defer to the architecture maintainers on this one.

Thanx, Paul

------------------------------------------------------------------------

Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order. There are (as
usual) some things to be careful of:

1. Some aspects of the C language are unordered. For example,
in the expression "f(x) + g(y)", the order in which f and g are
called is not defined; the object code is allowed to use either
order or even to interleave the computations.

2. Compilers are permitted to use the "as-if" rule. That is, a
compiler can emit whatever code it likes for normal accesses,
as long as the results of a single-threaded execution appear
just as if the compiler had followed all the relevant rules.
To see this, compile with a high level of optimization and run
the debugger on the resulting binary.

3. If there is only one variable but multiple CPUs, that variable
must be properly aligned and all accesses to that variable must
be full sized. Variables that straddle cachelines or pages void
your full-ordering warranty, as do undersized accesses that load
from or store to only part of the variable.

4. If there are multiple CPUs, accesses to shared variables should
use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
tearing, load/store fusing, and invented loads and stores.
There are exceptions to this rule, including:

i. When there is no possibility of a given shared variable
being updated by some other CPU, for example, while
holding the update-side lock, reads from that variable
need not use READ_ONCE().

ii. When there is no possibility of a given shared variable
being either read or updated by other CPUs, for example,
when running during early boot, reads from that variable
need not use READ_ONCE() and writes to that variable
need not use WRITE_ONCE().