Re: [PATCH 00/14] alpha: cleanups for 6.10
From: Linus Torvalds
Date: Wed May 29 2024 - 21:09:18 EST
On Wed, 29 May 2024 at 11:50, Maciej W. Rozycki <macro@xxxxxxxxxxx> wrote:
>
> The only difference here is that with
> hardware read-modify-write operations atomicity for sub-word accesses is
> guaranteed by the ISA, however for software read-modify-write it has to be
> explicitly coded using the usual load-locked/store-conditional sequence in
> a loop.
I have some bad news for you: the old alpha CPUs not only screwed up
the byte/word design, they _also_ screwed up the
load-locked/store-conditional.
You'd think that LL/SC would be done at a cacheline level, like any
sane person would do.
But no.
The 21064 actually did atomicity with an external pin on the bus, the
same way people used to do before caches even existed.
Yes, it has an internal L1 D$, but it is a write-through cache, and
clearly things like cache coherency weren't designed for. In fact,
LL/SC is even documented to not work in the external L2 cache
("Bcache" - don't ask me why the odd naming).
So LL/SC on the 21064 literally works on external memory.
Quoting the reference manual:
"A.6 Load Locked and Store Conditional
The 21064 provides the ability to perform locked memory accesses through
the LDxL (Load_Locked) and STxC (Store_Conditional) cycle command pair.
The LDxL command forces the 21064 to bypass the Bcache and request data
directly from the external memory interface. The memory interface logic must
set a special interlock flag as it returns the data, and may optionally keep the
locked address"
End result: a LL/SC pair is very very slow. It was incredibly slow
even for the time. I had benchmarks, I can't recall them, but I'd like
to say "hundreds of cycles". Maybe thousands.
So actual reliable byte operations are not realistically possible on
the early alpha CPUs. You can do them with LL/SC, sure, but
performance would be so horrendously bad that it would be just sad.
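To illustrate what "do them with LL/SC" means in practice, here is a
minimal sketch of a byte store built on a word-sized atomic. It uses a
C11 compare-exchange retry loop as a portable stand-in for the
LDx_L/STx_C loop. The helper name store_byte_atomic and the
little-endian byte numbering are assumptions made for illustration,
not anything taken from the kernel or from the DEC manuals.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical helper: atomically replace byte `idx` (0..3, counted from
 * the least significant byte) inside a 32-bit word. The compare-exchange
 * loop plays the role of the load-locked/store-conditional retry loop.
 */
static void store_byte_atomic(_Atomic uint32_t *word, unsigned idx, uint8_t val)
{
    assert(idx < 4);

    unsigned shift = idx * 8;
    uint32_t mask = (uint32_t)0xff << shift;
    uint32_t old = atomic_load_explicit(word, memory_order_relaxed);
    uint32_t next;

    do {
        /* Rebuild the word from the latest observed value on every retry. */
        next = (old & ~mask) | ((uint32_t)val << shift);
    } while (!atomic_compare_exchange_weak_explicit(word, &old, next,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
}
```

Every single byte store becomes a full locked round trip plus a
possible retry, which is exactly the part that goes out to external
memory on the 21064.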
The 21064A had some "fast lock" mode which allowed the data from the
LDQ_L to come from the Bcache. So it still wasn't exactly fast, and it
still didn't work at CPU core speeds, but at least it worked with the
external cache.
Compilers will generate the sequence that DEC specified, which isn't
thread-safe.
In fact, it's worse than "not thread safe". It's not even safe on UP
with interrupts, or even signals in user space.
It's one of those "technically valid POSIX" things, since there's
"sig_atomic_t" and if you do any concurrent signal stuff you're
supposed to only use that type. But it's another of those "Yeah, you'd
better make sure your structure members are either 'int' or bigger, or
never accessed from signals or interrupts, or they might clobber
nearby values".
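The clobbering is easy to show with a small single-threaded
simulation. This is only a sketch under assumptions: the function
names are made up, the "interrupt" is just a statement placed between
the load and the store, and the byte store is modeled as load the
quadword, patch one byte in a register, store the quadword back,
rather than as the actual instruction sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Return `word` with byte `idx` (0..7, least significant first) set to `val`. */
static uint64_t insert_byte(uint64_t word, unsigned idx, uint8_t val)
{
    assert(idx < 8);

    unsigned shift = idx * 8;
    return (word & ~((uint64_t)0xff << shift)) | ((uint64_t)val << shift);
}

/*
 * Hypothetical scenario: the main code stores 0xAA to byte 0, and an
 * "interrupt" that stores 0xBB to byte 1 lands between the load and the
 * store of the main code. Returns the final memory word.
 */
static uint64_t interrupted_byte_store(uint64_t mem)
{
    uint64_t tmp = mem;               /* main code: load the whole quadword      */
    tmp = insert_byte(tmp, 0, 0xAA);  /* main code: patch byte 0 in a register   */

    mem = insert_byte(mem, 1, 0xBB);  /* "interrupt": complete store to byte 1   */

    mem = tmp;                        /* main code: write back the stale copy    */
    return mem;
}
```

Starting from a zeroed word, byte 0 ends up as 0xAA and byte 1 is back
to zero: the neighbor's update is silently lost, with no second CPU
involved.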
Linus