Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency

From: Eric W. Biederman
Date: Thu Apr 10 2025 - 00:40:34 EST


"Maciej W. Rozycki" <macro@xxxxxxxxxxx> writes:

> On Wed, 9 Apr 2025, Eric W. Biederman wrote:
>
>> >> So unless you actually *see* the unaligned faults, I really think you
>> >> shouldn't emulate them.
>> >>
>> >> And I'd like to know where they are if you do see them
>>
>> I was nerd sniped by this so I took a look.
>>
>> I have a distinct memory that even the ipv4 stack can generate unaligned
>> loads. Looking at the code in net/ipv4/ip_input.c:ip_rcv_finish_core
>> there are several unprotected accesses to iph->daddr.
>>
>> Which means that if the lower layers ever give something that is not 4
>> byte aligned for ipv4 just reading the destination address will be an
>> unaligned read.
>>
>> There are similar unprotected accesses to the ipv6 destination address
>> but it is declared as an array of bytes. So that address can not
>> be misaligned.
>>
>> There is a theoretical path through 802.2 that adds a 3 byte sap
>> header that could cause problems. We have LLC_SAP_IP defined
>> but I don't see anything calling register_8022_client that would
>> be needed to hook that up to the ipv4 stack.
>>
>> As long as the individual ethernet drivers have the hardware deliver
>> packets 2 bytes into an aligned packet buffer the 14 byte ethernet
>> header will end on a 16 byte aligned location, and I don't think there
>> is a way to trigger unaligned behavior with ipv4 or ipv6.
>>
>> Hmm. Looking at appletalk, it appears to be built on top of SNAP.
>> So after the ethernet header processing the code goes through
>> net/llc/llc_input.c:llc_rcv and then net/802/snap_rcv before
>> reaching any of the appletalk protocols.
>>
>> I think the common case for llc would be 3 bytes + 5 bytes for snap,
>> for 8 bytes in the common case. But the code seems to be reading
>> 4 or 5 bytes for llc so I am confused. In either case it definitely
>> appears there are cases where the ethernet headers before appletalk
>> can be an odd number of bytes which has the possibility of unaligning
>> everything.
>>
>> Both of the appletalk protocols appear to make unguarded 16bit reads
>> from their headers. So having a buffer that is only 1 byte aligned
>> looks like it will definitely be a problem.
>
> Thank you for your analysis, really insightful.
>
>> > FWIW, all the major architectures that have variants without
>> > unaligned load/store (arm32, mips, ppc, riscv) trap and emulate
>> > them for both user and kernel access for normal memory, but
>> > they don't emulate it for atomic ll/sc type instructions.
>> > These instructions also trap and kill the task on the
>> > architectures that can do hardware unaligned access (x86
>> > cmpxchg8b being a notable exception).
>
> But all those architectures have 1-byte and 2-byte memory access machine
> instructions as well, and consequently none requires an RMW sequence to
> update such data quantities that implies the data consistency issue that
> we have on non-BWX Alpha.
>
>> I don't see anything that would get atomics involved in the networking
>> stack. No READ_ONCE on packet data or anything like that. I believe
>> that is fairly fundamental as well. Whatever is processing a packet is
>> the only code processing that packet.
>>
>> So I would be very surprised if the kernel needed emulation of any
>> atomics, just emulation of normal unaligned reads. I haven't looked to
>> see if the transmission paths do things that will result in unaligned
>> writes.
>
> The problem we have on the non-BWX Alpha target is that hardware has no
> memory access instructions narrower than 4 bytes. Consequently to write a
> 1- or 2-byte quantity an RMW instruction sequence is required, in the way
> of reading the whole 4-byte quantity, inserting the bytes to be modified,
> and writing the whole 4-byte quantity back to memory. However such a
> sequence is not safe for concurrent writes, as described below.
>
> A pair of concurrent RMW sequences targetting the same part of an aligned
> 4-byte data quantity is not an issue: it's just an execution race and
> software may be prepared for it (or otherwise either prevent the race via
> a mutex or alternatively use an atomic data type along with the associated
> accessors, which will move data locations in memory suitably apart).
>
> The issue is a pair of concurrent RMW sequences targetting different
> parts of the same aligned 4-byte data quantity: software can legitimately
> expect that writes to disjoint memory locations (e.g. adjacent struct
> members, except for bit-fields) won't affect each other. But here where a
> pair of such RMW sequences runs interleaved, the later write to one
> location will clobber the value written previously to the other. So we
> have a data race. Note that no atomicity is concerned here, we are
> talking plain memory writes, such as with ordinary assignments to regular
> variables in C code.
>
> So I have come up with a solution where such RMW sequences are actually
> emitted by GCC as an LDL_L/STL_C atomic access loop which ensures that no
> intervening write has changed the aligned 4-byte data quantity containing
> the 1- or 2-byte quantity accessed. This guarantees consistency of the
> part(s) of the aligned 4-byte data quantity *outside* the 1- or 2-byte
> quantity written. Atomicity is guaranteed by hardware as a side effect,
> but not a part of this Alpha/Linux psABI extension (i.e. not in our
> contract).
>
> For known-unaligned 2-byte quantities (such as packed structure members)
> the compiler knows that they may span 2 aligned 4-byte data quantities and
> produces two LDL_L/STL_C loops with suitable address adjustments and data
> masking. This still guarantees consistency of data *outside* the 2-byte
> quantity written. No atomicity is guaranteed, because parts of the 2-byte
> quantity may be stored by pieces (if the 2-byte quantity is in the middle
> of an aligned 4-byte quantity, then it'll be written twice).
>
> The problem is with the case where the compiler has been told to produce
> code to write an aligned 2-byte quantity, but at run time it turns out
> unaligned. Now we have to emulate the LDL_L and STL_C instructions of the
> atomic access loop or otherwise the code will crash.
>
> My approach for this scenario is simple: LDL_L emulation remembers the
> address accessed and data present in the 2 aligned 4-byte data quantities
> spanned, and STL_C emulation returns failure in the case of an address
> mismatch and otherwise uses two LDL_L/STL_C loops to load the 2
> aligned 4-byte data quantities by piece, compare each with data retrieved
> previously at LDL_L emulation time, returning failure in the case of a
> mismatch, insert the requested value and then store the resulting
> quantity. Again this guarantees consistency of the parts of the 2 aligned
> 4-byte data quantities *outside* the unaligned 2-byte quantity written.
> And again, no atomicity is guaranteed.
>
> So while there are no atomic operations in our code at the C language
> level, we get them sneaked in by the compiler under our feet to solve the
> data consistency issue. Now if we can ascertain the code paths concerned
> won't ever exercise concurrency, we could tell the compiler not to produce
> these atomics for 1-byte and 2-byte accesses, on a file-by-file or even
> function-by-function basis, but it seems to me like the very maintenance
> effort we want to avoid for a legacy platform. Whereas if we build the
> kernel with the atomics enabled universally, we won't have to be bothered
> with analysing individual cases (at performance cost, but that's assumed).
>
> I've left 8-byte data quantities out for clarity from the consideration
> above; they're used by the compiler as suitable and handled accordingly.
>
> Let me know if you find anything here unclear.

The emulation you are doing makes sense.

Just a few more points. I am not current but I have never seen
concurrency (inside of a packet) at the network layer.

I don't recall ever hearing that the write paths in the network stack
were a problem.

I suspect you can verify the write side fairly easily by simply
compiling in appletalk, opening a PF_APPLETALK socket, and sending a
message. If that doesn't trigger emulation I can't imagine any other
write path in the kernel will.

Eric