Re: framebuffer corruption due to overlapping stp instructions on arm64
From: Matt Sealey
Date: Fri Aug 03 2018 - 16:44:49 EST
On 3 August 2018 at 13:25, Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
>
>
> On Fri, 3 Aug 2018, Ard Biesheuvel wrote:
>
>> Are we still talking about overlapping unaligned accesses here? Or do
>> you see other failures as well?
>
> Yes - it is caused by overlapping unaligned accesses inside memcpy. When I
> put "dmb sy" between the overlapping accesses in
> glibc/sysdeps/aarch64/memcpy.S, this program doesn't detect any memory
> corruption.
It is a symptom of generating reorderable accesses inside memcpy. It's nothing
to do with alignment, per se (see below). A dmb sy just hides the symptoms.
What we're talking about here - yes, Ard, within certain amounts of reason -
is that you cannot use PCI BAR memory as 'Normal' - certainly never cacheable
memory, but Normal NC isn't good either. That is, your CPU cannot post writes
or reads towards PCI memory spaces unless it is dealing with them as Device
memory or making very strictly controlled use of Normal Non-Cacheable.
I understand why the rest of the world likes to mark stuff as 'writecombine,'
but that's an x86-ism, not an Arm memory type.
There is potential for accesses to the same slave from different masters (or
just different AXI IDs, most cores rotate over 8 or 16 or so for Normal memory
to achieve) to be reordered. PCIe has no idea what the source was, it will
just accept them in the order it receives them, and also it will be strictly
defined to manage incoming AXI or ACE transactions (and barriers..) in a way
that does not violate the PCIe memory model - the worst case is deadlocks, the
best case is you see some very strange behavior.
In any case the original ordering of two Normal-NC transactions may not make
it to the PCIe bridge in the first place, which is probably why a DMB resolves
it - it will force the core to issue them in order and it's likely that,
unless there is some hyper-complex multi-pathing going on, they'll stay
ordered. If you MUST preserve the order between two Normal memory accesses, a
barrier is required. The same is true also of any re-orderable device access.
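To make that last point concrete, here is a minimal C sketch of two stores
that must stay in order. It is an illustration only, not code from glibc or
the kernel: order_stores() and store_pair_ordered() are made-up names, it
assumes GCC or Clang, and on anything other than arm64 it falls back to a
generic fence purely so that it still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: full-system barrier on arm64, a sequentially
 * consistent fence elsewhere (assumption: GCC/Clang builtins available). */
static inline void order_stores(void)
{
#if defined(__aarch64__)
    __asm__ volatile("dmb sy" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/* Hypothetical helper: write a to p[0] and then b to p[1], with a barrier
 * in between so the second store is not issued ahead of the first. */
static void store_pair_ordered(volatile uint64_t *p, uint64_t a, uint64_t b)
{
    assert(p != NULL);
    p[0] = a;
    order_stores();
    p[1] = b;
}
```

Note that volatile only stops the compiler from merging or reordering the two
stores; it is the barrier that constrains what the core issues.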
>> > I tried to run it on system RAM mapped with the NC attribute and I didn't
>> > get any corruption - that suggests the bug may be in the PCIE
>> > subsystem.
Pure fluke.
I'll give a simple explanation. The Arm Architecture defines single-copy and
multi-copy atomic transactions. You can treat 'single-copy' to mean that that
transaction cannot be made partial, or reordered within itself, i.e. it must
modify memory (if it is a store) in a single swift effort and any future reads
from that memory must return the FULL result of that write.
Multi-copy means it can be resized and reordered a bit. Will Deacon is going
to crucify me for simplifying it, but.. let's proceed with a poor example:
STR X0,[X1] on a 32-bit bus cannot ever be single-copy atomic, because you
cannot write 64 bits of data on a 32-bit bus in a single, unbreakable
transaction. This is because from one bus cycle to the next, one half of the
transaction will be in a different place. Your interconnect will have latched
and buffered 32 bits and the CPU is holding the other.
STP X0, X1, [X2] on a 64-bit bus can be single-copy atomic with respect to the
element size. But it is on the whole multi-copy atomic - that is to say that
it can provide a single transaction with multiple elements which are
transmitted, and those elements could be messed with on the way down the pipe.
On a 128-bit bus, you might expect it to be single-copy atomic because the
entire transaction can be fit into one single data beat, but *it is most
definitely not* according to the architecture. The data from X0 and X1 may be
required to be stored at *X2 and *(X2+8), but the architecture doesn't care
which one is written first. Neither does AMBA.
STP is only ever guaranteed to be single-copy atomic with regards to the
element size (which is the X register in question). If you swap the data
around, and do STP X1, X0, [X2] you may see a different result dependent on
how the processor decides to pull data from the register file and in what
order. Users of the old 32-bit ARM STM instruction will recall that it writes
the register list in incrementing order, lowest register number to lowest
address, so what is the solution for STP? Do you expect X0 to be emitted on
the bus first or the data to be stored in *X2?
It's neither!
That means you can do an STP on one processor and an LDR of one of the 64-bit
words on another processor, and you may be able to see:
a) None of the STP transaction
b) *X2 is written with the value in X0, but *(X2+8) is not holding the value in X1
c) b, only reversed
d) What you expect
And this can change dependent on the resizers and bridges and QoS and paths
between a master interface and a slave interface. Although a truly single-copy
atomic transaction going through a downsizer to smaller than the transaction
size is a broken system design, it may be allowable if the downsizer hazards
addresses to the granularity of the larger bus size on the read and write
channels and will stall the read until the write has committed at least to a
buffer, or downstream of the downsizer, so that it will return on read the
full breadth of the memory update.... that's down to the system designer.
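The cases a) to d) listed earlier can be written out as a tiny C model. This
is purely illustrative: struct pair and stp_observable_states() are invented
names, and it models only what is said above (each 64-bit element is
single-copy atomic, the pair as a whole is not), not any real core or
interconnect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the two 64-bit words at *X2 and *(X2+8). */
struct pair {
    uint64_t lo;   /* word at *X2     */
    uint64_t hi;   /* word at *(X2+8) */
};

/* Fill out[] with every state another observer may see while an STP of
 * `next` lands on memory that held `old`. Each element is either fully old
 * or fully new (never torn), but the two elements update independently.
 * Returns the number of states written, which is always 4. */
static size_t stp_observable_states(struct pair old, struct pair next,
                                    struct pair out[4])
{
    size_t n = 0;

    assert(out != NULL);
    for (int lo_done = 0; lo_done <= 1; lo_done++) {
        for (int hi_done = 0; hi_done <= 1; hi_done++) {
            out[n].lo = lo_done ? next.lo : old.lo;
            out[n].hi = hi_done ? next.hi : old.hi;
            n++;
        }
    }
    return n;
}
```

In the order the loops produce them, the states are: nothing written (a),
only the upper word written (c), only the lower word written (b), and both
written (d).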
There are plenty of places things like this can happen - in cache controllers,
for example, and merging store buffers (you may have a 256 bit or 512 bit
buffer, but only a 128-bit memory interface).
Neither memcpy() as a function nor the loads and stores it makes are
single-copy atomic; no transactions need to be with Normal memory, so that
merged stores and linefills (if cacheable) can be done. Hence, your memcpy()
is just randomly chucking whatever data it likes to the bus and they'll arrive
in any old order. 'writecombine' semantics make you think you'll only ever see
one very large write with all the CPU activity merged together - also NOT
true.
And the granularity of the hazarding in your system, from the CPU store buffer
to the bus interface to the interconnect buffering to the PCIe bridge to the
PCIe EP is.. what? Not the same all the way down, I'll bet you.
It is assuming that Intel writecombine semantics would apply, which to be
truthful are NO different to the ones of a merging store buffer in an Arm
processor (Intel architecture states that the writecombine buffer can be
flushed at any time with any amount of actual data, it might not be the
biggest burst you can imagine), but in practice it tends to be in cache-line
sized chunks with strict incrementing order, and subsequent writes, due to the
extremely large pipeline and queueing, will be absorbed by the writecombine
buffer almost with guarantee.
Links is broken. Even on Intel. If you overlap memory transactions and expect
them to be gathered and reordered to produce nice, ordered non-overlapping
streaming transactions you'll be sorely disappointed when they aren't, which
is what is happening here. The fix is to use barriers - and don't rely on
single-copy atomicity (which is the only saving feature that would not require
you to use a barrier) since this is a situation where absolutely none is
afforded.
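A different way to avoid depending on any of that is to never emit overlapping
stores in the first place. The sketch below is an assumption-laden
illustration, not what glibc's memcpy does: copy_no_overlap() is a
hypothetical name, and it assumes the destination may legally be accessed as
uint64_t (for instance a mapped BAR, or a uint64_t-backed buffer). It writes
single bytes until the destination is 8-byte aligned, then aligned 64-bit
words, then single bytes for the tail, so no byte is ever stored twice. It
adds no ordering by itself; where order matters a barrier is still needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes without ever issuing two stores that
 * cover the same destination byte. */
static void copy_no_overlap(volatile void *dst, const void *src, size_t n)
{
    volatile uint8_t *d = dst;
    const uint8_t *s = src;

    assert(dst != NULL && src != NULL);

    /* Head: single bytes until the destination is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 7) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: one aligned 64-bit store per 8 bytes. */
    while (n >= 8) {
        uint64_t v;

        memcpy(&v, s, sizeof v);        /* source may be unaligned */
        *(volatile uint64_t *)d = v;
        d += 8;
        s += 8;
        n -= 8;
    }

    /* Tail: the remaining 0..7 bytes, one at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

The trade-off is more, smaller stores at the edges of each copy compared with
the overlapping-store trick.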
It'd be easier to cross your fingers that the PCIe RC has a coherent master
port (ACE-Lite or something fancier) and can snoop into CPU caches. Then you
can mark a memory location in DRAM as Normal Inner/Outer Cacheable Writeback,
Inner/Outer Shareable, Write-allocate, read-allocate, and you won't even
notice your CPU doing any memory writes, but yes if you tell a graphics
adapter that its main framebuffer is in DRAM it might be a bit slower (to the
speed of the PCIe link.. which may affect your maximum resolution in some
really strange circumstances). If it cannot use a DRAM framebuffer then I'd
have to wonder why not.. every PCI graphics card I ever used could take any
base address and the magic of PCI bus mastering would handle it. This is no
different to how you'd use DRAM as texture memory.. phenomenally slowly, but
without having to worry about any ordering semantics (except you should flush
your data cache to PoC at the end of every frame).
Ta,
Matt