Re: [PATCH] PCI: rcar-gen4: Limit Max_Read_Request_Size and Max_Payload_Size to 256 Bytes

From: Marek Vasut

Date: Tue May 12 2026 - 23:09:49 EST


On 5/11/26 4:20 PM, Koichiro Den wrote:

Hello Den-san,

2. Did you also happen to test V4H/V4M in endpoint (EP) mode, with the local
eDMA engine issuing MRd requests toward host memory?

I was not able to test this configuration.

Is it possible to perform this test with a single device, by having the eDMA
do local-memory-read-to-local-memory-write transfers, maybe using
PIPE_LOOPBACK/LOOPBACK_ENABLE bits, or do I need two devices with NTB
connection between them ?

In case it is the latter, could you please briefly describe the S4 NTB setup
you use, so I could try to replicate it locally ?

My setup was a two-board setup:

S4 Spider as RC <-> S4 Spider as EP, connected with OCuLink.

It is unfortunately not a small standalone reproducer. The setup was based on
the following RFC v4 series:

[RFC PATCH v4 00/38] NTB transport backed by PCI EP embedded DMA
https://lore.kernel.org/all/20260118135440.1958279-1-den@xxxxxxxxxxxxx/

In particular, the workaround patch I used in the RFC series was:

[RFC PATCH v4 31/38] NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B for R-Car)
https://lore.kernel.org/all/20260118135440.1958279-32-den@xxxxxxxxxxxxx/

Note that in that workaround I only capped MRRS (i.e. I did not add an MPS cap).
At least in that setup, avoiding 256B MRd requests was enough to make the
visible corruption disappear.

I have been investigating MPSS/MPS/MRRS a bit deeper. Until today, and one more nudge from Manivannan in the MPSS direction, I had not made the connection between the last observation in your previous email (which already hinted at what the issue might be), the MPSS bitfield, and TYPE00 and TYPE01 accesses. Thank you both for those two items.

It seems that for S4, the latest documentation (Rev.1.30) indicates EXPCAP1F MPSS as read-write and configurable between 128B and 256B for TYPE00 (EP) access, but read-only and fixed at 128B for TYPE01 (RC) access.

If the S4 PCIe in RC mode is only capable of 128B long TLPs and in EP mode is capable of 128B or 256B long TLPs, this might explain why you observe corruption with 256B long TLPs between two S4 Spiders. The S4 Spider in EP mode might work just fine with another RC which can do 256B long TLPs.

I still do not understand one more observation -- if I configure V4H PCIe as RC and read out the EXPCAP1F register MPSS field, it reads as 256B (value 3'b001). I would expect the EXPCAP1F register MPSS field to read out as the (default) 128B in this RC case. The V4H documentation indicates EXPCAP1F MPSS as read-ONLY and fixed at 256B for TYPE00 (EP) access, but read-WRITE and set to 128B for TYPE01 (RC) access, which I think might be a documentation issue. I also do not rewrite the EXPCAP1F MPSS in any way.
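
For reference, a minimal sketch of how the size fields discussed here decode to bytes. It assumes EXPCAP1F follows the standard PCI Express Device Capabilities layout (MPSS in bits 2:0); the helper names are hypothetical and are not taken from the rcar-gen4 driver. The encoding 3'b001 then decodes to 256B, matching the readout above.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions per the PCI Express Capability structure. */
#define DEVCAP_MPSS_MASK   0x00000007u /* Device Capabilities, bits 2:0 */
#define DEVCTL_MPS_MASK    0x00e0u     /* Device Control, bits 7:5 */
#define DEVCTL_MPS_SHIFT   5
#define DEVCTL_MRRS_MASK   0x7000u     /* Device Control, bits 14:12 */
#define DEVCTL_MRRS_SHIFT  12

/* MPSS, MPS and MRRS share one encoding: 0 -> 128B, 1 -> 256B, ..., 5 -> 4096B. */
static unsigned int pcie_size_field_to_bytes(unsigned int field)
{
	assert(field <= 5); /* encodings 6 and 7 are reserved */
	return 128u << field;
}

/* Hypothetical helper: supported payload size from a DevCap-style value. */
static unsigned int devcap_mpss_bytes(uint32_t devcap)
{
	return pcie_size_field_to_bytes(devcap & DEVCAP_MPSS_MASK);
}

/* Hypothetical helper: programmed payload size from a DevCtl value. */
static unsigned int devctl_mps_bytes(uint16_t devctl)
{
	return pcie_size_field_to_bytes((devctl & DEVCTL_MPS_MASK) >> DEVCTL_MPS_SHIFT);
}

/* Hypothetical helper: programmed read request size from a DevCtl value. */
static unsigned int devctl_mrrs_bytes(uint16_t devctl)
{
	return pcie_size_field_to_bytes((devctl & DEVCTL_MRRS_MASK) >> DEVCTL_MRRS_SHIFT);
}
```

Whether the value read through the RC (TYPE01) view reflects the real hardware limit is exactly the open question, so this only decodes what the register reports.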

If the V4H is also capable of only 128B TLPs in RC mode, then this patch would require additional adjustment and would have to limit TLP length based on configuration -- 128B for RC, 256B for EP.
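
A rough sketch of what such a mode-dependent limit could look like, purely for illustration. The 128B RC cap and 256B EP cap are the unconfirmed hypothesis from this email, and the enum and function names are hypothetical; this is not the patch itself.

```c
#include <assert.h>

/* Hypothetical controller mode selector. */
enum pcie_mode { PCIE_MODE_RC, PCIE_MODE_EP };

/* ASSUMPTION (unconfirmed, pending documentation): 128B in RC mode, 256B in EP mode. */
static unsigned int tlp_cap_bytes(enum pcie_mode mode)
{
	return mode == PCIE_MODE_RC ? 128u : 256u;
}

/*
 * Clamp a requested MPS or MRRS (in bytes) to the per-mode cap and return
 * the encoded field value (0 -> 128B, 1 -> 256B, ...). Sizes that are not
 * a power of two are rounded down.
 */
static unsigned int clamp_size_field(unsigned int requested_bytes, enum pcie_mode mode)
{
	unsigned int cap = tlp_cap_bytes(mode);
	unsigned int bytes = requested_bytes < cap ? requested_bytes : cap;
	unsigned int field = 0;

	assert(bytes >= 128); /* 128B is the smallest encodable size */
	while ((128u << (field + 1)) <= bytes)
		field++;
	return field;
}
```

With these assumed caps, a 512B request clamps to field 0 (128B) in RC mode and to field 1 (256B) in EP mode.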

I will now ask for documentation clarification.

At a high level, the EP side exposes the vNTB endpoint function, and the RC side
uses the NTB data path which is backed by the EP-local eDMA through that vNTB
function. For the RC-to-EP data path, the EP-local eDMA acts as the requester:
it issues MRd requests toward remote RC memory, receives the CplD payloads, and
writes the data into EP-side memory. In other words, this is a DMA read transfer
from the point of view of the EP-local eDMA.

I understand. If the S4 EP has MPSS set to 256 Bytes (and possibly also MPS), but the S4 RC may (*) be limited to MPSS and MPS 128 Bytes, I wonder if the MRd from the EP-local DMA sent to RC might be causing malfunction on the RC side.

(*) to be determined, I will ask.

I have not tried PIPE_LOOPBACK/LOOPBACK_ENABLE. Given how heavy the setup
described above is, I am not asking you to reproduce the whole thing just for
this patch. Also, I do not want this NTB/eDMA observation to block your v2. For
now, please treat it as a separate observation from the RC/NVMe issue. I will
continue the investigation on my side and let you know if I can narrow down
where the corruption occurs.

I very much appreciate your input, and in light of it, I believe this patch does need an update.

As for local OCuLink setup options, I already had a closer look as well.

Your commit message
describes an NVMe device as the requester, but I'm wondering whether the same
256B limit was also verified for the R-Car EP DMA requester path.

This part I currently can not answer, I'm sorry.

...

I made the following two observations in the meantime.

First, I wrote 4 GiB of random data to two SSDs, a Crucial P5 Plus SSD
without HMPRE (without host memory buffer) and an XPG GAMMIX P55 with HMPRE
(with host memory buffer), on another system (iMX8M Plus, ARM64, also with a
DWC PCIe controller). Then I read the data back and compared it; the written
and read-back data matched.

Then I plugged both SSDs into V4H Sparrow Hawk _without_ this patch, and I
read the data back:

- Crucial P5 Plus SSD without HMPRE (without host memory buffer)
-> Data read back matches data written on iMX8M Plus, OK
- XPG GAMMIX P55 with HMPRE (with host memory buffer)
-> Data read back matches data written on iMX8M Plus, OK

Then I wrote 512 Bytes of data into the Crucial P5 Plus SSD without HMPRE on
V4H Sparrow Hawk and did read back again.
-> Data read back does NOT match data written, NG

That would indicate that:
- WRITE transfers from SSD to DRAM are OK
- READ transfers from DRAM to SSD are corrupted at the 256 Byte boundary

That would indicate that we need _at_least_ the 256 Bytes limit, likely on
both MPS and MRRS.
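
The read-back comparison described above can be sketched as follows. This is an illustrative helper, not the tooling actually used for the test; the function names are hypothetical. It reports the first differing byte and whether that offset coincides with a 256 Byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the offset of the first differing byte, or len if the buffers match. */
static size_t first_mismatch(const uint8_t *written, const uint8_t *readback, size_t len)
{
	size_t i;

	assert(written && readback);
	for (i = 0; i < len; i++)
		if (written[i] != readback[i])
			return i;
	return len;
}

/* True if the offset sits exactly on a 256 Byte boundary. */
static int is_256b_aligned(size_t offset)
{
	return (offset % 256) == 0;
}
```

For a 512 Byte test pattern, a first mismatch at offset 256 would point at the second 256 Byte block.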

Second, I got a report of another SSD for which this patch is not
sufficient. I currently do not have access to that SSD, but I will ask for
access and investigate. That may shed some light on the 128 Byte limit
below.

Thank you for sharing these observations.
Interesting, that second point may indeed help determine whether my 128B
observation in the past is related to the same underlying issue, or is a purely
eDMA/NTB-specific one.

Could you please have a look at the beginning of this email too ? I wonder if the TYPE00/TYPE01 accesses might have different TLP size limitations.

(*) The background for my question 2:

I only have access to S4 Spider boards. In my RC <-> EP setup, where the EP
side uses the local eDMA engine to issue MRd requests toward the RC, 256-byte
MRd requests still appear to corrupt the transferred data.

Is the corruption deterministic in some way, i.e. are the same bytes of the
transferred data corrupted every time, or is the corruption "random" ?

The exact corrupted values were not deterministic, but the offsets where the
corruption occurred were fairly consistent.

Let me quote from my earlier RFC patch:
(https://lore.kernel.org/all/20260118135440.1958279-32-den@xxxxxxxxxxxxx/)

[...]
* On some R-Car platforms using the Synopsys DWC PCIe + eDMA we
* observe data corruption on RC->EP Remote DMA Read paths whenever
* the EP issues large MRd requests. The corruption consistently
* hits the tail of each 256-byte segment (e.g. offsets
* 0x00E0..0x00FF within a 256B block, and again at 0x01E0..0x01FF
* for larger transfers).
[...]
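
To check for the pattern quoted above on a destination buffer, one could summarize where mismatches fall within each 256-byte block. The sketch below is illustrative only; the struct and function names are hypothetical and do not come from the RFC series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mismatch_span {
	size_t count;    /* number of differing bytes */
	unsigned int lo; /* lowest offset within a 256B block, valid if count > 0 */
	unsigned int hi; /* highest offset within a 256B block, valid if count > 0 */
};

/* Compare two buffers and fold every mismatch offset into its 256B block. */
static struct mismatch_span mismatch_span_in_256b(const uint8_t *expected,
						  const uint8_t *actual, size_t len)
{
	struct mismatch_span s = { 0, 255, 0 };
	size_t i;

	assert(expected && actual);
	for (i = 0; i < len; i++) {
		unsigned int in_block = (unsigned int)(i % 256);

		if (expected[i] == actual[i])
			continue;
		s.count++;
		if (in_block < s.lo)
			s.lo = in_block;
		if (in_block > s.hi)
			s.hi = in_block;
	}
	return s;
}
```

If the corruption matches the description, a 512-byte transfer would yield lo = 0xE0 and hi = 0xFF, with mismatches confined to the tail of both blocks.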

I see.

Does the corruption happen even on singular MRd transfer, or does it happen
only when a lot of traffic is sent across the NTB link? I wonder if this
corruption might be DRAM bandwidth related, i.e. whether the DMA does
possibly saturate the DRAM controller with write requests and make the
system run out of DRAM bandwidth.

It occurred even with a single eDMA read transfer. It was not a symptom only
observable under high load.

That rules out my hypothesis that this might be link stability related, or memory or interconnect pressure related. Thank you for this input.

With the following
change on top of your patch, my DMA-read tests become stable:

[...]

One detail which might be important is that limiting only MPS does not appear
to be sufficient in my setup. MPS=128B with MRRS=256B still seems broken,
while MPS=128B with MRRS=128B works fine. I wonder whether this is because
the "MPS" term in the min(MRRS, MPS) limit for DMA read transfers may
effectively be tied to the DMA read buffer segment size / MPSS rather than
only to DevCtl.MPS. I'm not sure about this yet though.

I think setting MPS=128B MRRS=256B only leads to the transfer being split
into 2 x 128B TLPs sent across the PCIe link, but in the end, 2 x 128 Bytes
of data are received (in some order) into the read segment buffer and
reordered, and 1 x 256 Bytes are written from read segment buffer into the
memory as a single write.

In case of MPS=256B MRRS=256B, only one 256B TLP is sent across the link, 1
x 256 Bytes of data are received into the read segment buffer with no
reordering necessary, and 1 x 256 Bytes are still written from read segment
buffer into the memory as a single write.

=> For MRRS=256B with either MPS=128B or MPS=256B, there is a difference
in the transfer format between PCIe and DMA, but there is no
difference between DMA and DRAM.

But in case of MRRS=128B and transfer of 256 Bytes, 2 x 128 Bytes of data
are received into (separate? (*)) entries in read segment buffer, and 2 x
128 Bytes are written from (separate?) entries in read segment buffer into
the memory as two separate writes. Could this different memory write
pattern be responsible for the (lack of) corruption ?
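
The three cases above can be put into numbers with a simplified model. It assumes an aligned transfer and counts only the minimum number of completions per request; a real completer may split completions further at Read Completion Boundary granularity. The names are hypothetical.

```c
#include <assert.h>

struct read_split {
	unsigned int mrd_requests; /* MRd TLPs issued by the requester */
	unsigned int cpld_per_mrd; /* minimum CplD TLPs returned per MRd */
};

static unsigned int div_round_up(unsigned int n, unsigned int d)
{
	return (n + d - 1) / d;
}

/*
 * Simplified model: each MRd asks for at most MRRS bytes, and each CplD
 * carries at most MPS bytes. Sizes are in bytes.
 */
static struct read_split split_read(unsigned int len, unsigned int mps, unsigned int mrrs)
{
	struct read_split r;
	unsigned int per_request;

	assert(len && mps && mrrs);
	per_request = len < mrrs ? len : mrrs;
	r.mrd_requests = div_round_up(len, mrrs);
	r.cpld_per_mrd = div_round_up(per_request, mps);
	return r;
}
```

For a 256 Byte transfer this gives 1 MRd with 2 CplD for MPS=128B/MRRS=256B, 1 MRd with 1 CplD for MPS=256B/MRRS=256B, and 2 MRd with 1 CplD each for MRRS=128B. Only the last case changes the number of read requests, and with it possibly the number of read segment buffer entries involved.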

Do you know whether the data are corrupted on the PCIe-to-DMA side (when the
data are received from the PCIe side and written into the read buffer
segment) or on the DMA-to-DRAM side (on read from read segment buffer or on
write into DRAM) ?

Unfortunately I cannot distinguish these from software alone. I only observed
the final destination buffer contents after the eDMA read transfer completed.

I understand.

(*) Since the read segment buffer has 16 x 256 Byte segments, with 16 DMA
tags and never more than 16 MRd requests in flight, I think it is likely
that the data for each MRd lands in a separate read segment buffer segment.
But this information comes from another datasheet, not the V4H one.

One more thing I noticed in the manuals:

R-Car S4 R19UH0161EJ0130 Rev.1.30 Jun. 16, 2025:
Type00 MPSS initial = 256B, PCI R, Internal R/W
Type01 MPSS initial = 128B, PCI R, Internal R

R-Car V4H R19UH0186EJ0130 Rev.1.30 Apr. 21, 2025
Type00 MPSS initial = 256B, PCI R, Internal R
Type01 MPSS initial = 128B, PCI R, Internal R/W

I'm still unsure, but this difference might be relevant. In particular, in
V4H/V4M RC mode your patch programs DevCtl.MPS to 256B, but does not change
Type01 MPSS. I wonder if the Type01 MPSS should also be updated to 256B first
on SoCs where the manual says it is writable from the internal bus, or if I'm
missing something here.

This is a very good point.

The R-Car S4 RM Rev.1.20 lists Type00 MPSS as Internal R and Type01 MPSS as
Internal R/W. This was updated in RM Rev.1.30 to Type 00 Internal R/W and
Type 01 Internal R. It is possible this change is going to be added into the
V4H RM in the future too. That would likely imply that Type01 MPSS is not
programmable.

I don't think Type1 affects RC operation, but does it affect NTB ?

I have no evidence that Type1 affects NTB either. It was just a speculative idea
based on the difference I saw in the manuals.

Your inference, i.e. that the S4 RM Rev.1.30 may reflect the intended access
attributes and the V4H RM may later get a similar correction, sounds reasonable
to me.

I had not checked the S4 RM Rev.1.20, so I missed that change. Thanks for
pointing it out.

I have now checked 4 S4, 10 V4H, and 6 V4M reference manual versions, and there are subtle changes between them. I have asked for clarification. If I learn anything, I will let you know.

I did not make the connection between your aforementioned observation, MPSS, and TYPE00 and TYPE01 accesses until today. I now realize there might be different TLP size limits for RC and EP modes.

[...]

Thank you for your help!

Thank you for investigating this and for the very helpful analysis.
I will let you know if I find anything more.

Likewise, thank you for your help !

--
Best regards,
Marek Vasut