Re: [PATCH v1 25/29] cxl/amd: Enable Zen5 address translation using ACPI PRMT

From: Robert Richter
Date: Wed Jan 15 2025 - 10:05:36 EST


On 09.01.25 17:25:13, Gregory Price wrote:
> On Tue, Jan 07, 2025 at 03:10:11PM +0100, Robert Richter wrote:
> > Add AMD platform specific Zen5 support for address translation.
>
> Doing some testing here and I'm seeing some odd results, also noticing
> some naming inconsistencies
>
> >
> > +static u64 cxl_zen5_to_hpa(struct cxl_decoder *cxld, u64 hpa)
> > +{
>
> Function name is _to_hpa, but hpa is an argument?

Conversion is always done from (old) HPA to (new) HPA of the parent
port. Note that the HPA of the root port/host bridge is same as SPA.
Port's in between may have an own HPA range.

>
> Should be dpa as argument? Confusing to convert an hpa to an hpa.

We need to handle the decoder address ranges, the argument is always
the HPA range the decoder belongs to. The DPA is only on the device
side which is a different address range compared to the decoders. The
decoders do the interleaving arithmetic too and DPA range may be
different. E.g. the decoders may split requests to different endpoints
depending on the number of interleaving ways and endpoints have their
own (smaller) DPA address ranges then.

>
> ... snip ...
>
> > +#define DPA_MAGIC 0xd20000
> > + base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC);
> > + spa = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K);
> > + spa2 = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K - SZ_256);
>
> For two devices interleaved, the base should be the same, correct?

Same except for the interleaving offset, which is seen below (dev1
shows *100). At this stage we don't know the interleaving position of
the endpoint yet.

>
> example: 2 128GB devices interleaved/normalized:
>
> dev0: base(0xc051a40000) spa(0xc051a48000) spa2(0xc051a47e00)
> dev1: base(0xc051a40100) spa(0xc051a48100) spa2(0xc051a47f00)
>
> I believe these numbers are correct.

Looks good.

>
> (Note: Using PRMT emulation because I don't have a BIOS with this blob,
> but this is the same emulation i have been using for about 4 months now
> with operational hardware, so unless the translation contract changed
> and this code expects something different, it should be correct).
>
> ... snip ...
> > + len = spa - base;
> > + len2 = spa2 - base;
> > +
> > + /* offset = pos * granularity */
> > + if (len == SZ_16K && len2 == SZ_16K - SZ_256) {
> > + ways = 1;
> > + offset = 0;
> > + granularity = 0;
> > + pos = 0;
> > + } else {
> > + ways = len / SZ_16K;
> > + offset = spa & (SZ_16K - 1);
> > + granularity = (len - len2 - SZ_256) / (ways - 1);
> > + pos = offset / granularity;
> > + }
>
> the interleave ways and such calculate out correctly
>
> dev0: ways(0x2) offset(0x0) granularity(0x100) pos(0x0)
> dev1: ways(0x2) offset(0x100) granularity(0x100) pos(0x1)
>
> > +
> > + base = base - DPA_MAGIC * ways - pos * granularity;
> > + spa = base + hpa;
>
> DPA(0)
> dev0: base(0xc050000000) spa(0xc050000000)
> dev1: base(0xc050000000) spa(0xc050000000)
>
> DPA(0x1fffffffff)
> dev0: base(0xc050000000) spa(0xe04fffffff)
> dev1: base(0xc050000000) spa(0xe04fffffff)
>
> The bases seems correct, the SPAs looks suspect.

SPA range length must be 0x4000000000 (2x 128G). That is, upper SPA
must be 0x10050000000 (0xc050000000 + 0x4000000000 - 1). This one is
too short.

The decoder range lengths below look correct (0x2000000000), the
interleaving configuration should be checked for the decoders.

>
> dev1 should have a very different SPA shouldn't it?

No, the HPA range is calculated, not the DPA range. Both endpoints
have the same HPA range, it must be equal and this looks correct. In
the end we calculate the following here (see cxl_find_auto_decoder()):

hpa = cxld->hpa_range;
// endpoint's hpa range is zero-based, equivalent to:
// hpa->start = 0;
// hpa->end = range_len(&hpa) - 1;
base = hpa.start = port->to_hpa(cxld, hpa.start); // HPA(0)
spa = hpa.end = port->to_hpa(cxld, hpa.end)); // HPA(decoder_size - 1)

Again, the HPA is the address the decoder is programmed with. HPA
length is 0x2000000000 (spa - base + 1). The DPA range is (for 2 way)
half it's size. The PRM uses DPA to SPA, but we want to translate HPA
to SPA. That is we need the calculation for.

>
> > +
> > + /*
> > + * Check SPA using a PRM call for the closest DPA calculated
> > + * for the HPA. If the HPA matches a different interleaving
> > + * position other than the decoder's, determine its offset to
> > + * adjust the SPA.
> > + */
> > +
> > + dpa = (hpa & ~(granularity * ways - 1)) / ways
> > + + (hpa & (granularity - 1));
>
> I do not understand this chunk here, we seem to just be chopping the HPA
> in half to acquire the DPA. But the value passed in is already a DPA.
>
> dpa = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1))
> = 0xfffffffff

HPA is:

HPA = 2 * 0x2000000000 - 1 = 0x3fffffffff

Should calculate for a 2-way config to:

DPA = 0x1fffffffff.

Actual formula:

dpa = HPA div (granularity * ways) * granularity + HPA mod granularity
pos = (HPA mod (granularity * ways)) div granularity

Bits used (e.g. HPA length: 0x4000000000 = 2^38, ways: 2):

hpa = 00000000000000000000000000XXXXXXXXXXXXXXXXXXXXXXXXXXXXXYZZZZZZZZ
dpa = 000000000000000000000000000XXXXXXXXXXXXXXXXXXXXXXXXXXXXXZZZZZZZZ
pos = Y

With:

X ... base part of the address
Y ... interleaving position
Z ... address offset

For DPA the positional bits are removed.

>
> I don't understand why the DPA address is suddenly half (64GB boundary).

There is probably a broken interleaving config causing half the size
of total device mem.

>
> > + offset = hpa & (granularity * ways - 1) & ~(granularity - 1);
> > + offset -= pos * granularity;
> > + spa2 = prm_cxl_dpa_spa(pci_dev, dpa) + offset;
> > +
> > + dev_dbg(&cxld->dev,
> > + "address mapping found for %s (dpa -> hpa -> spa): %#llx -> %#llx -> %#llx base: %#llx ways: %d pos: %d granularity: %llu\n",
> > + pci_name(pci_dev), dpa, hpa, spa, base, ways, pos, granularity);
> > +
>
> This results in a translation that appears to be wrong:
>
> dev0:
> cxl decoder5.0: address mapping found for 0000:e1:00.0
> (dpa -> hpa -> spa): 0x0 -> 0x0 -> 0xc050000000
> base: 0xc050000000 ways: 2 pos: 0 granularity: 256
> cxl decoder5.0: address mapping found for 0000:e1:00.0
> (dpa -> hpa -> spa): 0xfffffffff -> 0x1fffffffff -> 0xe04fffffff
> base: 0xc050000000 ways: 2 pos: 0 granularity: 256
>
> dev1:
> cxl decoder6.0: address mapping found for 0000:c1:00.0
> (dpa -> hpa -> spa): 0x0 -> 0x0 -> 0xc050000000
> base: 0xc050000000 ways: 2 pos: 1 granularity: 256
> cxl decoder6.0: address mapping found for 0000:c1:00.0
> (dpa -> hpa -> spa): 0xfffffffff -> 0x1fffffffff -> 0xe04fffffff
> base: 0xc050000000 ways: 2 pos: 1 granularity: 256
>
> These do not look correct.
>
> Is my understanding of the PRMT translation incorrect?
> I expect the following: (assuming one contiguous CFMW)
>
> dev0 (dpa -> hpa -> spa): 0x0 -> 0x0 -> 0xc050000000
> dev1 (dpa -> hpa -> spa): 0x0 -> 0x100 -> 0xc050000100
> dev0 (dpa -> hpa -> spa): 0x1fffffffff -> 0x3ffffffeff -> 0x1004ffffeff
> dev1 (dpa -> hpa -> spa): 0x1fffffffff -> 0x3fffffffff -> 0x1004fffffff

Yes, would be the result without the offset applied for spa2 above.
The check above calculates the *total* length of hpa and spa with out
considering the interleaving position. This is corrected using the
offset. There is no call prm_cxl_dpa_spa(dev0, 0x1fffffffff) that
returns 0x1004fffffff, but we want to check the upper boundery of the
SPA range.

>
> Extra data: here are the programmed endpoint decoder values
>
> [endpoint5/decoder5.0]# cat start size dpa_size interleave_ways interleave_granularity
> 0x0
> 0x2000000000
> 0x0000002000000000
> 1
> 256
>
> [endpoint6/decoder6.0]# cat start size dpa_size interleave_ways interleave_granularity
> 0x0
> 0x2000000000
> 0x0000002000000000
> 1
> 256

This is correct and and must be half the size of the HPA window.

Thanks for testing.

-Robert

>
>
> Anyway, yeah I'm a bit confused how this is all supposed to actually
> work given that both devices translate to the same addresses.
>
> In theory this *should* work since the root decoder covers the whole
> space - as this has been working for me previously with some hacked up
> PRMT emulation code.
>
> [decoder0.0]# cat start size interleave_ways interleave_granularity
> 0xc050000000
> 0x4000000000
> 2
> 256
>
> [decoder1.0]# cat start size interleave_ways interleave_granularity
> 0xc050000000
> 0x4000000000
> 1
> 256
>
> [decoder3.0]# cat start size interleave_ways interleave_granularity
> 0xc050000000
> 0x4000000000
> 1
> 256
>
> [decoder5.0]# cat start size interleave_ways interleave_granularity
> 0x0
> 0x2000000000
> 1
> 256
>
> [decoder6.0]# cat start size interleave_ways interleave_granularity
> 0x0
> 0x2000000000
> 1
> 256
>
> ~Gregory