Re: [PATCH 2/3] cxl/core: Enable Region creation on x86 with Low Memory Hole
From: Fabio M. De Francesco
Date: Wed Jan 08 2025 - 09:48:52 EST
Hi Robert,
On Monday, December 16, 2024 10:30:11 PM GMT+1 Robert Richter wrote:
> Hi Fabio,
>
> On 22.11.24 16:51:53, Fabio M. De Francesco wrote:
> > The CXL Fixed Memory Window Structure (CFMWS) describes zero or more Host
> > Physical Address (HPA) windows that are associated with each CXL Host
> > Bridge. Each window represents a contiguous HPA that may be interleaved
> > with one or more targets (CXL v3.1 - 9.18.1.3).
> >
> > The Low Memory Hole (LMH) of x86 is a range of addresses of physical low
> > memory to which systems cannot send transactions. In some cases the size
> > of that hole is not compatible with the CXL hardware decoder constraint
> > that the size is always aligned to 256M * Interleave Ways.
> >
> > On those systems, BIOS publishes CFMWS which communicate the active System
> > Physical Address (SPA) ranges that map to a subset of the Host Physical
> > Address (HPA) ranges. The SPA range trims out the hole, and capacity in
> > the endpoint is lost with no SPA to map to CXL HPA in that hole.
> >
> > In the early stages of CXL Regions construction and attach on platforms
> > with Low Memory Holes, cxl_add_to_region() fails and returns an error
> > because it can't find any CXL Window that matches a given CXL Endpoint
> > Decoder.
> >
> > Detect Low Memory Holes by comparing Root Decoders and Endpoint Decoders
> > ranges: both must start at physical address 0 and end below 4 GB, while
> > the Root Decoder ranges end at lower addresses than the matching Endpoint
> > Decoders which instead must always respect the above-mentioned CXL hardware
> > decoders HPA alignment constraint.
> >
> > Match misaligned CFMWS and CXL Regions with corresponding CXL Endpoint
> > Decoders if driver detects Low Memory Holes.
> >
> > Construct CXL Regions with HPA range's end trimmed by matching SPA.
> >
> > Allow the attach target process to complete by relaxing Decoder constraints
> > which would lead to failures.
>
> I am implementing an address translation series and will post the
> patches in a couple of days. It solves a similar problem to remap
> (zero-based) endpoint address ranges to a different SPA range. I
> implemented a generic approach that can easily be reused here. With
> only a few changes to your series it could be integrated there (I
> could rebase my code onto your's then in a later version of my
> series).
>
> The change I suggest to your series will also significantly simplify
> your code and the function interface. The general idea is to have a
> .spa_range attribute in struct cxl_endpoint_decoder:
>
> @@ -428,6 +428,7 @@ struct cxl_endpoint_decoder {
> struct cxl_decoder cxld;
> struct cxl_root_decoder *cxlrd;
> struct resource *dpa_res;
> + struct range spa_range;
> resource_size_t skip;
> enum cxl_decoder_mode mode;
> enum cxl_decoder_state state;
>
> Now, then an endpoint is detected, the spa_range is assigned,
> typically this is cxled->spa_range = cxled->cxld.hpa_range (SPA ==
> HPA). Whenever a port is added, a function cxl_port_platform_setup()
> is called which could also contain a call to
> cxl_port_setup_intel_lmh() or so.
>
> @@ -865,6 +870,8 @@ static int cxl_port_add(struct cxl_port *port,
> return rc;
> }
>
> + cxl_port_platform_setup(port);
> +
> rc = device_add(dev);
> if (rc)
> return rc;
>
> cxl_port_setup_intel_lmh() does all the platform checks and
> recalculates the spa_range for the endpoints based on the code you
> already have.
>
> The endpoint's SPA address range (instead of hpa_range)
> is now used to construct the region by changing
> match_region_by_range() accordingly. This avoids the instrumentation
> of all the other checks you needed to add in this patch (use of
> arch_match_region() and arch_match_spa() etc.).
>
> The only function that needs to be exported is
> cxl_port_setup_intel_lmh() then. Everything else is generic code. This
> is also a good encapsulation to other archs, vendors, platforms etc. I
> think that would really improve your series. Let me know if you have
> questions.
>
Sorry for this late reply. I just come back from vacation :)
It's too early for asking questions. I'll wait until I read your series.
I just saw you posted it yesterday.
I'll rework my series and just export cxl_port_setup_intel_lmh() as you
suggested, if it would result into a better solution.
I can't yet easily figure out how to detect an LMH while still in cxl_port_add(),
I'll think about it.
>
> See also inline below, I have some questions and comments.
>
> >
> > Cc: Alison Schofield <alison.schofield@xxxxxxxxx>
> > Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> > Cc: Ira Weiny <ira.weiny@xxxxxxxxx>
> > Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@xxxxxxxxxxxxxxx>
> > ---
> > drivers/cxl/Kconfig | 5 ++++
> > drivers/cxl/core/Makefile | 1 +
> > drivers/cxl/core/lmh.c | 53 +++++++++++++++++++++++++++++++++++++
>
> Could you name this file intel.c and have necessary Kconfig options?
>
I named it x86.c, but in internal review Dan asked to rename it lmh.c
>
> If code should be reused later we still can move it to common.c or
> similar. It would be good to guard that having a pci_device_id check
> to a root port or a vendor cpu check.
>
> See my comment no enablement above.
>
> > drivers/cxl/core/region.c | 55 ++++++++++++++++++++++++++++++++-------
> > drivers/cxl/cxl.h | 25 ++++++++++++++++++
> > 5 files changed, 130 insertions(+), 9 deletions(-)
> > create mode 100644 drivers/cxl/core/lmh.c
> >
> > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > index 876469e23f7a..07b87f217e59 100644
> > --- a/drivers/cxl/Kconfig
> > +++ b/drivers/cxl/Kconfig
> > @@ -128,6 +128,11 @@ config CXL_REGION
> >
> > If unsure say 'y'
> >
> > +config CXL_ARCH_LOW_MEMORY_HOLE
> > + def_bool y
> > + depends on CXL_REGION
> > + depends on X86
> > +
> > config CXL_REGION_INVALIDATION_TEST
> > bool "CXL: Region Cache Management Bypass (TEST)"
> > depends on CXL_REGION
> > diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> > index 9259bcc6773c..6e80215e8444 100644
> > --- a/drivers/cxl/core/Makefile
> > +++ b/drivers/cxl/core/Makefile
> > @@ -15,4 +15,5 @@ cxl_core-y += hdm.o
> > cxl_core-y += pmu.o
> > cxl_core-y += cdat.o
> > cxl_core-$(CONFIG_TRACING) += trace.o
> > +cxl_core-$(CONFIG_CXL_ARCH_LOW_MEMORY_HOLE) += lmh.o
> > cxl_core-$(CONFIG_CXL_REGION) += region.o
> > diff --git a/drivers/cxl/core/lmh.c b/drivers/cxl/core/lmh.c
> > new file mode 100644
> > index 000000000000..da76b2a534ec
> > --- /dev/null
> > +++ b/drivers/cxl/core/lmh.c
> > @@ -0,0 +1,53 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +
> > +#include <linux/range.h>
> > +#include "cxl.h"
> > +
> > +/* In x86 with memory hole, misaligned CFMWS range starts at 0x0 */
> > +#define MISALIGNED_CFMWS_RANGE_BASE 0x0
>
> Could you more elaborate on this? Why is the base zero?
>
I know it because I saw it's zero in all configurations showed by and discussed with
BIOS engineers.
>
> > +
> > +/*
> > + * Match CXL Root and Endpoint Decoders by comparing SPA and HPA ranges.
> > + *
> > + * On x86, CFMWS ranges never intersect memory holes while endpoint decoders
> > + * HPA range sizes are always guaranteed aligned to NIW * 256MB; therefore,
> > + * the given endpoint decoder HPA range size is always expected aligned and
> > + * also larger than that of the matching root decoder
> > + */
> > +bool arch_match_spa(struct cxl_root_decoder *cxlrd,
> > + struct cxl_endpoint_decoder *cxled)
> > +{
> > + struct range *r1, *r2;
> > + int niw;
> > +
> > + r1 = &cxlrd->cxlsd.cxld.hpa_range;
> > + r2 = &cxled->cxld.hpa_range;
> > + niw = cxled->cxld.interleave_ways;
> > +
> > + if (r1->start == MISALIGNED_CFMWS_RANGE_BASE &&
> > + r1->start == r2->start && r1->end < r2->end &&
> > + IS_ALIGNED(range_len(r2), niw * SZ_256M))
> > + return true;
> > + return false;
> > +}
> > +
> > +/* Similar to arch_match_spa(), it matches regions and decoders */
> > +bool arch_match_region(struct cxl_region_params *p,
> > + struct cxl_decoder *cxld)
> > +{
> > + struct range *r = &cxld->hpa_range;
> > + struct resource *res = p->res;
> > + int niw = cxld->interleave_ways;
> > +
> > + if (res->start == MISALIGNED_CFMWS_RANGE_BASE &&
> > + res->start == r->start && res->end < r->end &&
> > + IS_ALIGNED(range_len(r), niw * SZ_256M))
> > + return true;
>
> This basically means that the region base is zero?
>
Yes, it is.
>
> How do you
> determine the actual base address of the range then? The cxl test
> patch contains a non-zero address.
>
Next version of this series is going to use 0x0 for real hardware and
mock_cfmws[0]->base_hpa for the topology created by cxl tests.
>
> this trimmed mem hole regions? Could you share a snippet of a CFMWS
> entry for this?
>
Sorry, I don't have it because no real hardware was required to make and test
my series.
>
> Thanks,
>
> -Robert
>
> > + return false;
> > +}
> > +
> > +void arch_trim_hpa_by_spa(struct resource *res,
> > + struct cxl_root_decoder *cxlrd)
> > +{
> > + res->end = cxlrd->res->end;
> > +}
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index ac2c486c16e9..3cb5a69e9731 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
> > @@ -836,8 +836,12 @@ static int match_auto_decoder(struct device *dev, void *data)
> > cxld = to_cxl_decoder(dev);
> > r = &cxld->hpa_range;
> >
> > - if (p->res && p->res->start == r->start && p->res->end == r->end)
> > - return 1;
> > + if (p->res) {
> > + if (p->res->start == r->start && p->res->end == r->end)
> > + return 1;
> > + if (arch_match_region(p, cxld))
> > + return 1;
> > + }
> >
> > return 0;
> > }
> > @@ -1414,7 +1418,8 @@ static int cxl_port_setup_targets(struct cxl_port *port,
> > if (cxld->interleave_ways != iw ||
> > cxld->interleave_granularity != ig ||
> > cxld->hpa_range.start != p->res->start ||
> > - cxld->hpa_range.end != p->res->end ||
> > + (cxld->hpa_range.end != p->res->end &&
> > + !arch_match_region(p, cxld)) ||
> > ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) {
> > dev_err(&cxlr->dev,
> > "%s:%s %s expected iw: %d ig: %d %pr\n",
> > @@ -1726,6 +1731,7 @@ static int match_switch_decoder_by_range(struct device *dev, void *data)
> > {
> > struct cxl_endpoint_decoder *cxled = data;
> > struct cxl_switch_decoder *cxlsd;
> > + struct cxl_root_decoder *cxlrd;
> > struct range *r1, *r2;
> >
> > if (!is_switch_decoder(dev))
> > @@ -1735,8 +1741,13 @@ static int match_switch_decoder_by_range(struct device *dev, void *data)
> > r1 = &cxlsd->cxld.hpa_range;
> > r2 = &cxled->cxld.hpa_range;
> >
> > - if (is_root_decoder(dev))
> > - return range_contains(r1, r2);
> > + if (is_root_decoder(dev)) {
> > + if (range_contains(r1, r2))
> > + return 1;
> > + cxlrd = to_cxl_root_decoder(dev);
> > + if (arch_match_spa(cxlrd, cxled))
> > + return 1;
> > + }
> > return (r1->start == r2->start && r1->end == r2->end);
> > }
> >
> > @@ -1943,7 +1954,7 @@ static int cxl_region_attach(struct cxl_region *cxlr,
> > }
> >
> > if (resource_size(cxled->dpa_res) * p->interleave_ways !=
> > - resource_size(p->res)) {
> > + resource_size(p->res) && !arch_match_spa(cxlrd, cxled)) {
> > dev_dbg(&cxlr->dev,
> > "%s:%s: decoder-size-%#llx * ways-%d != region-size-%#llx\n",
> > dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> > @@ -3199,7 +3210,13 @@ static int match_root_decoder_by_range(struct device *dev, void *data)
> > r1 = &cxlrd->cxlsd.cxld.hpa_range;
> > r2 = &cxled->cxld.hpa_range;
> >
> > - return range_contains(r1, r2);
> > + if (range_contains(r1, r2))
> > + return true;
> > +
> > + if (arch_match_spa(cxlrd, cxled))
> > + return true;
> > +
> > + return false;
> > }
> >
> > static int match_region_by_range(struct device *dev, void *data)
> > @@ -3217,8 +3234,12 @@ static int match_region_by_range(struct device *dev, void *data)
> > p = &cxlr->params;
> >
> > down_read(&cxl_region_rwsem);
> > - if (p->res && p->res->start == r->start && p->res->end == r->end)
> > - rc = 1;
> > + if (p->res) {
> > + if (p->res->start == r->start && p->res->end == r->end)
> > + rc = 1;
> > + if (arch_match_region(p, &cxled->cxld))
> > + rc = 1;
> > + }
> > up_read(&cxl_region_rwsem);
> >
> > return rc;
> > @@ -3270,6 +3291,22 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
> >
> > *res = DEFINE_RES_MEM_NAMED(hpa->start, range_len(hpa),
> > dev_name(&cxlr->dev));
> > +
> > + /*
> > + * Trim the HPA retrieved from hardware to fit the SPA mapped by the
> > + * platform
> > + */
> > + if (arch_match_spa(cxlrd, cxled)) {
> > + struct range *range = &cxlrd->cxlsd.cxld.hpa_range;
> > +
> > + arch_trim_hpa_by_spa(res, cxlrd);
> > + dev_dbg(cxlmd->dev.parent,
> > + "%s: Trim HPA (%s: %pr) by SPA (%s: %pr)\n",
> > + __func__,
> > + dev_name(&cxlrd->cxlsd.cxld.dev), range,
> > + dev_name(&cxled->cxld.dev), hpa);
> > + }
> > +
> > rc = insert_resource(cxlrd->res, res);
> > if (rc) {
> > /*
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index 5406e3ab3d4a..a5ad4499381e 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -902,6 +902,31 @@ void cxl_coordinates_combine(struct access_coordinate *out,
> >
> > bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port);
> >
> > +#ifdef CONFIG_CXL_ARCH_LOW_MEMORY_HOLE
> > +bool arch_match_spa(struct cxl_root_decoder *cxlrd,
> > + struct cxl_endpoint_decoder *cxled);
> > +bool arch_match_region(struct cxl_region_params *p,
> > + struct cxl_decoder *cxld);
> > +void arch_trim_hpa_by_spa(struct resource *res,
> > + struct cxl_root_decoder *cxlrd);
> > +#else
> > +bool arch_match_spa(struct cxl_root_decoder *cxlrd,
> > + struct cxl_endpoint_decoder *cxled)
> > +{
> > + return false;
> > +}
> > +
> > +bool arch_match_region(struct cxl_region_params *p,
> > + struct cxl_decoder *cxld)
> > +{
> > + return false;
> > +}
> > +
> > +void arch_trim_hpa_by_spa(struct resource *res,
> > + struct cxl_root_decoder *cxlrd)
> > +{ }
> > +#endif /* CXL_ARCH_LOW_MEMORY_HOLE */
> > +
> > /*
> > * Unit test builds overrides this to __weak, find the 'strong' version
> > * of these symbols in tools/testing/cxl/.
>
>