Re: [PATCH 2/3] cxl/core: Enable Region creation on x86 with Low Memory Hole

From: Fabio M. De Francesco
Date: Fri Jan 10 2025 - 11:07:12 EST


Robert,

On Thursday, January 9, 2025 11:58:37 AM GMT+1 Robert Richter wrote:
> Fabio,
>
> On 08.01.25 15:48:34, Fabio M. De Francesco wrote:
> > Hi Robert,
> >
> > On Monday, December 16, 2024 10:30:11 PM GMT+1 Robert Richter wrote:
> > > Hi Fabio,
> > >
> > > On 22.11.24 16:51:53, Fabio M. De Francesco wrote:
> > > > The CXL Fixed Memory Window Structure (CFMWS) describes zero or more Host
> > > > Physical Address (HPA) windows that are associated with each CXL Host
> > > > Bridge. Each window represents a contiguous HPA that may be interleaved
> > > > with one or more targets (CXL v3.1 - 9.18.1.3).
> > > >
> > > > The Low Memory Hole (LMH) of x86 is a range of addresses of physical low
> > > > memory to which systems cannot send transactions. In some cases the size
> > > > of that hole is not compatible with the CXL hardware decoder constraint
> > > > that the size is always aligned to 256M * Interleave Ways.
> > > >
> > > > On those systems, BIOS publishes CFMWS which communicate the active System
> > > > Physical Address (SPA) ranges that map to a subset of the Host Physical
> > > > Address (HPA) ranges. The SPA range trims out the hole, and capacity in
> > > > the endpoint is lost with no SPA to map to CXL HPA in that hole.
> > > >
> > > > In the early stages of CXL Regions construction and attach on platforms
> > > > with Low Memory Holes, cxl_add_to_region() fails and returns an error
> > > > because it can't find any CXL Window that matches a given CXL Endpoint
> > > > Decoder.
> > > >
> > > > Detect Low Memory Holes by comparing Root Decoders and Endpoint Decoders
> > > > ranges: both must start at physical address 0 and end below 4 GB, while
> > > > the Root Decoder ranges end at lower addresses than the matching Endpoint
> > > > Decoders which instead must always respect the above-mentioned CXL hardware
> > > > decoders HPA alignment constraint.
> > > >
> > > > Match misaligned CFMWS and CXL Regions with corresponding CXL Endpoint
> > > > Decoders if driver detects Low Memory Holes.
> > > >
> > > > Construct CXL Regions with HPA range's end trimmed by matching SPA.
> > > >
> > > > Allow the attach target process to complete by relaxing Decoder constraints
> > > > which would lead to failures.
> > >
> > > [...]
> > >

> > > I am implementing an address translation series and will post the
> > > patches in a couple of days. It solves a similar problem to remap
> > > (zero-based) endpoint address ranges to a different SPA range. I
> > > implemented a generic approach that can easily be reused here. With
> > > only a few changes to your series it could be integrated there (I
> > > could rebase my code onto your's then in a later version of my
> > > series).
> > >
> > > The change I suggest to your series will also significantly simplify
> > > your code and the function interface. The general idea is to have a
> > > .spa_range attribute in struct cxl_endpoint_decoder:
> > >
> > > @@ -428,6 +428,7 @@ struct cxl_endpoint_decoder {
> > > struct cxl_decoder cxld;
> > > struct cxl_root_decoder *cxlrd;
> > > struct resource *dpa_res;
> > > + struct range spa_range;
> > > resource_size_t skip;
> > > enum cxl_decoder_mode mode;
> > > enum cxl_decoder_state state;
> > >
> > > Now, then an endpoint is detected, the spa_range is assigned,
> > > typically this is cxled->spa_range = cxled->cxld.hpa_range (SPA ==
> > > HPA). Whenever a port is added, a function cxl_port_platform_setup()
> > > is called which could also contain a call to
> > > cxl_port_setup_intel_lmh() or so.
> > >
> > > @@ -865,6 +870,8 @@ static int cxl_port_add(struct cxl_port *port,
> > > return rc;
> > > }
> > >
> > > + cxl_port_platform_setup(port);
> > > +
> > > rc = device_add(dev);
> > > if (rc)
> > > return rc;
> > >
> > > cxl_port_setup_intel_lmh() does all the platform checks and
> > > recalculates the spa_range for the endpoints based on the code you
> > > already have.
> > >
> > > The endpoint's SPA address range (instead of hpa_range)
> > > is now used to construct the region by changing
> > > match_region_by_range() accordingly. This avoids the instrumentation
> > > of all the other checks you needed to add in this patch (use of
> > > arch_match_region() and arch_match_spa() etc.).
> > >
> > > The only function that needs to be exported is
> > > cxl_port_setup_intel_lmh() then. Everything else is generic code. This
> > > is also a good encapsulation to other archs, vendors, platforms etc. I
> > > think that would really improve your series. Let me know if you have
> > > questions.
> > >
> > Sorry for this late reply. I just come back from vacation :)
> >
> > It's too early for asking questions. I'll wait until I read your series.
> > I just saw you posted it yesterday.
>
> np, my bad, it took longer than expected to send this out. Take your
> time.
>
No problem. Yesterday, I began to read your series. I'll comment and
review there as soon as possible.

Thank you for adding me to the list of recipients.
>
> >
> > I'll rework my series and just export cxl_port_setup_intel_lmh() as you
> > suggested, if it would result into a better solution.
>
> In a v2 I am going to reorder and possibly split the patches for you
> to reuse the first part then.
>
Thanks for that, but I can't yet have visibility on what is going to happen.
Internally, I've been suggested to post my v2 as is, not re-based on your
series for now. A few colleagues want to have the opportunity to read it.
Later we will see which the best course of action will be.
> >
> > I can't yet easily figure out how to detect an LMH while still in cxl_port_add(),
>
> I think you need to detect it close before region setup when decoders
> are enumerated. I will revisit your patch again and take look to find
> a solution.
>
Thanks, but I think you missed something. More on this below...
>
> > I'll think about it.
> > >
> > > See also inline below, I have some questions and comments.
> > >
> > > >
> > > > Cc: Alison Schofield <alison.schofield@xxxxxxxxx>
> > > > Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> > > > Cc: Ira Weiny <ira.weiny@xxxxxxxxx>
> > > > Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@xxxxxxxxxxxxxxx>
> > > > ---
> > > > drivers/cxl/Kconfig | 5 ++++
> > > > drivers/cxl/core/Makefile | 1 +
> > > > drivers/cxl/core/lmh.c | 53 +++++++++++++++++++++++++++++++++++++
> > >
> > > Could you name this file intel.c and have necessary Kconfig options?
> > >
> > I named it x86.c, but in internal review Dan asked to rename it lmh.c
>
> The LMH detection is unclear yet, just checking for zero is IMO not
> sufficient (looked at your answer below). As long as there is no
> generic check a platform check is needed in addition.
>
Thank you for providing your proposed response. I'll now review it for
clarity and fluency, and offer suggestions for improvement:

It's not just checking for zero. In the 'if' statements of
arch_match_spa() and arch_match_region(), there's a bit of more complexity
involved.

These functions essentially test CFMWS and corresponding EP Decoders
when both ranges start at zero. In such cases, if SPA is a subset of EP
Decoder HPA, and that HPA aligns with the 256M * NIW constraints, we assume
an LMH is truncating the CFMWS range. We then return true to match Root
Decoders to Endpoint Decoders, or Regions to EP Decoders, preventing driver
failures while it constructs Regions and attach EP to them.

I believe I discussed this with Ming Li, who had similar questions. However,
the lack of immediate clarity suggests that I should rework these 'if'
statements.

A revised approach for v2 might test whether an SPA range that starts at
zero is shorter than the corresponding EP Deccoder HPA range AND ends before
4G. Again, if it holds true, assume an LMH is truncating that published CFMWS
and so return true and preventing driver failures.

I'll think about it for v2. Anyway I'm going to submit it in one or two days.
>
> > >
> > > If code should be reused later we still can move it to common.c or
> > > similar. It would be good to guard that having a pci_device_id check
> > > to a root port or a vendor cpu check.

That sounds good. I'll think about it. Sorry for missing it in my previous
reply.

> > >
> > > See my comment no enablement above.
> > >
> > > > drivers/cxl/core/region.c | 55 ++++++++++++++++++++++++++++++++-------
> > > > drivers/cxl/cxl.h | 25 ++++++++++++++++++
> > > > 5 files changed, 130 insertions(+), 9 deletions(-)
> > > > create mode 100644 drivers/cxl/core/lmh.c
> > > >
> > > > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > > > index 876469e23f7a..07b87f217e59 100644
> > > > --- a/drivers/cxl/Kconfig
> > > > +++ b/drivers/cxl/Kconfig
> > > > @@ -128,6 +128,11 @@ config CXL_REGION
> > > >
> > > > If unsure say 'y'
> > > >
> > > > +config CXL_ARCH_LOW_MEMORY_HOLE
> > > > + def_bool y
> > > > + depends on CXL_REGION
> > > > + depends on X86
> > > > +
> > > > config CXL_REGION_INVALIDATION_TEST
> > > > bool "CXL: Region Cache Management Bypass (TEST)"
> > > > depends on CXL_REGION
> > > > diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> > > > index 9259bcc6773c..6e80215e8444 100644
> > > > --- a/drivers/cxl/core/Makefile
> > > > +++ b/drivers/cxl/core/Makefile
> > > > @@ -15,4 +15,5 @@ cxl_core-y += hdm.o
> > > > cxl_core-y += pmu.o
> > > > cxl_core-y += cdat.o
> > > > cxl_core-$(CONFIG_TRACING) += trace.o
> > > > +cxl_core-$(CONFIG_CXL_ARCH_LOW_MEMORY_HOLE) += lmh.o
> > > > cxl_core-$(CONFIG_CXL_REGION) += region.o
> > > > diff --git a/drivers/cxl/core/lmh.c b/drivers/cxl/core/lmh.c
> > > > new file mode 100644
> > > > index 000000000000..da76b2a534ec
> > > > --- /dev/null
> > > > +++ b/drivers/cxl/core/lmh.c
> > > > @@ -0,0 +1,53 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-only
> > > > +
> > > > +#include <linux/range.h>
> > > > +#include "cxl.h"
> > > > +
> > > > +/* In x86 with memory hole, misaligned CFMWS range starts at 0x0 */
> > > > +#define MISALIGNED_CFMWS_RANGE_BASE 0x0
> > >
> > > Could you more elaborate on this? Why is the base zero?
> > >
> > I know it because I saw it's zero in all configurations showed by and discussed with
> > BIOS engineers.
>
> As said above, an additional check is needed here.
>
Okay.
>
> [...]
>
> > > > +/* Similar to arch_match_spa(), it matches regions and decoders */
> > > > +bool arch_match_region(struct cxl_region_params *p,
> > > > + struct cxl_decoder *cxld)
> > > > +{
> > > > + struct range *r = &cxld->hpa_range;
> > > > + struct resource *res = p->res;
> > > > + int niw = cxld->interleave_ways;
> > > > +
> > > > + if (res->start == MISALIGNED_CFMWS_RANGE_BASE &&
> > > > + res->start == r->start && res->end < r->end &&
> > > > + IS_ALIGNED(range_len(r), niw * SZ_256M))
> > > > + return true;
> > >
> > > This basically means that the region base is zero?
> > >
> > Yes, it is.
> > >
> > > How do you
> > > determine the actual base address of the range then? The cxl test
> > > patch contains a non-zero address.
> > >
> > Next version of this series is going to use 0x0 for real hardware and
> > mock_cfmws[0]->base_hpa for the topology created by cxl tests.
> > >
> > > this trimmed mem hole regions? Could you share a snippet of a CFMWS
> > > entry for this?
> > >
> > Sorry, I don't have it because no real hardware was required to make and test
> > my series.
>
> Ok, will reorder my patches in v2 for you to reuse and take a look at
> your patch and the base_hpa setup. Will comment there again.
>
> Thanks,
>
> -Robert
>
Thanks,

Fabio