Re: [RFC PATCH] resource: Fix CXL node not populated issue
From: Dan Williams
Date: Fri Dec 06 2024 - 02:51:06 EST
Raghavendra K T wrote:
>
>
> On 12/4/2024 9:25 AM, Dan Williams wrote:
> > [ add regressions@xxxxxxxxxxxxxxx ]
> >
> > Next time make the subject of the patch:
> >
> > Revert "resource: fix region_intersects() vs add_memory_driver_managed()"
> >
> > ...to make it clear that this is a revert, not a fix.
> >
> > The revert should be applied if a fix does not materialize in the next few weeks.
> >
>
> Agreed regarding fix.
> One thing to note is that it is not an exact revert.
>
> > Raghavendra K T wrote:
> >> Before:
> >> ~]$ numastat -m
> >> ...
> >>                          Node 0          Node 1           Total
> >>                 --------------- --------------- ---------------
> >> MemTotal              128096.18       128838.48       256934.65
> >>
> >> After:
> >> $ numastat -m
> >> .....
> >>                          Node 0          Node 1          Node 2           Total
> >>                 --------------- --------------- --------------- ---------------
> >> MemTotal              128054.16       128880.51       129024.00       385958.67
> >>
> >> The current patch reverts the effect of the first commit where the issue is seen.
> >
> > Might you be able to dig a bit further into the details like memory map
> > for this platform and ACPI SRAT tables? A dmesg comparison of the good
> > and bad cases would be useful (those can be shared via a github gist).
> > Even better would be some debug instrumentation to identify which call
> > to __region_intersects() started behaving differently resulting in a
> > whole node disappearing.
> >
> > In terms of the urgency of fixing this it would also help to know how
> > prevalent the system this was found on is in the wild.
>
> I have compared dmesg and /proc/iomem for both the success and fail cases.
>
> A. dmesg:
>
> 1. Address ranges are different
> 2. Extra messages about demotion targets are printed
>
> Fallback order for Node 0: 0 1 2
> Fallback order for Node 1: 1 0 2
> Fallback order for Node 2: 2 0 1
> Built 3 zonelists, mobility grouping on. Total pages: 66145521
> Policy zone: Normal
> ....
> Demotion targets for Node 0: preferred: 2, fallback: 2
> Demotion targets for Node 1: preferred: 2, fallback: 2
> Demotion targets for Node 2: null
>
> B. /proc/iomem
>
> $ vimdiff success fail
>
> 164 4050000000-604fffffff : Soft Reserved     | 164 4050000000-604fffffff : Soft Reserved
> 165 4050000000-604fffffff : CXL Window 0      | 165 4050000000-604fffffff : CXL Window 0
> 166 4080000000-5fffffffff : dax1.0            | -----------------------------------------
> 167 4080000000-5fffffffff : System RAM (kmem) | -----------------------------------------
My eyes only know how to read unified diff (diff -u) format. Is this
saying that in the failure case the System RAM range for dax1.0 is
missing?
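
For reference, here is a minimal sketch of how the two /proc/iomem captures
could be turned into the unified format, equivalent in spirit to running
diff -u on the two files. The sample lines are copied from the excerpt
above; the helper name iomem_udiff is hypothetical, and the contents of the
"fail" capture are an assumption based on my reading of the vimdiff output
(the dax1.0 and System RAM (kmem) entries absent), which is exactly what I
am asking to have confirmed.

```python
import difflib

# Lines copied from the vimdiff excerpt in this thread.
success = [
    "4050000000-604fffffff : Soft Reserved",
    "4050000000-604fffffff : CXL Window 0",
    "4080000000-5fffffffff : dax1.0",
    "4080000000-5fffffffff : System RAM (kmem)",
]

# ASSUMPTION: in the fail case the last two entries are missing.
fail = success[:2]


def iomem_udiff(good, bad):
    """Return unified-diff lines going from the good capture to the bad one."""
    return list(
        difflib.unified_diff(
            good, bad, fromfile="success", tofile="fail", lineterm=""
        )
    )


for line in iomem_udiff(success, fail):
    print(line)
```

With real captures, reading each file into a list of lines (with trailing
newlines stripped) and passing both lists to the same helper would give a
diff that can be pasted inline in a reply.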