Re: [RFC PATCH] x86/acpi: Ignore invalid x2APIC entries

From: Jim Mattson
Date: Tue Oct 15 2024 - 09:28:10 EST


On Mon, Oct 14, 2024 at 8:23 PM Zhang, Rui <rui.zhang@xxxxxxxxx> wrote:
>
> On Mon, 2024-10-14 at 11:00 -0700, Jim Mattson wrote:
> > On Mon, Oct 14, 2024 at 6:05 AM Zhang, Rui <rui.zhang@xxxxxxxxx>
> > wrote:
> > >
> > > > > >
> > > > > > TBH, I'm not sure that there is actually anything wrong with
> > > > > > the
> > > > > > new
> > > > > > numbering scheme.
> > > > > > The topology is reported correctly (e.g. in
> > > > > > /sys/devices/system/cpu/cpu0/topology/thread_siblings_list).
> > > > > > Yet,
> > > > > > the
> > > > > > new enumeration does seem to contradict user expectations.
> > > > > >
> > > > >
> > > > > Well, we can say this is a violation of the ACPI spec.
> > > > > "OSPM should initialize processors in the order that they
> > > > > appear in
> > > > > the
> > > > > MADT." even for interleaved LAPIC and X2APIC entries.
> > > >
> > > > Ah. Thanks. I didn't know that.
> > > >
> > > > > Maybe we need two steps for LAPIC/X2APIC parsing.
> > > > > 1. check if there is valid LAPIC entry by going through all
> > > > > LAPIC
> > > > > entries first
> > > > > 2. parse LAPIC/X2APIC strictly following the order in MADT.
> > > > > (like
> > > > > we do
> > > > > before)
> > > >
> > > > That makes sense to me.
> > > >
> > > > Thanks,
> > > >
> > > > --jim
> > >
> > > Hi, Jim,
> > >
> > > Please check if below patch restores the CPU IDs or not.
> > >
> > > thanks,
> > > rui
> > >
> > > From ec786dfe693cad2810b54b0d8afbfc7e4c4b3f8a Mon Sep 17 00:00:00
> > > 2001
> > > From: Zhang Rui <rui.zhang@xxxxxxxxx>
> > > Date: Mon, 14 Oct 2024 13:26:55 +0800
> > > Subject: [PATCH] x86/acpi: Fix LAPIC/x2APIC parsing order
> > >
> > > On some systems, the same CPU (with same APIC ID) is assigned with
> > > a
> > > different logical CPU id after commit ec9aedb2aa1a ("x86/acpi:
> > > Ignore
> > > invalid x2APIC entries").
> > >
> > > This means Linux enumerates the CPUs in a different order and it is
> > > a
> > > violation of
> > > https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#madt-processor-local-apic-sapic-structure-entry-order
> > > ,
> > >
> > > "OSPM should initialize processors in the order that they appear
> > > in
> > > the MADT"
> > >
> > > The offending commit wants to ignore x2APIC entries with APIC ID <
> > > 255
> > > when valid LAPIC entries exist, so it parses all LAPIC entries
> > > before
> > > parsing any x2APIC entries. This breaks the CPU enumeration order
> > > for
> > > systems that have x2APIC entries listed before LAPIC entries in
> > > MADT.
> > >
> > > Fix the problem by checking the valid LAPIC entries separately,
> > > before
> > > parsing any LAPIC/x2APIC entries.
> > >
> > > Cc: stable@xxxxxxxxxxxxxxx
> > > Reported-by: Jim Mattson <jmattson@xxxxxxxxxx>
> > > Closes:
> > > https://lore.kernel.org/all/20241010213136.668672-1-jmattson@xxxxxxxxxx/
> > > Fixes: ec9aedb2aa1a ("x86/acpi: Ignore invalid x2APIC entries")
> > > Signed-off-by: Zhang Rui <rui.zhang@xxxxxxxxx>
> > > ---
> > > arch/x86/kernel/acpi/boot.c | 50
> > > +++++++++++++++++++++++++++++++++----
> > > 1 file changed, 45 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/x86/kernel/acpi/boot.c
> > > b/arch/x86/kernel/acpi/boot.c
> > > index 4efecac49863..c70b86f1f295 100644
> > > --- a/arch/x86/kernel/acpi/boot.c
> > > +++ b/arch/x86/kernel/acpi/boot.c
> > > @@ -226,6 +226,28 @@ acpi_parse_x2apic(union acpi_subtable_headers
> > > *header, const unsigned long end)
> > > return 0;
> > > }
> > >
> > > +static int __init
> > > +acpi_check_lapic(union acpi_subtable_headers *header, const
> > > unsigned long end)
> > > +{
> > > + struct acpi_madt_local_apic *processor = NULL;
> > > +
> > > + processor = (struct acpi_madt_local_apic *)header;
> > > +
> > > + if (BAD_MADT_ENTRY(processor, end))
> > > + return -EINVAL;
> > > +
> > > + /* Ignore invalid ID */
> > > + if (processor->id == 0xff)
> > > + return 0;
> > > +
> > > + /* Ignore processors that can not be onlined */
> > > + if (!acpi_is_processor_usable(processor->lapic_flags))
> > > + return 0;
> > > +
> > > + has_lapic_cpus = true;
> > > + return 0;
> > > +}
> > > +
> > > static int __init
> > > acpi_parse_lapic(union acpi_subtable_headers * header, const
> > > unsigned long end)
> > > {
> > > @@ -257,7 +279,6 @@ acpi_parse_lapic(union acpi_subtable_headers *
> > > header, const unsigned long end)
> > > processor->processor_id, /* ACPI ID
> > > */
> > > processor->lapic_flags &
> > > ACPI_MADT_ENABLED);
> > >
> > > - has_lapic_cpus = true;
> > > return 0;
> > > }
> > >
> > > @@ -1029,6 +1050,8 @@ static int __init
> > > early_acpi_parse_madt_lapic_addr_ovr(void)
> > > static int __init acpi_parse_madt_lapic_entries(void)
> > > {
> > > int count, x2count = 0;
> > > + struct acpi_subtable_proc madt_proc[2];
> > > + int ret;
> > >
> > > if (!boot_cpu_has(X86_FEATURE_APIC))
> > > return -ENODEV;
> > > @@ -1037,10 +1060,27 @@ static int __init
> > > acpi_parse_madt_lapic_entries(void)
> > > acpi_parse_sapic,
> > > MAX_LOCAL_APIC);
> > >
> > > if (!count) {
> > > - count =
> > > acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC,
> > > - acpi_parse_lapic,
> > > MAX_LOCAL_APIC);
> > > - x2count =
> > > acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_X2APIC,
> > > - acpi_parse_x2apic,
> > > MAX_LOCAL_APIC);
> >
> > The point is moot now, but I don't think the previous code did the
> > right thing when acpi_table_parse_madt() returned a negative value
> > (for errors).
>
> Previous and current code checks for the negative value later after
> parsing both LAPIC and x2APIC.
> so what is the problem you're referring to?
> Do you mean we should error out immediately when parsing LAPIC fails?

My mistake. I should have looked at the full context rather than just
the context of the patch.

> >
> > > + /* Check if there are valid LAPIC entries */
> > > + acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC,
> > > acpi_check_lapic, MAX_LOCAL_APIC);
> >
> > Two comments:
> >
> > 1) Should we check for a return value < 0 here, or just wait for one
> > of the later walks to error out?
>
> I'm okay with both.
>
> > 2) It seems unfortunate to walk the entire table when the first entry
> > may give you the answer, but perhaps modern systems have only X2APIC
> > entries, so we will typically have to walk the entire table anyway.
>
> yeah. There are systems with invalid LAPIC entries first, and
> acpi_parse_entries_array() doesn't support graceful early termination,
> so we have to check all the entries.
>
> >
> > Reviewed-and-tested-by: Jim Mattson <jmattson@xxxxxxxxxx>
>
> Thanks. I will submit the current version to keep your tags.
>
> -rui

Thanks!

--jim