Re: [PATCH] ACPI: APEI: GHES: Improve ghes_notify_nmi() status check

From: Shuai Xue

Date: Thu Nov 06 2025 - 21:20:14 EST

在 2025/11/7 02:03, Luck, Tony 写道:

On Thu, Nov 06, 2025 at 01:04:05PM +0800, Shuai Xue wrote:

在 2025/11/6 10:09, Luck, Tony 写道:

On Thu, Nov 06, 2025 at 09:46:33AM +0800, Shuai Xue wrote:

在 2025/11/4 07:05, Tony Luck 写道:

ghes_notify_nmi() is called for every NMI and must check whether the NMI was
generated because an error was signalled by platform firmware.

This check is very expensive as for each registered GHES NMI source it reads
from the acpi generic address attached to this error source to get the physical
address of the acpi_hest_generic_status block. It then checks the "block_status"
to see if an error was logged.

The ACPI/APEI code must create virtual mappings for each of those physical
addresses, and tear them down afterwards. On an Icelake system this takes around
15,000 TSC cycles. Enough to disturb efforts to profile system performance.

Hi, Tony

Interesting.

If I understand correctly, you mean ghes_peek_estatus() and
ghes_clear_estatus().

I conducted performance testing on our system (ARM v8) and found the
following average costs:

- ghes_peek_estatus(): 8,138.3 ns (21,160 cycles)
- ghes_clear_estatus(): 2,038.3 ns (5,300 cycles)

ARM doesn't use the NMI path (HAVE_ACPI_APEI_NMI is only set on X86).
But maybe you are looking at ghes_notify_sea() which seems similar?

Yes. It is measured in ghes_notify_sea().

If that were not bad enough, there are some atomic accesses in the code path
that will cause cache line bounces between CPUs. A problem that gets worse as
the core count increases.

Could you elaborate on which specific atomic accesses you're referring to?

ghes_notify_nmi() starts with code to ensure only one CPU executes the
GHES NMI path.

if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
return ret;

Looks like an optimization to avoid having a bunch of CPUs queue up
waiting for this spinllock:

raw_spin_lock(&ghes_notify_lock_nmi);

when the first one to get it will find and handle the logged error.

If an NMI issued, at last one status block is active. I don't see how
the code path is different.

On X86 NMI doesn't include a vector. There is just one interrupt entry
point for any NMI.

Linux deals with this with each subsystem that may expect an NMI
registering on an NMI notify chain. When an NMI happens the core
code calls each subsystem function to ask:

Was this NMI for you?
Was this NMI for you?
Was this NMI for you?
Was this NMI for you?
Was this NMI for you?
Was this NMI for you?
Was this NMI for you?
Was this NMI for you?

This is a problem when profiling the system with "perf" as there are
thousands of NMIs, and each one calls ghes_notify_nmi() which will look
to see if an error was logged, and (after all the ACPI/APEI mapping
physical addresses, reading, and unmapping) will return NMI_DONE to
indicate that no error was found, so this NMI came from some other
source.

I see, thank you the datails explaination.

But BIOS changes neither the acpi generic address nor the physical address of
the acpi_hest_generic_status block. So this walk can be done once when the NMI is
registered to save the virtual address (unmapping if the NMI is ever unregistered).
The "block_status" can be checked directly in the NMI handler. This can be done
without any atomic accesses.

Resulting time to check that there is not an error record is around 900 cycles.

Reported-by: Andi Kleen <andi.kleen@xxxxxxxxx>
Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>

---
N.B. I only talked to an Intel BIOS expert about this. GHES code is shared by
other architectures, so it would be wise to get confirmation on whether this
assumption applies to all, or is Intel (or X86) specific.

The assumption is "BIOS changes neither the acpi generic address nor the
physical address of the acpi_hest_generic_status block."?

I've consulted with our BIOS experts from both ARM and RISC-V platform
teams, and they confirmed that error status blocks are reserved at boot
time and remain unchanged during runtime.

Thanks. Good to have this confirmation.

---
include/acpi/ghes.h | 1 +
drivers/acpi/apei/ghes.c | 39 ++++++++++++++++++++++++++++++++++++---
2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index ebd21b05fe6e..58655d313a1f 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -29,6 +29,7 @@ struct ghes {
};
struct device *dev;
struct list_head elist;
+ void __iomem *error_status_vaddr;
};
struct ghes_estatus_node {
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 97ee19f2cae0..62713b612865 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -1425,7 +1425,21 @@ static LIST_HEAD(ghes_nmi);
static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
{
static DEFINE_RAW_SPINLOCK(ghes_notify_lock_nmi);
+ bool active_error = false;
int ret = NMI_DONE;
+ struct ghes *ghes;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
+ if (ghes->error_status_vaddr && readl(ghes->error_status_vaddr)) {
+ active_error = true;
+ break;
+ }
+ }
+ rcu_read_unlock();
+
+ if (!active_error)
+ return ret;

Shoud we put active_error into struct ghes? If we know it is active, we
do not need to call __ghes_peek_estatus() to estatus->block_status().

That might be a useful addition. I was primarily concerned in fixing the
"no erroor" case that happens at a very high rate while profiling the
system with "perf".

Do you mean you see "NMI received for unknown reason" when profiling with
"perf"? And the unknown error is handled by ghes_notify_nmi().

This message is printed when every handler on the NMI chain says "not
my interrupt" with the NMI_DONE return value.

$sudo perf script
swapper 0 [148] 1117148.086594: probe:ghes_notify_nmi: (ffffffffb3867210)
ffffffffb3867211 ghes_notify_nmi+0x1 (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb303c366 nmi_handle.part.0+0x56 (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb3dc7e0a default_do_nmi+0x11a (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb3dc7f40 exc_nmi+0x100 (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb3e0169a end_repeat_nmi+0x16 (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb3732915 _find_next_and_bit+0x65 (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb3177780 update_sg_lb_stats+0xc0 (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb317d5dc update_sd_lb_stats.constprop.0+0xcc (/lib/modules/6.6.102+/build/vmlinu>
ffffffffb317d7d1 find_busiest_group+0x41 (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb317dc0a load_balance+0x12a (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb317efd4 rebalance_domains+0x2b4 (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb31235d2 handle_softirqs+0xd2 (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb31238b3 __irq_exit_rcu+0xa3 (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb3dca462 sysvec_apic_timer_interrupt+0x72 (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb3e00e86 asm_sysvec_apic_timer_interrupt+0x16 (/lib/modules/6.6.102+/build/vmlin>
ffffffffb3dcbbee cpuidle_enter_state+0xbe (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb3adc6b9 cpuidle_enter+0x29 (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb31901ea cpuidle_idle_call+0xfa (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb31902db do_idle+0x7b (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb3190536 cpu_startup_entry+0x26 (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb30775d5 start_secondary+0x115 (/lib/modules/6.6.102+/build/vmlinux)
ffffffffb30001e7 secondary_startup_64_no_verify+0x182 (/lib/modules/6.6.102+/build/vmlin>

Yes, I even find that ghes_notify_nmi() is called when no perf is used! (:

I see some unknown NMI in production in Intel platform, but I did not
figure out how it happend. Can you help to explain it?

But skipping (or just removing?
__ghes_peek_estatus()) if you have already confirmed that there is
a logged error would be good.

If you can use the same technique for ghes_notify_sea() then it would be
sensible to move the code I added to ghes_nmi_add() to ghes_new() to
save the virtual address for every type of GHES notification.

Sure, I'd like to add it for ghes_notify_sea().

Great. Take my patch and fix it up to cover ghes_notify_sea() as well. I
don't have a system that I can test that on. When you are done I can
retest the ghes_notify_nmi() to check that it still works.

Got it.

Thanks.
Shuai