On Mon, 14 Oct 2024 16:42:40 +0800
Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> wrote:
The memory uncorrected error could be signaled by asynchronous interrupt
(specifically, SPI in arm64 platform), e.g. when an error is detected by
a background scrubber, or signaled by synchronous exception
(specifically, data abort excepction in arm64 platform), e.g. when a CPU
exception
tries to access a poisoned cache line. Currently, both synchronous andmemory_failure() to execute in kworker context.
asynchronous error use memory_failure_queue() to schedule
memory_failure() exectute in kworker context.
(spell check in general)
context, memory_failure():
As a result, when a user-space process is accessing a poisoned data, a
data abort is taken and the memory_failure() is executed in the kworker
context:
//No subject of the following two bullets otherwise.
- will send wrong si_code by SIGBUS signal in early_kill mode, and
- can not kill the user-space in some cases resulting a synchronous
error infinite loop
Issue 1: send wrong si_code in early_kill mode
Since commit a70297d22132 ("ACPI: APEI: set memory failure flags as
MF_ACTION_REQUIRED on synchronous events")', the flag MF_ACTION_REQUIRED
could be used to determine whether a synchronous exception occurs on
ARM64 platform. When a synchronous exception is detected, the kernel is
expected to terminate the current process which has accessed poisoned
page. This is done by sending a SIGBUS signal with an error code
BUS_MCEERR_AR, indicating an action-required machine check error on
read.
However, when kill_proc() is called to terminate the processes who have
the poisoned page mapped, it sends the incorrect SIGBUS error code
BUS_MCEERR_AO because the context in which it operates is not the one
where the error was triggered.
To reproduce this problem:
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1
# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 5 addr 0xffffb0d75000
page not present
Test passed
The si_code (code 5) from einj_mem_uc indicates that it is BUS_MCEERR_AO
error and it is not fact.
After this patch:
# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1
# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 4 addr 0xffffb0d75000
page not present
Test passed
The si_code (code 4) from einj_mem_uc indicates that it is BUS_MCEERR_AR
error as we expected.
Issue 2: a synchronous error infinite loop
If a user-space process, e.g. devmem, a poisoned page which has been set
devmem accesses a poisoned page for which the HWPoison flag is set
HWPosion flag, kill_accessing_process() is called to send SIGBUS to the
current processs with error info. Because the memory_failure() is
executed in the kworker contex, it will just do nothing but return
context
EFAULT. So, devmem will access the posioned page and trigger an
excepction again, resulting in a synchronous error infinite loop. Such
exception
loop may cause platform firmware to exceed some threshold and rebootOne other trivial comment inline.
when Linux could have recovered from this error.
To reproduce this problem:
# STEP 1: inject an UCE error, and kernel will set HWPosion flag for related page
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 4 addr 0xffffb0d75000
page not present
Test passed
# STEP 2: access the same page and it will trigger a synchronous error infinite loop
devmem 0x4092d55b400
To fix above two issues, queue memory_failure() as a task_work so that it runs in
the context of the process that is actually consuming the poisoned data.
Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
Tested-by: Ma Wupeng <mawupeng1@xxxxxxxxxx>
Reviewed-by: Kefeng Wang <wangkefeng.wang@xxxxxxxxxx>
Reviewed-by: Xiaofei Tan <tanxiaofei@xxxxxxxxxx>
Reviewed-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
Whilst this also looks fine to me, there are others who (hopefully)
understand these paths better than me. With that said.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
---
drivers/acpi/apei/ghes.c | 78 +++++++++++++++++++++++-----------------
include/acpi/ghes.h | 3 --
include/linux/mm.h | 1 -
mm/memory-failure.c | 13 -------
4 files changed, 45 insertions(+), 50 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index f2ee28c44d7a..95e9520eb803 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -467,28 +467,42 @@ static void ghes_clear_estatus(struct ghes *ghes,
}
/*
- * Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+ * struct ghes_task_work - for synchronous RAS event
+ *
+ * @twork: callback_head for task work
+ * @pfn: page frame number of corrupted page
+ * @flags: work control flags
+ *
+ * Structure to pass task work to be handled before
+ * returning to user-space via task_work_add().
*/
-static void ghes_kick_task_work(struct callback_head *head)
+struct ghes_task_work {
+ struct callback_head twork;
+ u64 pfn;
+ int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
{
- struct acpi_hest_generic_status *estatus;
- struct ghes_estatus_node *estatus_node;
- u32 node_len;
+ struct ghes_task_work *twcb = container_of(twork, struct ghes_task_work, twork);
+ unsigned long pfn = twcb->pfn;
This local variable is not used consistently. I'd just
drop it in favor of always accessing via twcb->pfn