Hi Shuai:
We also encountered this problem in production environment, and tried to
solve it by dividing synchronous and asynchronous error handling into different
paths: task work for synchronous error and workqueue for asynchronous error.
The main challenge is how to distinguish synchronous errors in kernel first
mode through APEI, a related discussion is here.[1]
@@ -978,14 +979,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
len = cper_estatus_len(estatus);
node_len = GHES_ESTATUS_NODE_LEN(len);
- task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+ corruption_page_pending = ghes_do_proc(estatus_node->ghes, estatus, true);
if (!ghes_estatus_cached(estatus)) {
generic = estatus_node->generic;
if (ghes_print_estatus(NULL, generic, estatus))
ghes_estatus_cache_add(generic, estatus);
}
In the case of your patch, it is inappropriate to assume that ghes_proc_in_irq() is only
called to handle synchronous error. Firmware could notify all synchronous and asynchronous
error signals to kernel through NMI notification, e.g. SDEI. In this case, asynchronous
error will be treated as synchronous error.
Our colleague Yingwen has submitted a proposal to extend acpi_hest_generic_data::flag (bit 8)
to indicate that the error is a synchronous[2]. Personally speaking, it is a more general
solution and completely solves the problem.
Background:
In ARM world, two type events (Sync/Async) from hardware IP need OS/VMM take different actions.
Current CPER memory error record is not able to distinguish sync/async type event right now.
Current OS/VMM need to take extra actions beyond CPER which is heavy burden to identify the
two type events
Sync event (e.g. CPU consume poisoned data) --> Firmware -> CPER error log --> OS/VMM take recovery action.
Async event (e.g. Memory controller detect UE event) --> Firmware --> CPER error log --> OS take page action.
Proposal:
- In section description Flags field(UEFI spec section N.2, add sync flag as below. OS/VMM
could depend on this flag to distinguish sync/async events.
- Bit8 – sync flag; if set this flag indicates that this event record is synchronous(e.g.
cpu core consumes poison data, then cause instruction/data abort); if not set, this event record is asynchronous.
Best regards,
Yingwen Chen
A RFC patch set based on above proposal is here[3].