Re: [PATCH 1/2] x86/mce: Only restart instruction after machine check recovery if it is safe

From: Chen Gong
Date: Fri May 11 2012 - 03:19:56 EST


On 2012/5/11 2:01, Tony Luck wrote:
> Section 15.3.1.2 of the software developer manual has this to say
> about the RIPV bit in the IA32_MCG_STATUS register:
>
> RIPV (restart IP valid) flag, bit 0 - Indicates (when set) that
> program execution can be restarted reliably at the instruction
> pointed to by the instruction pointer pushed on the stack when the
> machine-check exception is generated. When clear, the program
> cannot be reliably restarted at the pushed instruction pointer.
>
> We need to save the state of this bit in do_machine_check() and use
> it in mce_notify_process() to force a signal, even if
> memory_failure() says it made a complete recovery (e.g. replaced
> a clean LRU page).
>
> Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
> ---
>  arch/x86/kernel/cpu/mcheck/mce.c |    9 ++++++---
>  1 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 66e1c51..3b8ebdc 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -947,9 +947,10 @@ struct mce_info {
>  	atomic_t		inuse;
>  	struct task_struct	*t;
>  	__u64			paddr;
> +	int			restartable;
>  } mce_info[MCE_INFO_MAX];
>
> -static void mce_save_info(__u64 addr)
> +static void mce_save_info(__u64 addr, int c)
>  {
>  	struct mce_info *mi;
>
> @@ -957,6 +958,7 @@ static void mce_save_info(__u64 addr)
>  		if (atomic_cmpxchg(&mi->inuse, 0, 1) == 0) {
>  			mi->t = current;
>  			mi->paddr = addr;
> +			mi->restartable = c;
>  			return;
>  		}
>  	}
> @@ -1136,7 +1138,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  		mce_panic("Fatal machine check on current CPU", &m, msg);
>  	if (worst == MCE_AR_SEVERITY) {
>  		/* schedule action before return to userland */
> -		mce_save_info(m.addr);
> +		mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
>  		set_thread_flag(TIF_MCE_NOTIFY);
>  	} else if (kill_it) {
>  		force_sig(SIGBUS, current);
> @@ -1185,7 +1187,8 @@ void mce_notify_process(void)
>
>  	pr_err("Uncorrected hardware memory error in user-access at %llx",
>  		 mi->paddr);
> -	if (memory_failure(pfn, MCE_VECTOR, MF_ACTION_REQUIRED) < 0) {
> +	if (memory_failure(pfn, MCE_VECTOR, MF_ACTION_REQUIRED) < 0 ||
> +			   mi->restartable == 0) {
>  		pr_err("Memory error not recovered");
>  		force_sig(SIGBUS, current);
>  	}

How about using the following condition instead, to reduce execution time?

	if (mi->restartable == 0 ||
	    memory_failure(pfn, MCE_VECTOR, MF_ACTION_REQUIRED) < 0)

If the instruction cannot be restarted anyway, can the recovery
operation be skipped?
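
To make the difference between the two orderings concrete, here is a
minimal user-space sketch, not kernel code. mock_memory_failure() and
the two must_signal_*() helpers are hypothetical stand-ins invented for
illustration; the mock only counts how often it is called and always
reports success. The sketch relies only on the short-circuit behaviour
of the C `||` operator.

```c
#include <assert.h>

/* Hypothetical stand-in for memory_failure(); it only counts calls. */
static int mock_memory_failure_calls;

static int mock_memory_failure(unsigned long pfn)
{
	(void)pfn;
	mock_memory_failure_calls++;
	return 0;	/* assumption: recovery always reports success */
}

/* Ordering in the patch: recovery is attempted first, then the flag is checked. */
static int must_signal_patch_order(unsigned long pfn, int restartable)
{
	return mock_memory_failure(pfn) < 0 || restartable == 0;
}

/* Ordering suggested here: the flag check short-circuits the recovery call. */
static int must_signal_suggested_order(unsigned long pfn, int restartable)
{
	return restartable == 0 || mock_memory_failure(pfn) < 0;
}

/* Returns how many times the mock recovery ran for a non-restartable error. */
static int calls_for_non_restartable(int (*decide)(unsigned long, int))
{
	int sig;

	mock_memory_failure_calls = 0;
	sig = decide(0x1234, 0);	/* restartable == 0 */
	assert(sig == 1);		/* both orderings force the signal */
	return mock_memory_failure_calls;
}
```

Both orderings reach the same decision about the signal; they differ
only in whether the recovery routine runs when the restartable flag is
clear. Note that the real memory_failure() also has side effects on the
affected page, so skipping the call changes more than execution time.
Whether that is acceptable is exactly the open question above.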