RE: [PATCH v3 02/22] x86,mce: Delete ist_begin_non_atomic()

From: Luck, Tony
Date: Wed Feb 19 2020 - 17:33:41 EST

Next message: Shuah Khan: "[GIT PULL] Kselftest update for Linux 5.6-rc3"
Previous message: Minchan Kim: "Re: [PATCH v6 0/7] introduce memory hinting API for external process"
In reply to: Andy Lutomirski: "Re: [PATCH v3 02/22] x86,mce: Delete ist_begin_non_atomic()"
Next in thread: Andy Lutomirski: "Re: [PATCH v3 02/22] x86,mce: Delete ist_begin_non_atomic()"

> One big question here: are memory failure #MC exceptions synchronous
> or can they be delayed? If we get a memory failure, is it possible
> that the #MC hits some random context and not the actual context where
> the error occurred?

There are a few cases:
1) SRAO (Software recoverable action optional) [Patrol scrub or L3 cache eviction]
These aren't synchronous with any core execution. Using machine check to signal them
was probably a mistake - compounded by it being broadcast :-( We could pick any CPU
to handle it (we actually choose the first to arrive in do_machine_check()). That CPU should
arrange to soft offline the affected page. Every CPU can then return to what it was doing
before.

2) SRAR (Software recoverable action required)
These are synchronous. Starting with Skylake they may be signaled just to the thread
that hit the poison. Earlier generations broadcast.
2a) Hit in ring3 code ... we want to offline the page and SIGBUS the task(s)
2b) memcpy_mcsafe() ... kernel has a recovery path. "Return" to the recovery code instead of to the original RIP.
2c) copy_from_user() ... not implemented yet. We are in kernel, but would like to treat this like case 2a

3) Fatal
Always broadcast. Some bank has MCi_STATUS.PCC==1. System must be shut down.
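
A rough sketch of how the three cases above could be told apart from the MCi_STATUS
bits. This is not the kernel's actual severity grading code: classify_bank(),
classify_event(), is_synchronous() and enum mce_class are hypothetical names made up
for illustration. The bit positions follow the Intel SDM layout of MCi_STATUS, and the
sketch assumes the CPU supports software error recovery, so the S and AR bits are
meaningful.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* MCi_STATUS bit positions (Intel SDM layout). */
#define MCI_STATUS_VAL (1ULL << 63) /* bank contents are valid */
#define MCI_STATUS_UC  (1ULL << 61) /* uncorrected error */
#define MCI_STATUS_PCC (1ULL << 57) /* processor context corrupt */
#define MCI_STATUS_S   (1ULL << 56) /* signaled via machine check */
#define MCI_STATUS_AR  (1ULL << 55) /* action required */

/* Hypothetical classes, ordered from least to most severe. */
enum mce_class { MCE_NONE, MCE_OTHER, MCE_SRAO, MCE_SRAR, MCE_FATAL };

/* Classify one bank's status value into one of the cases above. */
enum mce_class classify_bank(uint64_t status)
{
	if (!(status & MCI_STATUS_VAL))
		return MCE_NONE;	/* nothing logged in this bank */
	if (status & MCI_STATUS_PCC)
		return MCE_FATAL;	/* case 3: context corrupt */
	if (!(status & MCI_STATUS_UC) || !(status & MCI_STATUS_S))
		return MCE_OTHER;	/* corrected or not signaled: not covered here */
	return (status & MCI_STATUS_AR) ? MCE_SRAR : MCE_SRAO; /* case 2 : case 1 */
}

/* The event as a whole is as bad as its worst bank. */
enum mce_class classify_event(const uint64_t *banks, size_t nbanks)
{
	enum mce_class worst = MCE_NONE;

	for (size_t i = 0; i < nbanks; i++) {
		enum mce_class c = classify_bank(banks[i]);

		if (c > worst)
			worst = c;
	}
	return worst;
}

/* Only SRAR is tied to the context that consumed the poison. */
bool is_synchronous(enum mce_class c)
{
	return c == MCE_SRAR;
}
```

The point of is_synchronous() is the answer to the question quoted above: only in the
SRAR case does the interrupted context tell you who hit the error; for SRAO the
interrupted context is effectively random.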

-Tony
