Re: [PATCH 5/8] x86/mce: Avoid tail copy when machine check terminated a copy from user
From: Borislav Petkov
Date: Wed Sep 16 2020 - 17:01:20 EST
On Tue, Sep 08, 2020 at 10:55:16AM -0700, Tony Luck wrote:
> In the page fault case it is ok to see if a few more unaligned bytes
> can be copied from the source address. Worst case is that the page fault
> will be triggered again.
>
> Machine checks are more serious. Just give up at the point where the
> main copy loop triggered the #MC and return as if the copy succeeded.
>
> [Tried returning bytes not copied here, but that puts the kernel
> into a loop taking the machine check over and over. I don't know
> at what level some code is retrying]
>
> Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
> ---
> arch/x86/lib/copy_user_64.S | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
> index 5b68e945bf65..1a58946e7c4e 100644
> --- a/arch/x86/lib/copy_user_64.S
> +++ b/arch/x86/lib/copy_user_64.S
> @@ -15,6 +15,7 @@
> #include <asm/asm.h>
> #include <asm/smap.h>
> #include <asm/export.h>
> +#include <asm/trapnr.h>
>
> .macro ALIGN_DESTINATION
> /* check for bad alignment of destination */
> @@ -221,6 +222,7 @@ EXPORT_SYMBOL(copy_user_enhanced_fast_string)
> * Try to copy last bytes and clear the rest if needed.
> * Since protection fault in copy_from/to_user is not a normal situation,
> * it is not necessary to optimize tail handling.
> + * Don't try to copy the tail if machine check happened
> *
> * Input:
> * rdi destination
> @@ -232,10 +234,15 @@ EXPORT_SYMBOL(copy_user_enhanced_fast_string)
> */
> SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail)
> movl %edx,%ecx
> + cmp $X86_TRAP_MC,%eax /* check if X86_TRAP_MC */
> + je 3f
> 1: rep movsb
> 2: mov %ecx,%eax
> ASM_CLAC
> ret
> +3: xorl %eax,%eax /* pretend we succeeded? */

Hmm, but copy_*_user returns the uncopied bytes in eax. Users of this
need to handle the MC case properly but if you return 0, they would
think that they copied everything but there's some trailing stuff they
didn't manage to take.

And it's not like they *should* have to retry to copy it because they
will walk right into the faulty region and cause more MCEs.
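
To make the concern concrete, here is a minimal userspace sketch of the
return-value convention. It is not kernel code: mock_copy_from_user(),
mock_caller(), fault_at and pretend_success are hypothetical names made up
for illustration, and the "fault" is simulated by stopping the copy early.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for copy_from_user(): copies until a simulated
 * fault at offset fault_at and returns the number of bytes NOT copied.
 * With pretend_success set it mimics the patch's "xorl %eax,%eax" path
 * and reports 0 even though the tail was never copied.
 */
static size_t mock_copy_from_user(void *dst, const void *src, size_t len,
				  size_t fault_at, int pretend_success)
{
	size_t done = len < fault_at ? len : fault_at;

	memcpy(dst, src, done);
	if (done < len && pretend_success)
		return 0;		/* caller cannot see the short copy */
	return len - done;		/* usual convention: uncopied bytes */
}

/* Typical caller pattern: any non-zero return is treated as -EFAULT. */
static int mock_caller(char *dst, const char *src, size_t len,
		       size_t fault_at, int pretend_success)
{
	if (mock_copy_from_user(dst, src, len, fault_at, pretend_success))
		return -EFAULT;
	return 0;
}
```

With pretend_success == 0 a short copy makes mock_caller() return -EFAULT.
With pretend_success == 1 it returns 0 and carries on with a destination
buffer whose tail was never written, which is the case being asked about.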

So how is this "I-got-an-MCE-while-copying-from-user" handled on the
higher level?

Your 7/8 says:

"Add code to recover from a machine check while copying data from user
space to the kernel. Action for this case is the same as if the user
touched the poison directly; unmap the page and send a SIGBUS to the
task."

So how are users of copy_*_user() expected to handle the page
disappearing from under them?

--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette