Re: [PATCH v2] mm: hwpoison: coredump: support recovery from dump_user_range()

From: HORIGUCHI NAOYA(堀口 直也)
Date: Thu Apr 20 2023 - 23:14:11 EST


On Thu, Apr 20, 2023 at 11:05:12PM +0800, Kefeng Wang wrote:
>
>
> On 2023/4/20 10:59, Kefeng Wang wrote:
> >
> >
> > On 2023/4/20 10:03, Jane Chu wrote:
> > >
> > > On 4/19/2023 5:03 AM, Kefeng Wang wrote:
> > > >
> > > >
> > > > On 2023/4/19 15:25, HORIGUCHI NAOYA(堀口 直也) wrote:
> > > > > On Tue, Apr 18, 2023 at 05:45:06PM +0800, Kefeng Wang wrote:
> > > > > >
> > > > > >
> > ...
> > > > > > > > @@ -371,6 +372,14 @@ size_t
> > > > > > > > _copy_mc_to_iter(const void *addr, size_t bytes,
> > > > > > > > struct iov_iter *i)
> > > > > > > >    EXPORT_SYMBOL_GPL(_copy_mc_to_iter);
> > > > > > > >    #endif /* CONFIG_ARCH_HAS_COPY_MC */
> > > > > > > > +static void *memcpy_from_iter(struct iov_iter
> > > > > > > > *i, void *to, const void *from,
> > > > > > > > +                 size_t size)
> > > > > > > > +{
> > > > > > > > +    if (iov_iter_is_copy_mc(i))
> > > > > > > > +        return (void *)copy_mc_to_kernel(to, from, size);
> > > > > > >
> > > > > > > Is it helpful to call memory_failure_queue() if
> > > > > > > copy_mc_to_kernel() fails
> > > > > > > due to a memory error?
> > > > > >
> > > > > > For dump_user_range(), the task is dying, if copy incomplete size, the
> > > > > > coredump will fail and task will exit, also memory_failure will
> > > > > > be called by kill_me_maybe(),
> > > > > >
> > > > > >   CPU: 0 PID: 1418 Comm: test Tainted: G   M
> > > > > > 6.3.0-rc5 #29
> > > > > >   Call Trace:
> > > > > >    <TASK>
> > > > > >    dump_stack_lvl+0x37/0x50
> > > > > >    memory_failure+0x51/0x970
> > > > > >    kill_me_maybe+0x5b/0xc0
> > > > > >    task_work_run+0x5a/0x90
> > > > > >    exit_to_user_mode_prepare+0x194/0x1a0
> > > > > >    irqentry_exit_to_user_mode+0x9/0x30
> > > > > >    noist_exc_machine_check+0x40/0x80
> > > > > >    asm_exc_machine_check+0x33/0x40
> > > > >
> > > > > Is this call trace printed out when copy_mc_to_kernel()
> > > > > failed by finding
> > > > > a memory error (or in some testcase using error injection)?
> > > >
> > > > I add dump_stack() into memory_failure() to check whether the poisoned
> > > > memory is called or not, and the call trace shows it do call
> > > > memory_failure(), but I get confused when do the test.
> > > >
> > > > > In my understanding, an MCE should not be triggered when
> > > > > MC-safe copy tries
> > > > > to access to a memory error.  So I feel that we might be talking about
> > > > > different scenarios.
> > > > >
> > > > > When I questioned previously, I thought about the following scenario:
> > > > >
> > > > >    - a process terminates abnormally for any reason like
> > > > > segmentation fault,
> > > > >    - then, kernel tries to create a coredump,
> > > > >    - during this, the copying routine accesses to corrupted
> > > > > page to read.
> > > > >
> > > > Yes, we tested like your described,
> > > >
> > > > 1) inject memory error into a process
> > > > 2) send a SIGABT/SIGBUS to process to trigger the coredump
> > > >
> > > > Without patch, the system panic, and with patch only process exits.
> > > >
> > > > > In this case the corrupted page should not be handled by
> > > > > memory_failure()
> > > > > yet (because otherwise properly handled hwpoisoned page
> > > > > should be ignored
> > > > > by coredump process).  The coredump process would exit with
> > > > > failure with
> > > > > your patch, but then, the corrupted page is still left
> > > > > unhandled and can
> > > > > be reused, so any other thread can easily access to it again.
> > > >
> > > > As shown above, the corrupted page will be handled by
> > > > memory_failure(), but what I'm wondering,
> > > > 1) memory_failure() is not always called
> > > > 2) look at the above call trace, it looks like from asynchronous
> > > >     interrupt, not from synchronous exception, right?
> > > >
> > > > >
> > > > > You can find a few other places (like __wp_page_copy_user
> > > > > and ksm_might_need_to_copy)
> > > > > to call memory_failure_queue() to cope with such unhandled error pages.
> > > > > So does memcpy_from_iter() do the same?
> > > >
> > > > I add some debug print in do_machine_check() on x86:
> > > >
> > > > 1) COW,
> > > >    m.kflags: MCE_IN_KERNEL_RECOV
> > > >    fixup_type: EX_TYPE_DEFAULT_MCE_SAFE
> > > >
> > > >    CPU: 11 PID: 2038 Comm: einj_mem_uc
> > > >    Call Trace:
> > > >     <#MC>
> > > >     dump_stack_lvl+0x37/0x50
> > > >     do_machine_check+0x7ad/0x840
> > > >     exc_machine_check+0x5a/0x90
> > > >     asm_exc_machine_check+0x1e/0x40
> > > >    RIP: 0010:copy_mc_fragile+0x35/0x62
> > > >
> > > >    if (m.kflags & MCE_IN_KERNEL_RECOV) {
> > > >            if (!fixup_exception(regs, X86_TRAP_MC, 0, 0))
> > > >                    mce_panic("Failed kernel mode recovery", &m, msg);
> > > >    }
> > > >
> > > >    if (m.kflags & MCE_IN_KERNEL_COPYIN)
> > > >            queue_task_work(&m, msg, kill_me_never);
> > > >
> > > > There is no memory_failure() called when
> > > > EX_TYPE_DEFAULT_MCE_SAFE, also EX_TYPE_FAULT_MCE_SAFE too,
> > > > so we manually add a memory_failure_queue() to handle with
> > > > the poisoned page.
> > > >
> > > > 2) Coredump,  nothing print about m.kflags and fixup_type,
>
> Sorry,I forget to set coredump file size :(
>
> The coredump do trigger the do_machine_check() with same m.kflags and
> fixup_type like cow
>
>
> > > > with above check, add a memory_failure_queue() or memory_failure() seems
> > > > to be needed for memcpy_from_iter(), but it is totally different from
> > > > the COW scenario
> > > >
>
> so the memcpy_from_iter() from coredump is same as cow scenario.

Okay, thank you for confirmation.

>
> > > >
> > > > Another question, other copy_mc_to_kernel() callers, eg,
> > > > nvdimm/dm-writecache/dax, there are not call memory_failure_queue(),
> > > > should they need a memory_failure_queue(), if so, why not add it into
> > > > do_machine_check() ?
> > >
> >
> > What I mean is that EX_TYPE_DEFAULT_MCE_SAFE/EX_TYPE_FAULT_MCE_SAFE
> > is designed to identify fixups which allow in kernel #MC recovery,
> > that is, the caller of copy_mc_to_kernel() must know the source
> > is a user address, so we could add a MCE_IN_KERNEL_COPYIN fro
> > the MCE_SAFE type.
>
> And I think we need the following change for MCE_SAFE copy to set
> MCE_IN_KERNEL_COPYIN.
>
> >
> > diff --git a/arch/x86/kernel/cpu/mce/severity.c
> > b/arch/x86/kernel/cpu/mce/severity.c
> > index c4477162c07d..63e94484c5d6 100644
> > --- a/arch/x86/kernel/cpu/mce/severity.c
> > +++ b/arch/x86/kernel/cpu/mce/severity.c
> > @@ -293,12 +293,11 @@ static noinstr int error_context(struct mce *m,
> > struct pt_regs *regs)
> >         case EX_TYPE_COPY:
> >                 if (!copy_user)
> >                         return IN_KERNEL;
> > -               m->kflags |= MCE_IN_KERNEL_COPYIN;

This change seems to not related to what you try to fix.
Could this break some other workloads like copying from user address?

> >                 fallthrough;
> >
> >         case EX_TYPE_FAULT_MCE_SAFE:
> >         case EX_TYPE_DEFAULT_MCE_SAFE:
> > -               m->kflags |= MCE_IN_KERNEL_RECOV;
> > +               m->kflags |= MCE_IN_KERNEL_RECOV | MCE_IN_KERNEL_COPYIN;
> >                 return IN_KERNEL_RECOV;
> >
> >         default:
> >
> > then we could drop memory_failure_queue(pfn, flags) from cow/ksm copy,
> > or every Machine Check safe memory copy will need a memory_failure_xx()
> > call.
>
> which help use to kill unneeded memory_failure_queue() call, any comments?

I'm not 100% sure that we can safely use queue_task_work() instead of
memory_failure_queue() (due to the difference between workqueue and task
work, which should be recently discussed in thread [1]). So I prefer to
keep the approach of memory_failure_queue() to keep the impact minimum.

[1] https://lore.kernel.org/lkml/20230417011407.58319-1-xueshuai@xxxxxxxxxxxxxxxxx/T/#u

Thanks,
Naoya Horiguchi

>
>
> >
> > +Thomas,who add the two types, could you share some comments about
> > this,thanks.
> >
> > > In the dax case, if the source address is poisoned, and we do follow
> > > up with memory_failure_queue(pfn, flags), what should the value of
> > > the 'flags' be ?
> >
>
> With above diff change, we don't add a memory_failure_queue() into dax too.