Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

From: HORIGUCHI NAOYA(堀口 直也)
Date: Tue Mar 16 2021 - 02:43:29 EST


On Fri, Mar 12, 2021 at 11:48:31PM +0000, Luck, Tony wrote:
> >> will memory_failure() find it and unmap it? if succeed, then the current will be
> >> signaled with correct vaddr and shift?
> >
> > That's a very good question. I didn't see a SIGBUS when I first wrote this code,
> > hence all the p->mce_vaddr. But now I'm
> > a) not sure why there wasn't a signal

We have a recent change around this behavior, which might be an answer for you.
See commit 30c9cf492704 ("mm,hwpoison: send SIGBUS to PF_MCE_EARLY processes on action required events").

> > b) if we are to fix the problems noted by AndyL, need to make sure that there isn't a SIGBUS
>
> Tests on upstream kernel today show that memory_failure() is both unmapping the page
> and sending a SIGBUS.

Sorry if I misunderstood the exact problem, but if the point is to allow
user processes to find poisoned virtual address without SIGBUS, one possible
way is to expose hwpoison entry via /proc/pid/pagemap (attached a draft patch below).
With this patch, processes easily find hwpoison entries without any new interface.

>
> My biggest issue with the KERNEL_COPYIN recovery path is that we don't have code
> to mark the page not present while we are still in do_machine_check().

It sounds to me that even if we have such code, it just narrows down the race window
between multiple MCEs on different CPUs. Or does it completely prevent the race?
(Another thread could touch the poisoned page just before the first thread marks
the page non-present?)

> That's resulted
> in recovery working for simple cases where there is a single get_user() call followed by
> an error return if that failed. But more complex cases require more machine checks and
> a touching faith that the kernel will eventually give up trying (spoiler: it sometimes doesn't).
>
> Thanks to the decode of the instruction we do have the virtual address. So we just need
> a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge page) with a write
> of a "not-present" value. Maybe a different poison type from the one we get from
> memory_failure() so that the #PF code can recognize this as a special case and do any
> other work that we avoided because we were in #MC context.

As you answered in another email, p->mce_vaddr is set only on KERNEL_COPYIN case,
then if p->mce_vaddr is available also for generic SRAR MCE (I'm assuming that we are
talking about issues on race between generic SRAR MCE not only for KERNEL_COPYIN case),
that might be helpful, although it might be hard to implement.
And I'm afraid that walking page table could find the wrong virtual address if a process
have multiple entries to the single page. We could send multiple SIGBUSs for such case,
but maybe that's not an optimal solution.

Thanks,
Naoya Horiguchi

----