Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

From: Michal Hocko
Date: Thu Oct 10 2019 - 03:38:38 EST


On Thu 10-10-19 09:27:32, David Hildenbrand wrote:
> On 09.10.19 16:43, Michal Hocko wrote:
> > On Wed 09-10-19 16:24:35, David Hildenbrand wrote:
> >> We should check for pfn_to_online_page() to not access uninitialized
> >> memmaps. Reshuffle the code so we don't have to duplicate the error
> >> message.
> >>
> >> Cc: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
> >> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> >> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> >> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
> >> ---
> >> mm/memory-failure.c | 14 ++++++++------
> >> 1 file changed, 8 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >> index 7ef849da8278..e866e6e5660b 100644
> >> --- a/mm/memory-failure.c
> >> +++ b/mm/memory-failure.c
> >> @@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
> >> if (!sysctl_memory_failure_recovery)
> >> panic("Memory failure on page %lx", pfn);
> >>
> >> - if (!pfn_valid(pfn)) {
> >> + p = pfn_to_online_page(pfn);
> >> + if (!p) {
> >> + if (pfn_valid(pfn)) {
> >> + pgmap = get_dev_pagemap(pfn, NULL);
> >> + if (pgmap)
> >> + return memory_failure_dev_pagemap(pfn, flags,
> >> + pgmap);
> >> + }
> >> pr_err("Memory failure: %#lx: memory outside kernel control\n",
> >> pfn);
> >> return -ENXIO;
> >
> > Don't we need that earlier at hwpoison_inject level?
> >
>
> Theoretically yes, this is another instance. But pfn_to_online_page(pfn)
> alone would not be sufficient as discussed. We would, again, have to
> special-case ZONE_DEVICE via things like get_dev_pagemap() ...
>
> But mm/hwpoison-inject.c:hwpoison_inject() is a pure debug feature either way:
>
> /*
> * Note that the below poison/unpoison interfaces do not involve
> * hardware status change, hence do not require hardware support.
> * They are mainly for testing hwpoison in software level.
> */
>
> So it's not that bad compared to memory_failure() called from real HW or
> drivers/base/memory.c:soft_offline_page_store()/hard_offline_page_store()

Yes, this is just a toy. And yes we need to handle zone device pages
here because a) people likely want to test MCE behavior even on these
pages and b) HW can really trigger MCEs there as well. I was just
pointing that the patch is likely incomplete.

--
Michal Hocko
SUSE Labs