Re: [PATCH 1/4] mm/memory-failure.c: fix race with changing page compound again

From: HORIGUCHI NAOYA(堀口 直也)
Date: Tue Mar 08 2022 - 01:57:06 EST


On Mon, Mar 07, 2022 at 11:07:32AM -0800, Mike Kravetz wrote:
...
> >
> >> +
> >> + /**
> >> + * The page could have changed compound pages due to race window.
> >> + * If this happens just bail out.
> >> + */
> >> + if (!PageHuge(p) || compound_head(p) != head) {
> >> + action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED);
> >> + res = -EBUSY;
> >> + goto out;
> >> + }
> >
> > Let me have one comment on the diff. The result code MF_MSG_DIFFERENT_COMPOUND
> > might not fit when PageHuge is false in the check (because it's no longer a
> > compound page). Maybe you may invent another result code, or changes
> > MF_MSG_DIFFERENT_COMPOUND (for example) to MF_MSG_DIFFERENT_PAGE_SIZE?
> >
>
> Suppose we do encounter this race. Also, suppose p != head.
> At the beginning of memory_failure_hugetlb, we do:
>
> struct page *head = compound_head(p);
> ...
> if (TestSetPageHWPoison(head))
>
> So, it could be that we set Poison in the 'head' page but the error was really
> in another page. Is that correct?
>
> Now with the race, head is not a huge page and the pages could even be on
> buddy. Does this mean we could have poison set on the wrong page in buddy?

Correct, the race might be rare, but this needs a fix.
I think that setting PageHWPoison first (before taking refcount and page lock)
is the root of all related problems. This behavior came from the original
concept in hwpoison that preventing consumption of corrupted data is the first
priority. But now I think that this makes no sense if we have this kind of bugs.

I'll try to write a patch for this (I only fix memory_failure_hugetlb() first,
but generic path should be fixed later).
Thank you for pointing out.

- Naoya Horiguchi