RE: [PATCH 1/1] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN

From: Zhuo, Qiuxu
Date: Wed Dec 20 2023 - 03:44:43 EST


Hi Naoya Horiguchi,

Thanks for the review.
See the comments below.

> From: Naoya Horiguchi <naoya.horiguchi@xxxxxxxxx>
> Sent: Tuesday, December 19, 2023 10:17 AM
> ...
> > The kernel log (before):
> > [ 1116.862895] Memory failure: 0x4097fa7: recovery action for
> > unsplit thp: Ignored
> >
> > The kernel log (after):
> > [ 793.573536] Memory failure: 0x2100dda: recovery action for unsplit thp: Delayed
> > [ 793.574666] Memory failure: 0x2100dda: split unsplit thp successfully.
>
> I'm unclear about the user-visible benefit of ensuring that the error thp is
> split.
> So could you explain about it?

During our testing, we observed that the hardware-poisoned huge page was mapped
as the victim application's text and was also present in the file cache.
When we restarted the application without the thp having been split, the
restart failed. This was possibly because its text was remapped to the
hardware-poisoned huge page in the file cache, so the application was quickly
killed by another MCE.

After the unsplit thp is re-split successfully (which drops the text mapping),
the application restarts successfully. I'll add this description to the commit
message in v2.

> I think that the raw error page is not unmapped (with hwpoisoned entry)
> after delayed re-splitting, so recovery action seems not complete even with
> this patch.
> So this patch seems to just convert a hwpoisoned unrecovered thp into a
> hwpoisoned unrecovered raw page.

You're correct. Thanks for catching this.
Instead of creating a new work item just to split the thp, I'll use the
existing memory_failure_queue() to re-split the thp in v2, which should make
the recovery action more complete.
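
To illustrate only the general "retry the split later on -EAGAIN" pattern,
here is a small userspace toy model. It is not the v2 code and does not use
kernel APIs: every name below (toy_page, retry_queue, toy_try_split(), and so
on) is made up for the sketch, and the "pins" counter is an assumed stand-in
for the transient extra references that make a split fail temporarily.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define RETRY_CAP 8

/* Toy page: "pins" models transient extra references that block a split. */
struct toy_page {
	unsigned long pfn;
	int pins;
	bool split;
};

/* Fixed-size FIFO of pages waiting for another split attempt. */
struct retry_queue {
	struct toy_page *slot[RETRY_CAP];
	size_t head;
	size_t len;
};

static bool retry_push(struct retry_queue *q, struct toy_page *p)
{
	if (q->len == RETRY_CAP)
		return false;
	q->slot[(q->head + q->len) % RETRY_CAP] = p;
	q->len++;
	return true;
}

static struct toy_page *retry_pop(struct retry_queue *q)
{
	struct toy_page *p;

	if (q->len == 0)
		return NULL;
	p = q->slot[q->head];
	q->head = (q->head + 1) % RETRY_CAP;
	q->len--;
	return p;
}

/*
 * Stand-in for the split attempt: -EAGAIN while pinned, 0 on success.
 * Each failed attempt drops one pin to mimic references going away over time.
 */
static int toy_try_split(struct toy_page *p)
{
	if (p->pins > 0) {
		p->pins--;
		return -EAGAIN;
	}
	p->split = true;
	return 0;
}

/*
 * First handling attempt: split now, or queue the page for a deferred retry.
 * Returns 0 if split, -EAGAIN if deferred, -ENOSPC if the queue is full.
 */
static int toy_handle_failure(struct retry_queue *q, struct toy_page *p)
{
	int ret = toy_try_split(p);

	if (ret == -EAGAIN && !retry_push(q, p))
		return -ENOSPC;
	return ret;
}

/*
 * One pass of the deferred worker: retry each queued page once.
 * Pages that still return -EAGAIN are queued again. Returns pages split.
 */
static int toy_retry_pass(struct retry_queue *q)
{
	size_t n = q->len;
	int done = 0;

	while (n--) {
		struct toy_page *p = retry_pop(q);

		assert(p != NULL);
		if (toy_try_split(p) == 0)
			done++;
		else
			retry_push(q, p);	/* cannot fail: a slot was just freed */
	}
	return done;
}
```

In this model a page with two pins is deferred on the first attempt, stays
queued after one retry pass, and is split on the second pass. As you pointed
out, a real recovery also has to unmap the raw error page after the split,
which the toy model does not attempt to show.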

-Qiuxu