Re: [PATCH v2 2/2] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN

From: Andrew Morton
Date: Fri Dec 22 2023 - 14:42:43 EST


On Fri, 22 Dec 2023 14:27:06 +0800 Qiuxu Zhuo <qiuxu.zhuo@xxxxxxxxx> wrote:

> During the process of splitting a hw-poisoned huge page, it is possible
> for the reference count of the huge page to be increased by the threads
> within the affected process, leading to a failure in splitting the
> hw-poisoned huge page with an error code of -EAGAIN.
>
> This issue can be reproduced by injecting a memory error into a
> multi-threaded process such that the error lands within a huge page.
> The call path with the returned -EAGAIN during the testing is shown below:
>
>   memory_failure()
>     try_to_split_thp_page()
>       split_huge_page()
>         split_huge_page_to_list() {
>           ...
>           Step A: can_split_folio() - Checked that the thp can be split.
>           Step B: unmap_folio()
>           Step C: folio_ref_freeze() - Failed and returned -EAGAIN.
>           ...
>         }
>
> The testing logs indicated that some huge pages were split successfully
> via the call path above (Step C was successful for these huge pages).
> However, some huge pages failed to split due to a failure at Step C, and
> it was observed that the reference count of the huge page increased between
> Step A and Step C.
>
> Testing has shown that after receiving -EAGAIN, simply re-splitting the
> hw-poisoned huge page within memory_failure() always results in the same
> -EAGAIN. This is likely because memory_failure() is executed in the
> currently affected process. Before this process exits memory_failure() and
> is terminated, its threads could increase the reference count of the
> hw-poisoned page.
>
> Furthermore, when the hw-poisoned huge page had been mapped for the victim
> application's text, was present in the file cache, and failed to be
> split, an attempt to restart the process without splitting the
> hw-poisoned huge page also failed. This was likely because its text was
> remapped to the hw-poisoned huge page from the file cache, leading to its
> swift termination due to another MCE.

So we're hoping that when the worker runs to split the page, the
process and its threads have exited. What guarantees this timing?

And we're hoping that the worker has split the page before userspace
attempts to restart the process. What guarantees this timing?

All this reliance upon fortunate timing sounds rather unreliable,
doesn't it?
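
For reference, the Step A to Step C window described in the changelog can
be modelled in userspace. This is only a rough sketch under stated
assumptions, not the kernel code: struct fake_folio, fake_ref_freeze() and
fake_split() are hypothetical names, and the freeze is modelled as a plain
compare-and-exchange of the reference count to zero.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a huge page; not a kernel structure. */
struct fake_folio {
	atomic_int refcount;
};

/*
 * Models the freeze step: succeed only if the count still equals the
 * expected value, and in that case atomically set it to 0.
 */
static bool fake_ref_freeze(struct fake_folio *folio, int expected)
{
	int old = expected;

	return atomic_compare_exchange_strong(&folio->refcount, &old, 0);
}

/*
 * Models Steps A to C. 'racing_refs' simulates references taken by other
 * threads of the process after the Step A check has passed.
 */
static int fake_split(struct fake_folio *folio, int expected, int racing_refs)
{
	/* Step A: the page looks splittable. */
	if (atomic_load(&folio->refcount) != expected)
		return -EAGAIN;

	/* Step B window: other threads grab references. */
	atomic_fetch_add(&folio->refcount, racing_refs);

	/* Step C: the freeze fails if the count moved since Step A. */
	if (!fake_ref_freeze(folio, expected))
		return -EAGAIN;

	return 0;
}
```

With racing_refs == 0 the model splits successfully. With any extra
reference taken inside the window it returns -EAGAIN, and it will keep
doing so for as long as something keeps taking references.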