Re: [PATCH v2 3/5] mm/memory-failure: improve memory failure action_result messages

From: Oscar Salvador
Date: Thu May 16 2024 - 05:46:26 EST


On Fri, May 10, 2024 at 12:26:00AM -0600, Jane Chu wrote:
> Added two explicit MF_MSG messages describing failure in get_hwpoison_page.
> Attemped to document the definition of various action names, and made a few
> adjustment to the action_result() calls.
>
> Signed-off-by: Jane Chu <jane.chu@xxxxxxxxxx>
..
> +/*
> + * MF_IGNORED - Either the m-f() handler did nothing to recover, or it did
"or if it "
> + * something, then caught in a race condition which renders the effort sort
"it was caught"

I would also add to MF_IGNORED that we mark the page hwpoisoned anyway.

> @@ -1018,7 +1034,7 @@ static int me_unknown(struct page_state *ps, struct page *p)
> {
> pr_err("%#lx: Unknown page state\n", page_to_pfn(p));
> unlock_page(p);
> - return MF_FAILED;
> + return MF_IGNORED;
> }

I was confused because I saw you replaced all MF_MSG_UNKNOWN, so I
wondered how we can end up here until I saw this is a catch-all in case
we fail to make sense of the page->flags.
While you are improving this, I would suggest to add a little comment
above the function explaining how we can reach it.

> /*
> @@ -2055,6 +2071,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
> if (flags & MF_ACTION_REQUIRED) {
> folio = page_folio(p);
> res = kill_accessing_process(current, folio_pfn(folio), flags);
> + return action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);

I do not understand why are you doing this.

First of all, why is this considered a failure? We did not fail this
time, did we? We went right ahead and kill the process which was
re-accessing the hwpoisoned page. Is that considered a failure?

Second, you are know supressing -EHWPOISON with whatever action_result()
will gives us, which judging from the actual code would be -EBUSY?
I do not think that that is right, and we should be returning
-EHWPOISON. Or whatever error code kill_accessing_process() gives us.


> @@ -2231,6 +2248,7 @@ int memory_failure(unsigned long pfn, int flags)
> res = kill_accessing_process(current, pfn, flags);
> if (flags & MF_COUNT_INCREASED)
> put_page(p);
> + action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);

This is not coherent with what you did in try_memory_failure_hugetlb
for MF_MSG_ALREADY_POISONED, I __think__ that in there we should be
doing the same as we do here.



--
Oscar Salvador
SUSE Labs