Re: [PATCH v2 3/5] mm/memory-failure: improve memory failure action_result messages

From: Jane Chu
Date: Mon May 20 2024 - 14:31:12 EST


On 5/16/2024 2:46 AM, Oscar Salvador wrote:

On Fri, May 10, 2024 at 12:26:00AM -0600, Jane Chu wrote:
Added two explicit MF_MSG messages describing failure in get_hwpoison_page.
Attemped to document the definition of various action names, and made a few
adjustment to the action_result() calls.

Signed-off-by: Jane Chu <jane.chu@xxxxxxxxxx>
...
+/*
+ * MF_IGNORED - Either the m-f() handler did nothing to recover, or it did
"or if it"
+ * something, then caught in a race condition which renders the effort sort
"it was caught"

I would also add to MF_IGNORED that we mark the page hwpoisoned anyway.
Thanks!  Will fix, and add comments.
@@ -1018,7 +1034,7 @@ static int me_unknown(struct page_state *ps, struct page *p)
{
pr_err("%#lx: Unknown page state\n", page_to_pfn(p));
unlock_page(p);
- return MF_FAILED;
+ return MF_IGNORED;
}
I was confused because I saw you replaced all MF_MSG_UNKNOWN, so I
wondered how we can end up here until I saw this is a catch-all in case
we fail to make sense of the page->flags.
While you are improving this, I would suggest to add a little comment
above the function explaining how we can reach it.
Yes, it's a catch-all versus something that suppose to happen, will add comments.

/*
@@ -2055,6 +2071,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
if (flags & MF_ACTION_REQUIRED) {
folio = page_folio(p);
res = kill_accessing_process(current, folio_pfn(folio), flags);
+ return action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
I do not understand why are you doing this.

First of all, why is this considered a failure? We did not fail this
time, did we? We went right ahead and kill the process which was
re-accessing the hwpoisoned page. Is that considered a failure?

Given that the goal for the m-f() handler is to isolate the poisoned page, so in this case,

even if the kernel has killed the page, nothing else could be done to prevent the page from

being accessed and triggering another MCE, which is, I suppose more sever than triggering a page fault.


Second, you are know supressing -EHWPOISON with whatever action_result()
will gives us, which judging from the actual code would be -EBUSY?
I do not think that that is right, and we should be returning
-EHWPOISON. Or whatever error code kill_accessing_process() gives us.


@@ -2231,6 +2248,7 @@ int memory_failure(unsigned long pfn, int flags)
res = kill_accessing_process(current, pfn, flags);
if (flags & MF_COUNT_INCREASED)
put_page(p);
+ action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
This is not coherent with what you did in try_memory_failure_hugetlb
for MF_MSG_ALREADY_POISONED, I __think__ that in there we should be
doing the same as we do here.


Good catch, thanks.  In both cases, 'res' from kill_accessing_process() should be returned.

Thanks!

-jane