Re: [RFC v3 PATCH 0/5] Solve silent data loss caused by poisoned page cache (shmem/tmpfs)

From: Yang Shi
Date: Tue Oct 12 2021 - 23:09:41 EST


On Tue, Oct 12, 2021 at 7:41 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Thu, Sep 30, 2021 at 02:53:06PM -0700, Yang Shi wrote:
> > Yang Shi (5):
> > mm: hwpoison: remove the unnecessary THP check
> > mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
> > mm: hwpoison: refactor refcount check handling
> > mm: shmem: don't truncate page if memory failure happens
> > mm: hwpoison: handle non-anonymous THP correctly
>
> Today I just noticed one more thing: unpoison path has (unpoison_memory):
>
> if (page_mapping(page)) {
> unpoison_pr_info("Unpoison: the hwpoison page has non-NULL mapping %#lx\n",
> pfn, &unpoison_rs);
> return 0;
> }
>
> I _think_ it was used to make sure we ignore page that was not successfully
> poisoned/offlined before (for anonymous), so raising this question up on
> whether we should make sure e.g. shmem hwpoisoned pages still can be unpoisoned
> for debugging purposes.

Yes, not only mapping, the refcount check is not right if page cache
page is kept in page cache instead of being truncated after this
series. But actually unpoison has been broken since commit
0ed950d1f28142ccd9a9453c60df87853530d778 ("mm,hwpoison: make
get_hwpoison_page() call get_any_page()"). And Naoya said in the
commit "unpoison_memory() is also unchanged because it's broken and
need thorough fixes (will be done later)."

I do have some fixes in my tree to unblock tests and fix unpoison for
this series (just make it work for testing). Naoya may have some ideas
in mind and it is just a debugging feature so I don't think it must be
fixed in this series. It could be done later. I could add a TODO
section in the cover letter to make this more clear.

>
> --
> Peter Xu
>