Re: [PATCH] mm/hwpoison: fix race between soft_offline_page and unpoison_memory

From: Wanpeng Li
Date: Fri Aug 14 2015 - 05:01:49 EST

On 8/14/15 4:38 PM, Naoya Horiguchi wrote:
> On Fri, Aug 14, 2015 at 03:59:21PM +0800, Wanpeng Li wrote:
>> On 8/14/15 3:54 PM, Wanpeng Li wrote:
>>> [...]
>>>> OK, then let me rethink handling the race in unpoison_memory().
>>>> Currently properly contained/hwpoisoned pages should have page refcount 1
>>>> (when the memory error hits LRU pages or hugetlb pages) or refcount 0
>>>> (when the memory error hits the buddy page.) And current unpoison_memory()
>>>> implicitly assumes this because otherwise the unpoisoned page has no place
>>>> to go and it's just leaked.
>>>> So to avoid the kernel panic, adding prechecks of refcount and mapcount,
>>>> so that only unpoisonable pages get unpoisoned, looks OK to me.
>>>> The page under soft offlining always has refcount >=2 and/or mapcount > 0,
>>>> so such pages should be filtered out.
>>>> Here's a patch. In my testing (run soft offline stress testing then repeat
>>>> unpoisoning in background,) the reported (or similar) bug doesn't happen.
>>>> Can I have your comments?
>>> As page_action() prints out, the page may still be referenced by some users,
>>> yet PageHWPoison has already been set. So you will leak many poisoned pages.
>> Anyway, the bug is still there.
>> [ 944.387559] BUG: Bad page state in process expr pfn:591e3
>> [ 944.393053] page:ffffea00016478c0 count:-1 mapcount:0 mapping: (null) index:0x2
>> [ 944.401147] flags: 0x1fffff80000000()
>> [ 944.404819] page dumped because: nonzero _count
> Hmm, no luck :(
> To investigate more, I'd like to test exactly the same kernel as yours, so
> could you share the kernel info (.config, base kernel, and what patches
> you applied)? Or push your tree somewhere like GitHub?
> # if you like, sending to me privately is fine.
> I think that I tested v4.2-rc6 + <your recent 7 hwpoison patches> +
> "mm/hwpoison: fix race between soft_offline_page and unpoison_memory",
> but I hit some conflicts when applying your patches for some reason,
> so we might be testing on different kernels.

I don't have a special config or tree. The latest mmotm has already
merged my recent 8 hwpoison patches, so you can test based on it.

Wanpeng Li

> Mine is here:
> v4.2-rc6/fix_race_soft_offline_unpoison
> Thanks,
> Naoya Horiguchi
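
For reference, the refcount/mapcount precheck proposed in the quoted text
above can be sketched as a small userspace model. This is only an
illustration of the stated rule (a contained hwpoisoned page should have
refcount 1 or 0 and no mappings; a page under soft offlining has
refcount >= 2 and/or mapcount > 0). It is not the actual patch, and
struct page_model and can_unpoison() are hypothetical names that do not
exist in the kernel.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace stand-in for the per-page state discussed in
 * the thread. This is NOT the kernel's struct page.
 */
struct page_model {
	int refcount;	/* models what page_count() would report */
	int mapcount;	/* models what page_mapcount() would report */
	bool hwpoison;	/* models what PageHWPoison() would report */
};

/*
 * Sketch of the proposed precheck: only allow unpoisoning when the page
 * looks like a properly contained hwpoisoned page, i.e. refcount 1
 * (error hit an LRU or hugetlb page) or refcount 0 (error hit a buddy
 * page), and no remaining mappings.
 */
static bool can_unpoison(const struct page_model *p)
{
	if (!p->hwpoison)
		return false;	/* nothing to unpoison */
	if (p->mapcount > 0)
		return false;	/* still mapped, e.g. under soft offlining */
	if (p->refcount > 1)
		return false;	/* extra references, e.g. under soft offlining */
	return true;
}
```

With this model, a page in the middle of soft offlining (refcount 2 or
mapcount 1) is rejected, while a contained LRU page (refcount 1,
mapcount 0) passes. As the bug report in this thread shows, such a
precheck by itself did not make the "Bad page state" report go away.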

