poisoned pages do not play well in the buddy allocator
From: Oscar Salvador
Date: Mon Aug 26 2019 - 06:41:56 EST
Hi,
When analyzing a problem reported by one of our customers, I stumbbled upon an issue
that origins from the fact that poisoned pages end up in the buddy allocator.
Let me break down the stepts that lie to the problem:
1) We soft-offline a page
2) Page gets flagged as HWPoison and is being sent to the buddy allocator.
This is done through set_hwpoison_free_buddy_page().
3) Kcompactd wakes up in order to perform some compaction.
4) compact_zone() will call migrate_pages()
5) migrate_pages() will try to get a new page from compaction_alloc() to migrate to
6) if cc->freelist is empty, compaction_alloc() will call isolate_free_pagesblock()
7) isolate_free_pagesblock only checks for PageBuddy() to assume that a page is OK
to be used to migrate to. Since HWPoisoned page are also PageBuddy, we add
the page to the list. (same problem exists in fast_isolate_freepages()).
The outcome of that is that we end up happily handing poisoned pages in compaction_alloc,
so if we ever got a fault on that page through *_fault, we will return VM_FAULT_HWPOISON,
and the process will be killed.
I first though that I could get away with it by checking PageHWPoison in
{fast_isolate_freepages/isolate_free_pagesblock}, but not really.
It might be that the page we are checking is an order > 0 page, so the first page
might not be poisoned, but the one the follows might be, and we end up in the
same situation.
After some more thought, I really came to the conclusion that HWPoison pages should not
really be in the buddy allocator, as this is only asking for problems.
In this case it is only compaction code, but it could be happening somewhere else,
and one would expect that the pages you got from the buddy allocator are __ready__ to use.
I __think__ that we thought we were safe to put HWPoison pages in the buddy allocator as we
perform healthy checks when getting a page from there, so we skip poisoned pages
Of course, this is not the end of the story, now that someone got a page, if he frees it,
there is a high chance that this page ends up in a pcplist (I saw that).
Unless we are on CONFIG_VM_DEBUG, we do not check for the health of pages got from pcplist,
as we do when getting a page from the buddy allocator.
I checked [1], and it seems that [2] was going towards fixing this kind of issue.
I think it is about time to revamp the whole thing.
@Naoya: I could give it a try if you are busy.
[1] https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@xxxxxxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/1541746035-13408-9-git-send-email-n-horiguchi@xxxxxxxxxxxxx/
--
Oscar Salvador
SUSE L3