Re: [PATCH v2 4/8] mm/isolation: close the two race problems related to pageblock isolation

From: Vlastimil Babka
Date: Thu Aug 07 2014 - 10:34:51 EST


On 08/06/2014 09:18 AM, Joonsoo Kim wrote:
We get the migratetype of the page being freed without holding the zone
lock, so it can be racy. There are two cases of this race.

1. pages are added to the isolate buddy list after the original
migratetype has been restored.
2. pages are added to a normal buddy list while the pageblock is isolated.

If case 1 happens, we can't allocate freepages from the isolate buddy list
until the next pageblock isolation occurs.
In case 2, pages could be merged with pages on the isolate buddy list and
end up on a normal buddy list. This makes freepage counting incorrect
and breaks the property of pageblock isolation.

One solution to this problem is to check the pageblock migratetype while
holding the zone lock in __free_one_page(). I posted that before, but
it wasn't welcomed since it needs a hook in the zone lock critical
section on the freepath.

This is another solution to the problem; it imposes most of the overhead
on the pageblock isolation logic. The solution works as follows.

1. Extend the irq-disabled period on the freepath so that
get_pfnblock_migratetype() is called with irqs disabled. With this, we
can be sure that pages freed after a certain synchronization point will
see the modified pageblock migratetype, so we don't need to hold the zone
lock to get the correct pageblock migratetype. Although this extends the
irq-disabled period on the freepath, I guess the cost is marginal and
better than adding a hook in the zone lock critical section.

2. #1 requires an IPI for synchronization, and we can't hold the zone lock

It would be better to explain here that the synchronization point is pcplists draining.

while processing the IPI. During this window, some pages could be moved
from a buddy list to a pcp list on the page allocation path and later be
moved back from the pcp list to a buddy list. By then, the page could sit in an isolated

It is difficult to understand the problem just by reading this. I guess the timelines you included while explaining the problem to me would help here :)

pageblock, so a hook would be required in free_pcppages_bulk() to prevent

More clearly, a recheck for pageblock's migratetype would be needed in free_pcppages_bulk(), which would again impose overhead outside isolation.

misplacement. To remove this possibility, the pcp lists are disabled and
drained during isolation. This guarantees that no page sits on any cpu's
pcp list during isolation, so the misplacement problem can't happen.

Note that this doesn't fix the freepage counting problem. Fixing that
needs more logic; the following patches will do it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
---
mm/internal.h | 2 ++
mm/page_alloc.c | 27 ++++++++++++++++++++-------
mm/page_isolation.c | 45 +++++++++++++++++++++++++++++++++------------
3 files changed, 55 insertions(+), 19 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index a1b651b..81b8884 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -108,6 +108,8 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
/*
* in mm/page_alloc.c
*/
+extern void zone_pcp_disable(struct zone *zone);
+extern void zone_pcp_enable(struct zone *zone);
extern void __free_pages_bootmem(struct page *page, unsigned int order);
extern void prep_compound_page(struct page *page, unsigned long order);
#ifdef CONFIG_MEMORY_FAILURE
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e1e344..4517b1d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -726,11 +726,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
- if (likely(!is_migrate_isolate_page(page))) {
- __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
- if (is_migrate_cma(mt))
- __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
- }
+ __mod_zone_freepage_state(zone, 1, mt);

Could be worth mentioning that this can now be removed as it was an incomplete attempt to fix freepage counting, but didn't address the misplacement.

} while (--to_free && --batch_free && !list_empty(list));
}
spin_unlock(&zone->lock);
@@ -789,8 +785,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
if (!free_pages_prepare(page, order))
return;

- migratetype = get_pfnblock_migratetype(page, pfn);
local_irq_save(flags);
+ migratetype = get_pfnblock_migratetype(page, pfn);
__count_vm_events(PGFREE, 1 << order);
set_freepage_migratetype(page, migratetype);
free_one_page(page_zone(page), page, pfn, order, migratetype);
@@ -1410,9 +1406,9 @@ void free_hot_cold_page(struct page *page, bool cold)
if (!free_pages_prepare(page, 0))
return;

+ local_irq_save(flags);
migratetype = get_pfnblock_migratetype(page, pfn);
set_freepage_migratetype(page, migratetype);
- local_irq_save(flags);
__count_vm_event(PGFREE);

Maybe add comments to these two to make it clear that this cannot be moved outside of the irq disabled part, in case anyone considers it (again) in the future?

@@ -55,20 +56,32 @@ int set_migratetype_isolate(struct page *page, bool skip_hwpoisoned_pages)
*/

out:
- if (!ret) {
- unsigned long nr_pages;
- int migratetype = get_pageblock_migratetype(page);
+ if (ret) {
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return ret;
+ }
- set_pageblock_migratetype(page, MIGRATE_ISOLATE);
- nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
+ migratetype = get_pageblock_migratetype(page);
+ set_pageblock_migratetype(page, MIGRATE_ISOLATE);
+ spin_unlock_irqrestore(&zone->lock, flags);

- __mod_zone_freepage_state(zone, -nr_pages, migratetype);
- }
+ zone_pcp_disable(zone);
+
+ /*
+ * After this point, freed pages will see MIGRATE_ISOLATE as
+ * their pageblock migratetype on all cpus. And pcp list has
+ * no free page.
+ */
+ on_each_cpu(drain_local_pages, NULL, 1);

Is there any difference between drain_all_pages() and this, or why didn't you use drain_all_pages()?
