Re: [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand

From: Mel Gorman
Date: Wed Sep 30 2015 - 04:28:26 EST


On Tue, Sep 29, 2015 at 02:01:41PM -0700, Andrew Morton wrote:
> > ...
> >
> > +/*
> > + * Reserve a pageblock for exclusive use of high-order atomic allocations if
> > + * there are no empty page blocks that contain a page with a suitable order
> > + */
> > +static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
> > +				unsigned int alloc_order)
> > +{
> > +	int mt;
> > +	unsigned long max_managed, flags;
> > +
> > +	/*
> > +	 * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
> > +	 * Check is race-prone but harmless.
> > +	 */
> > +	max_managed = (zone->managed_pages / 100) + pageblock_nr_pages;
> > +	if (zone->nr_reserved_highatomic >= max_managed)
> > +		return;
> > +
> > +	/* Yoink! */
> > +	spin_lock_irqsave(&zone->lock, flags);
> > +
> > +	mt = get_pageblock_migratetype(page);
> > +	if (mt != MIGRATE_HIGHATOMIC &&
> > +			!is_migrate_isolate(mt) && !is_migrate_cma(mt)) {
>
> Do the above checks really need to be inside zone->lock? I don't think
> get_pageblock_migratetype() needs zone->lock? (Actually I suspect it
> does, but we don't...)
>

get_pageblock_migratetype() does not require zone->lock but it's race-prone
without it and there have been cases (CMA, isolation) that cared. In this
case, without the lock two parallel allocations may try to reserve the same
block, so we would have to recheck the type under the lock anyway to avoid
corrupting nr_reserved_highatomic. As the move between free lists absolutely
requires zone->lock, it's best to just do the full operation under the lock.
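
To make the race concrete, a rough interleaving (illustrative only, assuming
the migratetype check were moved outside the lock) would look something like:

	CPU A					CPU B
	sees mt != MIGRATE_HIGHATOMIC
						sees mt != MIGRATE_HIGHATOMIC
	lock, set MIGRATE_HIGHATOMIC,
	nr_reserved_highatomic += pageblock_nr_pages,
	unlock
						lock, set MIGRATE_HIGHATOMIC again,
						nr_reserved_highatomic += pageblock_nr_pages,
						unlock

Two reservations get accounted but only one pageblock was actually converted,
so nr_reserved_highatomic drifts away from what is on the free lists. Keeping
the migratetype check (and now the limit recheck) under zone->lock closes that
window.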

> > +		zone->nr_reserved_highatomic += pageblock_nr_pages;
>
> And I don't think it would hurt to recheck
> nr_reserved_highatomic>=max_managed after taking zone->lock, to plug
> that race. We've had VM we-dont-care races in the past which ended up
> causing problems in rare circumstances...
>

That makes sense, patch is below.

> > +		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
> > +		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
> > +	}
> > +	spin_unlock_irqrestore(&zone->lock, flags);
> > +}
> > +
> > +/*
> > + * Used when an allocation is about to fail under memory pressure. This
> > + * potentially hurts the reliability of high-order allocations when under
> > + * intense memory pressure but failed atomic allocations should be easier
> > + * to recover from than an OOM.
> > + */
> > +static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
> > +{
> > +	struct zonelist *zonelist = ac->zonelist;
> > +	unsigned long flags;
> > +	struct zoneref *z;
> > +	struct zone *zone;
> > +	struct page *page;
> > +	int order;
> > +
> > +	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
> > +								ac->nodemask) {
> > +		/* Preserve at least one pageblock */
> > +		if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
> > +			continue;
> > +
> > +		spin_lock_irqsave(&zone->lock, flags);
> > +		for (order = 0; order < MAX_ORDER; order++) {
> > +			struct free_area *area = &(zone->free_area[order]);
> > +
> > +			if (list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
> > +				continue;
> > +
> > +			page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
> > +						struct page, lru);
> > +
> > +			zone->nr_reserved_highatomic -= pageblock_nr_pages;
>
> So if the race happened here, zone->nr_reserved_highatomic underflows?
>

It shouldn't. If there are entries on the MIGRATE_HIGHATOMIC free list then
they should already be accounted for in nr_reserved_highatomic. However, I see
your point, as a spill from the per-cpu lists has caused us problems in the past.
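
For what it's worth, since nr_reserved_highatomic is an unsigned long (so an
unexpected extra decrement would wrap to a huge value rather than going
negative), the fix below does not trust the accounting and simply clamps the
subtraction:

	/* never subtract more than is currently accounted */
	zone->nr_reserved_highatomic -= min(pageblock_nr_pages,
				zone->nr_reserved_highatomic);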

---8<---
From: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Subject: [PATCH] mm, page_alloc: Reserve pageblocks for high-order atomic
allocations on demand -fix

nr_reserved_highatomic is checked outside the zone lock, so there is a race
whereby the reserve can grow larger than the limit allows. This patch rechecks
the count under the zone lock.

During unreserving, there is a possibility the count could underflow if there
were ever a race between per-cpu drains, reserving and unreserving. This patch
adds a comment about the potential race and protects against underflow.

These are two fixes to the mmotm patch
mm-page_alloc-reserve-pageblocks-for-high-order-atomic-allocations-on-demand.patch.
They are not separate patches and should both be folded into it.

Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
---
mm/page_alloc.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 811d6fc4ad5d..b1892dc51b55 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1633,9 +1633,13 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
 	if (zone->nr_reserved_highatomic >= max_managed)
 		return;
 
-	/* Yoink! */
 	spin_lock_irqsave(&zone->lock, flags);
 
+	/* Recheck the nr_reserved_highatomic limit under the lock */
+	if (zone->nr_reserved_highatomic >= max_managed)
+		goto out_unlock;
+
+	/* Yoink! */
 	mt = get_pageblock_migratetype(page);
 	if (mt != MIGRATE_HIGHATOMIC &&
 			!is_migrate_isolate(mt) && !is_migrate_cma(mt)) {
@@ -1643,6 +1647,8 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
 		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
 		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
 	}
+
+out_unlock:
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
@@ -1677,7 +1683,14 @@ static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
 			page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
 						struct page, lru);
 
-			zone->nr_reserved_highatomic -= pageblock_nr_pages;
+			/*
+			 * It should never happen but changes to locking could
+			 * inadvertently allow a per-cpu drain to add pages
+			 * to MIGRATE_HIGHATOMIC while unreserving so be safe
+			 * and watch for underflows.
+			 */
+			zone->nr_reserved_highatomic -= min(pageblock_nr_pages,
+					zone->nr_reserved_highatomic);
 
 			/*
 			 * Convert to ac->migratetype and avoid the normal
--