Re: [PATCH 3/3] mm, page_allocator: Only use per-cpu allocator for irq-safe requests

From: Vlastimil Babka
Date: Thu Jan 12 2017 - 12:05:39 EST


On 01/12/2017 11:43 AM, Mel Gorman wrote:
Many workloads that allocate pages are not handling an interrupt at a
time. As allocation requests may be from IRQ context, it's necessary to
disable/enable IRQs for every page allocation. This cost is the bulk
of the free path but also a significant percentage of the allocation
path.

This patch alters the locking and checks such that only irq-safe allocation
requests use the per-cpu allocator. All others acquire the irq-safe
zone->lock and allocate from the buddy allocator. It relies on disabling
preemption to safely access the per-cpu structures. It could be slightly
modified to avoid soft IRQs using it but it's not clear it's worthwhile.

This modification may slow allocations from IRQ context slightly but the main
gain from the per-cpu allocator is that it scales better for allocations
from multiple contexts. There is an implicit assumption that intensive
allocations from IRQ contexts on multiple CPUs from a single NUMA node are
rare and that the fast majority of scaling issues are encountered in !IRQ
contexts such as page faulting. It's worth noting that this patch is not
required for a bulk page allocator but it significantly reduces the overhead.

The following is results from a page allocator micro-benchmark. Only
order-0 is interesting as higher orders do not use the per-cpu allocator


<snip nice results>


Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Acked-by: Hillf Danton <hillf.zj@xxxxxxxxxxxxxxx>
Acked-by: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>

Very promising! But I have some worries. Should we put something like VM_BUG_ON(in_interrupt()) into free_hot_cold_page() and rmqueue_pcplist() to catch future potential misuses and also document this requirement? Also free_hot_cold_page() has other call sites besides __free_pages() and I'm not sure if those are all guaranteed to be !IRQ? E.g. free_hot_cold_page_list() which is called by release_page() which uses irq-safe lock operations...

Smaller nit below:

@@ -2453,8 +2450,8 @@ void free_hot_cold_page(struct page *page, bool cold)

migratetype = get_pfnblock_migratetype(page, pfn);
set_pcppage_migratetype(page, migratetype);
- local_irq_save(flags);
- __count_vm_event(PGFREE);
+ preempt_disable();
+ count_vm_event(PGFREE);

AFAICS preempt_disable() is enough for using __count_vm_event(), no?

@@ -2647,9 +2644,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct list_head *list;
bool cold = ((gfp_flags & __GFP_COLD) != 0);
struct page *page;
- unsigned long flags;

- local_irq_save(flags);
+ preempt_disable();
pcp = &this_cpu_ptr(zone->pageset)->pcp;
list = &pcp->lists[migratetype];
page = __rmqueue_pcplist(zone, order, gfp_flags, migratetype,
@@ -2658,7 +2654,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);

But if I'm wrong above, then this __count should be converted too?

zone_statistics(preferred_zone, zone, gfp_flags);
}
- local_irq_restore(flags);
+ preempt_enable_no_resched();
return page;
}