Re: [PATCH 4/4] mm, page_alloc: Add a bulk page allocator

From: Mel Gorman
Date: Mon Jan 16 2017 - 10:01:25 EST


On Mon, Jan 16, 2017 at 03:25:18PM +0100, Jesper Dangaard Brouer wrote:
> On Mon, 9 Jan 2017 16:35:18 +0000
> Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> > This patch adds a new page allocator interface via alloc_pages_bulk,
> > __alloc_pages_bulk and __alloc_pages_bulk_nodemask. A caller requests a
> > number of pages to be allocated and added to a list. They can be freed in
> > bulk using free_pages_bulk(). Note that it would theoretically be possible
> > to use free_hot_cold_page_list for faster frees if the symbol was exported,
> > the refcounts were 0 and the caller guaranteed it was not in an interrupt.
> > This would be significantly faster in the free path but also less safe
> > and a harder API to use.
> >
> > The API is not guaranteed to return the requested number of pages and
> > may fail if the preferred allocation zone has limited free memory, the
> > cpuset changes during the allocation or page debugging decides to fail
> > an allocation. It's up to the caller to request more pages in batch if
> > necessary.
> >
> > The following compares the allocation cost per page for different batch
> > sizes. The baseline is allocating them one at a time and it compares with
> > the performance when using the new allocation interface.
>
> I've also played with testing the bulking API here:
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c
>
> My baseline single (order-0 page) allocation shows: 158 cycles(tsc) 39.593 ns
>
> Using bulking API:
> Bulk:   1 cycles: 128 nanosec: 32.134
> Bulk:   2 cycles: 107 nanosec: 26.783
> Bulk:   3 cycles: 100 nanosec: 25.047
> Bulk:   4 cycles:  95 nanosec: 23.988
> Bulk:   8 cycles:  91 nanosec: 22.823
> Bulk:  16 cycles:  88 nanosec: 22.093
> Bulk:  32 cycles:  85 nanosec: 21.338
> Bulk:  64 cycles:  85 nanosec: 21.315
> Bulk: 128 cycles:  84 nanosec: 21.214
> Bulk: 256 cycles: 115 nanosec: 28.979
>
> This bulk API (and the other improvements in the patchset) definitely
> moves the speed of the page allocator closer to my (crazy) time budget
> target of between 201 and 269 cycles per packet[1]. Remember I was
> reporting[2] an order-0 cost of between 231 and 277 cycles at MM-summit
> 2016, so this is a huge improvement since then.
>

Good to hear.
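
For scale, the quoted figures put the best batch size (128) at roughly 47%
fewer cycles per page than the single-page baseline. A quick userspace sketch
of that arithmetic, using only the cycle counts quoted above; the helper names
are made up for illustration:

```c
#include <stddef.h>
#include <stdio.h>

/* Cycle count quoted above for a single order-0 allocation. */
#define BASELINE_CYCLES 158

/* Percentage of per-page cycles saved relative to the baseline. */
static double saving_pct(unsigned int bulk_cycles)
{
	return 100.0 * (BASELINE_CYCLES - (double)bulk_cycles) / BASELINE_CYCLES;
}

/* Print the saving for a few of the quoted batch sizes. */
static void print_savings(void)
{
	static const struct { unsigned int batch, cycles; } rows[] = {
		{ 1, 128 }, { 8, 91 }, { 128, 84 }, { 256, 115 },
	};

	for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
		printf("batch %3u: %3u cycles/page, %.1f%% below baseline\n",
		       rows[i].batch, rows[i].cycles,
		       saving_pct(rows[i].cycles));
}
```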

> The bulk numbers are great, but it still cannot compete with the
> recycling tricks used by drivers. Looking at the code (and as Mel also
> mentions) there is room for improvement, especially on the bulk free side.
>

A major component of the free-side cost is the refcount handling and the
safety checks. If necessary, you could mandate that callers drop the
reference count or allow pages to be freed with an elevated count to avoid
the atomic ops. In an early prototype, I made the "mistake" of skipping the
refcount handling and freeing cost half as much. I restored it in the final
version to have an API that was almost identical to the existing allocator
other than the bulking aspects.
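
To make the refcount trade-off concrete, here is a minimal userspace sketch,
not code from the patch. struct mock_page and both function names are
hypothetical stand-ins, and "freeing" is reduced to counting pages. The
checked variant pays one atomic op per page; the unchecked one relies
entirely on the caller's promise that every refcount is already 0.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the refcount matters here. */
struct mock_page {
	atomic_int refcount;
	struct mock_page *next;
};

/*
 * Safe variant: drop one reference per page with an atomic op and count
 * the page as freed only if that was the last reference.
 */
static size_t mock_free_bulk_checked(struct mock_page *list)
{
	size_t freed = 0;

	for (struct mock_page *p = list; p; p = p->next) {
		/* atomic_fetch_sub returns the old value; 1 means we hit 0 */
		if (atomic_fetch_sub(&p->refcount, 1) == 1)
			freed++;
	}
	return freed;
}

/*
 * Fast variant: the caller promises every page already has refcount 0,
 * so no atomic op is issued. Nothing here verifies that promise.
 */
static size_t mock_free_bulk_unchecked(struct mock_page *list)
{
	size_t freed = 0;

	for (struct mock_page *p = list; p; p = p->next)
		freed++;
	return freed;
}
```

The unchecked variant frees a page that still has users without complaint,
which is the sort of thing that makes it a harder API to use.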

You could also disable all the other safety checks and flag that the bulk
alloc/free potentially frees pages in an inconsistent state. That would
increase the performance at the cost of safety but that may be acceptable
given that driver recycling of pages also avoids the same checks.

You could also consider disabling the statistics updates to avoid a bunch
of per-cpu stat operations, particularly if the pages were mostly recycled
by the generic pool allocator.
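
As a rough illustration of the last two options, a bulk free path could take
opt-out flags. This is a userspace sketch only: the flags, struct mock_pg and
struct mock_counters are all hypothetical, and nothing like them exists in
the posted patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags, invented for this sketch. */
#define MOCK_BULK_SKIP_CHECKS 0x1u /* trust the caller, skip sanity checks */
#define MOCK_BULK_SKIP_STATS  0x2u /* leave the statistics counter alone */

struct mock_pg {
	bool bad_state;              /* whatever the checks would catch */
	struct mock_pg *next;
};

struct mock_counters {
	unsigned long freed;         /* pages accepted by the free path */
	unsigned long stat_updates;  /* counter writes performed */
};

static void mock_free_bulk_flags(struct mock_pg *list, unsigned int flags,
				 struct mock_counters *c)
{
	for (struct mock_pg *p = list; p; p = p->next) {
		/* With checks enabled, a page in a bad state is rejected. */
		if (!(flags & MOCK_BULK_SKIP_CHECKS) && p->bad_state)
			continue;
		c->freed++;
		/* With stats enabled, every freed page costs one update. */
		if (!(flags & MOCK_BULK_SKIP_STATS))
			c->stat_updates++;
	}
}
```

With both flags set, a page in a bad state goes back unnoticed and the
counter no longer reflects reality, which is the cost being traded for speed.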

--
Mel Gorman
SUSE Labs