Re: [RFC v4 PATCH 3/5] mm/rmqueue_bulk: alloc without touching individual page structure

From: Aaron Lu
Date: Mon Oct 22 2018 - 22:19:37 EST

Next message: David Miller: "Re: [PATCH v3] isdn: hfc_{pci,sx}: Avoid empty body if statements"
Previous message: Andy Lutomirski: "Re: [PATCH v2 1/5] x86/vdso: Renames variable to fix shadow warning."
In reply to: Vlastimil Babka: "Re: [RFC v4 PATCH 3/5] mm/rmqueue_bulk: alloc without touching individual page structure"
Next in thread: Aaron Lu: "[RFC v4 PATCH 4/5] mm/free_pcppages_bulk: reduce overhead of cluster operation on free path"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Oct 22, 2018 at 11:37:53AM +0200, Vlastimil Babka wrote:
> On 10/17/18 8:33 AM, Aaron Lu wrote:
> > Profile on Intel Skylake server shows the most time consuming part
> > under zone->lock on allocation path is accessing those to-be-returned
> > page's "struct page" on the free_list inside zone->lock. One explanation
> > is, different CPUs are releasing pages to the head of free_list and
> > those page's 'struct page' may very well be cache cold for the allocating
> > CPU when it grabs these pages from free_list' head. The purpose here
> > is to avoid touching these pages one by one inside zone->lock.
>
> What about making the pages cache-hot first, without zone->lock, by
> traversing via page->lru. It would need some safety checks obviously
> (maybe based on page_to_pfn + pfn_valid, or something) to make sure we
> only read from real struct pages in case there's some update racing. The
> worst case would be not populating enough due to race, and thus not
> gaining the performance when doing the actual rmqueueing under lock.

Yes, there are the 2 potential problems you have pointed out:
1 we may be prefetching something that isn't a page, because page->lru
can be reused for different things in different scenarios;
2 we may not be able to prefetch much, because another CPU may be doing
allocation inside the lock; we could end up prefetching pages that
are already on another CPU's pcp list.

Considering the above 2 problems, I feel prefetching outside the lock
is a little risky and troublesome.

The allocation path is the hard part of improving page allocator
performance - in the free path, we can prefetch pages safely outside the
lock and we can even pre-merge them outside the lock to reduce pressure
on the zone lock; but in the allocation path, there is pretty much
nothing we can do before acquiring the lock, except taking the risk of
prefetching them without holding the lock as you mentioned here.
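
To make the asymmetry concrete, here is a minimal userspace sketch, not
kernel code. Everything in it is an assumption made for illustration:
struct node stands in for struct page, struct free_list and its pthread
mutex stand in for the zone free_list and zone->lock, and free_batch()
and alloc_batch() are hypothetical names. __builtin_prefetch is the
GCC/Clang builtin. The free side can walk its batch before locking
because the caller owns those nodes exclusively; the alloc side cannot
even read the list head safely until it holds the lock.

```c
/* Userspace illustration only; all names here are hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct node {                   /* stand-in for struct page */
    struct node *next;
};

struct free_list {
    pthread_mutex_t lock;       /* stand-in for zone->lock */
    struct node *head;
    size_t count;
};

/*
 * Free path: 'batch' is a private, NULL-terminated list owned by the
 * caller, so it can be walked and warmed up before taking the lock.
 * Only the splice onto the shared list happens under the lock.
 */
size_t free_batch(struct free_list *fl, struct node *batch)
{
    struct node *tail = NULL;
    size_t n = 0;

    assert(fl != NULL);

    /* Outside the lock: prefetch ahead, find the tail, count nodes. */
    for (struct node *p = batch; p; p = p->next) {
        __builtin_prefetch(p->next);
        tail = p;
        n++;
    }
    if (!tail)
        return 0;

    /* Inside the lock: a constant amount of pointer work. */
    pthread_mutex_lock(&fl->lock);
    tail->next = fl->head;
    fl->head = batch;
    fl->count += n;
    pthread_mutex_unlock(&fl->lock);
    return n;
}

/*
 * Alloc path: the nodes to be returned are only known by reading the
 * shared list, so the walk itself has to happen under the lock. Any
 * cache miss on a node is paid while the lock is held.
 */
size_t alloc_batch(struct free_list *fl, size_t want, struct node **out)
{
    struct node *first, *last = NULL;
    size_t n = 0;

    assert(fl != NULL && out != NULL);

    pthread_mutex_lock(&fl->lock);
    first = fl->head;
    for (struct node *p = first; p && n < want; p = p->next) {
        last = p;
        n++;
    }
    if (last) {
        fl->head = last->next;  /* detach the first n nodes */
        last->next = NULL;
        fl->count -= n;
    }
    pthread_mutex_unlock(&fl->lock);

    *out = n ? first : NULL;
    return n;
}
```

In this sketch the free side holds the lock for a fixed amount of work
regardless of batch size, while the alloc side holds it for the whole
traversal, which is where cache-cold nodes hurt.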

We can come back to this if the 'address space range' lock doesn't work out.
