Re: [PATCH v3 4/6] mm: Introduce Reported pages

From: Alexander Duyck
Date: Mon Aug 05 2019 - 11:11:55 EST


On Mon, Aug 5, 2019 at 7:05 AM Nitesh Narayan Lal <nitesh@xxxxxxxxxx> wrote:
>
>
> On 8/1/19 6:33 PM, Alexander Duyck wrote:
> > From: Alexander Duyck <alexander.h.duyck@xxxxxxxxxxxxxxx>
> >
> > In order to pave the way for free page reporting in virtualized
> > environments we will need a way to get pages out of the free lists and
> > identify those pages after they have been returned. To accomplish this,
> > this patch adds the concept of a Reported Buddy, which is essentially
> > meant to just be the Uptodate flag used in conjunction with the Buddy
> > page type.
> >
> > It adds a set of pointers we shall call "boundary" which represents the
> > upper boundary between the unreported and reported pages. The general idea
> > is that in order for a page to cross from one side of the boundary to the
> > other it will need to go through the reporting process. Ultimately a
> > free_list has been fully processed when the boundary has been moved from
> > the tail all they way up to occupying the first entry in the list.
> >
> > Doing this we should be able to make certain that we keep the reported
> > pages as one contiguous block in each free list. This will allow us to
> > efficiently manipulate the free lists whenever we need to go in and start
> > sending reports to the hypervisor that there are new pages that have been
> > freed and are no longer in use.
> >
> > An added advantage to this approach is that we should be reducing the
> > overall memory footprint of the guest as it will be more likely to recycle
> > warm pages versus trying to allocate the reported pages that were likely
> > evicted from the guest memory.
> >
> > Since we will only be reporting one zone at a time we keep the boundary
> > limited to being defined for just the zone we are currently reporting pages
> > from. Doing this we can keep the number of additional pointers needed quite
> > small. To flag that the boundaries are in place we use a single bit
> > in the zone to indicate that reporting and the boundaries are active.
> >
> > The determination of when to start reporting is based on the tracking of
> > the number of free pages in a given area versus the number of reported
> > pages in that area. We keep track of the number of reported pages per
> > free_area in a separate zone specific area. We do this to avoid modifying
> > the free_area structure as this can lead to false sharing for the highest
> > order with the zone lock which leads to a noticeable performance
> > degradation.
> >
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@xxxxxxxxxxxxxxx>
> > ---
> > include/linux/mmzone.h | 40 +++++
> > include/linux/page-flags.h | 11 +
> > include/linux/page_reporting.h | 138 ++++++++++++++++++
> > mm/Kconfig | 5 +
> > mm/Makefile | 1
> > mm/memory_hotplug.c | 1
> > mm/page_alloc.c | 136 ++++++++++++++++++
> > mm/page_reporting.c | 299 ++++++++++++++++++++++++++++++++++++++++
> > 8 files changed, 623 insertions(+), 8 deletions(-)
> > create mode 100644 include/linux/page_reporting.h
> > create mode 100644 mm/page_reporting.c
> >

<snip>

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 71aadc7d5ff6..69b848e5b83f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -68,6 +68,7 @@
> > #include <linux/lockdep.h>
> > #include <linux/nmi.h>
> > #include <linux/psi.h>
> > +#include <linux/page_reporting.h>
> >
> > #include <asm/sections.h>
> > #include <asm/tlbflush.h>
> > @@ -915,7 +916,7 @@ static inline struct capture_control *task_capc(struct zone *zone)
> > static inline void __free_one_page(struct page *page,
> > unsigned long pfn,
> > struct zone *zone, unsigned int order,
> > - int migratetype)
> > + int migratetype, bool reported)
> > {
> > struct capture_control *capc = task_capc(zone);
> > unsigned long uninitialized_var(buddy_pfn);
> > @@ -990,11 +991,20 @@ static inline void __free_one_page(struct page *page,
> > done_merging:
> > set_page_order(page, order);
> >
> > - if (is_shuffle_order(order) ? shuffle_add_to_tail() :
> > - buddy_merge_likely(pfn, buddy_pfn, page, order))
> > + if (reported ||
> > + (is_shuffle_order(order) ? shuffle_add_to_tail() :
> > + buddy_merge_likely(pfn, buddy_pfn, page, order)))
> > add_to_free_list_tail(page, zone, order, migratetype);
> > else
> > add_to_free_list(page, zone, order, migratetype);
> > +
> > + /*
> > + * No need to notify on a reported page as the total count of
> > + * unreported pages will not have increased since we have essentially
> > + * merged the reported page with one or more unreported pages.
> > + */
> > + if (!reported)
> > + page_reporting_notify_free(zone, order);
> > }
> >
> > /*
> > @@ -1305,7 +1315,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > if (unlikely(isolated_pageblocks))
> > mt = get_pageblock_migratetype(page);
> >
> > - __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > + __free_one_page(page, page_to_pfn(page), zone, 0, mt, false);
> > trace_mm_page_pcpu_drain(page, 0, mt);
> > }
> > spin_unlock(&zone->lock);
> > @@ -1321,7 +1331,7 @@ static void free_one_page(struct zone *zone,
> > is_migrate_isolate(migratetype))) {
> > migratetype = get_pfnblock_migratetype(page, pfn);
> > }
> > - __free_one_page(page, pfn, zone, order, migratetype);
> > + __free_one_page(page, pfn, zone, order, migratetype, false);
> > spin_unlock(&zone->lock);
> > }
> >
> > @@ -2183,6 +2193,122 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> > return NULL;
> > }
> >
> > +#ifdef CONFIG_PAGE_REPORTING
> > +/**
> > + * get_unreported_page - Pull an unreported page from the free_list
> > + * @zone: Zone to draw pages from
> > + * @order: Order to draw pages from
> > + * @mt: Migratetype to draw pages from
> > + *
> > + * This function will obtain a page from the free list. It will start by
> > + * attempting to pull from the tail of the free list and if that is already
> > + * reported on it will instead pull the head if that is unreported.
> > + *
> > + * The page will have the migrate type and order stored in the page
> > + * metadata. While being processed the page will not be avaialble for
> > + * allocation.
> > + *
> > + * Return: page pointer if raw page found, otherwise NULL
> > + */
> > +struct page *get_unreported_page(struct zone *zone, unsigned int order, int mt)
> > +{
> > + struct list_head *tail = get_unreported_tail(zone, order, mt);
> > + struct free_area *area = &(zone->free_area[order]);
> > + struct list_head *list = &area->free_list[mt];
> > + struct page *page;
> > +
> > + /* zone lock should be held when this function is called */
> > + lockdep_assert_held(&zone->lock);
> > +
> > + /* Find a page of the appropriate size in the preferred list */
> > + page = list_last_entry(tail, struct page, lru);
> > + list_for_each_entry_from_reverse(page, list, lru) {
> > + /* If we entered this loop then the "raw" list isn't empty */
> > +
> > + /* If the page is reported try the head of the list */
> > + if (PageReported(page)) {
> > + page = list_first_entry(list, struct page, lru);
> > +
> > + /*
> > + * If both the head and tail are reported then reset
> > + * the boundary so that we read as an empty list
> > + * next time and bail out.
> > + */
> > + if (PageReported(page)) {
> > + page_reporting_add_to_boundary(page, zone, mt);
> > + break;
> > + }
> > + }
> > +
> > + del_page_from_free_list(page, zone, order);
> > +
> > + /* record migratetype and order within page */
> > + set_pcppage_migratetype(page, mt);
> > + set_page_private(page, order);
> > +
> > + /*
> > + * Page will not be available for allocation while we are
> > + * processing it so update the freepage state.
> > + */
> > + __mod_zone_freepage_state(zone, -(1 << order), mt);
> > +
> > + return page;
> > + }
> > +
> > + return NULL;
> > +}
> > +
> > +/**
> > + * put_reported_page - Return a now-reported page back where we got it
> > + * @zone: Zone to return pages to
> > + * @page: Page that was reported
> > + *
> > + * This function will pull the migratetype and order information out
> > + * of the page and attempt to return it where it found it. If the page
> > + * is added to the free list without changes we will mark it as being
> > + * reported.
> > + */
> > +void put_reported_page(struct zone *zone, struct page *page)
> > +{
> > + unsigned int order, mt;
> > + unsigned long pfn;
> > +
> > + /* zone lock should be held when this function is called */
> > + lockdep_assert_held(&zone->lock);
> > +
> > + mt = get_pcppage_migratetype(page);
> > + pfn = page_to_pfn(page);
> > +
> > + if (unlikely(has_isolate_pageblock(zone) || is_migrate_isolate(mt))) {
> > + mt = get_pfnblock_migratetype(page, pfn);
> > + set_pcppage_migratetype(page, mt);
> > + }
> > +
> > + order = page_private(page);
> > + set_page_private(page, 0);
> > +
> > + __free_one_page(page, pfn, zone, order, mt, true);
>
> I don't think we need to hold the zone lock for fetching migratetype and other
> information.
> We can save some lock held time by acquiring and releasing zone lock before and
> after __free_one_page() respectively. Isn't?

We could, but acquiring and releasing the lock also takes time. I
thought it better to simply hold the lock while I dump the scatterlist
back into the free_list, and until I have completed pulling the
non-reported pages back out. Otherwise we take the overhead for
acquiring/releasing the spinlock itself which isn't necessarily cheap
since it will be frequently bounced between CPUs.

> > +
> > + /*
> > + * If page was comingled with another page we cannot consider
> > + * the result to be "reported" since part of the page hasn't been.
> > + * In this case we will simply exit and not update the "reported"
> > + * state. Instead just treat the result as a unreported page.
> > + */
> > + if (!PageBuddy(page) || page_order(page) != order)
> > + return;
> > +
> > + /* update areated page accounting */
> > + zone->reported_pages[order - PAGE_REPORTING_MIN_ORDER]++;
> > +
> > + /* update boundary of new migratetype and record it */
> > + page_reporting_add_to_boundary(page, zone, mt);
> > +
> > + /* flag page as reported */
> > + __SetPageReported(page);
> > +}
> > +#endif /* CONFIG_PAGE_REPORTING */
> > +
> > /*
> > * This array describes the order lists are fallen back to when
> > * the free lists for the desirable migrate type are depleted
> > diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> > new file mode 100644
> > index 000000000000..971138205ae5
> > --- /dev/null
> > +++ b/mm/page_reporting.c
> > @@ -0,0 +1,299 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/mm.h>
> > +#include <linux/mmzone.h>
> > +#include <linux/page-isolation.h>
> > +#include <linux/gfp.h>
> > +#include <linux/export.h>

<snip>

> > +int page_reporting_startup(struct page_reporting_dev_info *phdev)
> > +{
> > + struct zone *zone;
> > +
> > + /* nothing to do if already in use */
> > + if (rcu_access_pointer(ph_dev_info))
> > + return -EBUSY;
> > +
> > + /* allocate scatterlist to store pages being reported on */
> > + phdev->sg = kcalloc(phdev->capacity, sizeof(*phdev->sg), GFP_KERNEL);
> > + if (!phdev->sg)
> > + return -ENOMEM;
> > +
> > + /* initialize refcnt and work structures */
> > + atomic_set(&phdev->refcnt, 0);
> > + INIT_DELAYED_WORK(&phdev->work, &page_reporting_process);
> > +
> > + /* assign device, and begin initial flush of populated zones */
> > + rcu_assign_pointer(ph_dev_info, phdev);
>
>
> Will, it not make sense to do this at the top after rcu_access_pointer check()?
> Otherwise, there could be a race between two enablers. Am I missing something here?

Placement wouldn't matter as a race would still be possible. Right now
this is safe since there is really only one consumer for this. However
I suppose I should look at adding a mutex so that we cannot have
multiple threads doing the initialization at the same time.

> > + for_each_populated_zone(zone) {
> > + spin_lock(&zone->lock);
> > + __page_reporting_request(zone);
> > + spin_unlock(&zone->lock);
> > + }
> > +
> > + /* enable page reporting notification */
> > + static_key_slow_inc(&page_reporting_notify_enabled);
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(page_reporting_startup);
> > +
> >
> --
> Thanks
> Nitesh
>