Re: [PATCH v17 4/9] mm: Introduce Reported pages

From: Mel Gorman
Date: Wed Feb 19 2020 - 09:55:18 EST


On Tue, Feb 11, 2020 at 02:46:35PM -0800, Alexander Duyck wrote:
> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> new file mode 100644
> index 000000000000..1047c6872d4f
> --- /dev/null
> +++ b/mm/page_reporting.c
> @@ -0,0 +1,319 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/mm.h>
> +#include <linux/mmzone.h>
> +#include <linux/page_reporting.h>
> +#include <linux/gfp.h>
> +#include <linux/export.h>
> +#include <linux/delay.h>
> +#include <linux/scatterlist.h>
> +
> +#include "page_reporting.h"
> +#include "internal.h"
> +
> +#define PAGE_REPORTING_DELAY (2 * HZ)

I assume there is nothing special about 2 seconds other than "make some
progress every so often".

> +static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
> +
> +enum {
> + PAGE_REPORTING_IDLE = 0,
> + PAGE_REPORTING_REQUESTED,
> + PAGE_REPORTING_ACTIVE
> +};
> +
> +/* request page reporting */
> +static void
> +__page_reporting_request(struct page_reporting_dev_info *prdev)
> +{
> + unsigned int state;
> +
> + /* Check to see if we are in desired state */
> + state = atomic_read(&prdev->state);
> + if (state == PAGE_REPORTING_REQUESTED)
> + return;
> +
> + /*
> + * If reporting is already active there is nothing we need to do.
> + * Test against 0 as that represents PAGE_REPORTING_IDLE.
> + */
> + state = atomic_xchg(&prdev->state, PAGE_REPORTING_REQUESTED);
> + if (state != PAGE_REPORTING_IDLE)
> + return;
> +
> + /*
> + * Delay the start of work to allow a sizable queue to build. For
> + * now we are limiting this to running no more than once every
> + * couple of seconds.
> + */
> + schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
> +}

Seems a fair use of atomics.
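
For what it's worth, the lifecycle I'm reading from this hunk plus the
worker later in the patch (so the ACTIVE transitions below are my
understanding of code not quoted here) is roughly:

    IDLE      -> REQUESTED   __page_reporting_request(), via the xchg above
    REQUESTED -> ACTIVE      the worker starts a reporting pass
    ACTIVE    -> REQUESTED   a new request arrives mid-pass, work is rescheduled
    ACTIVE    -> IDLE        nothing new arrived by the end of the pass

which is why testing the old value against PAGE_REPORTING_IDLE after the
xchg is enough to decide whether the work needs to be scheduled.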

> +static int
> +page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
> + unsigned int order, unsigned int mt,
> + struct scatterlist *sgl, unsigned int *offset)
> +{
> + struct free_area *area = &zone->free_area[order];
> + struct list_head *list = &area->free_list[mt];
> + unsigned int page_len = PAGE_SIZE << order;
> + struct page *page, *next;
> + int err = 0;
> +
> + /*
> + * Perform early check, if free area is empty there is
> + * nothing to process so we can skip this free_list.
> + */
> + if (list_empty(list))
> + return err;
> +
> + spin_lock_irq(&zone->lock);
> +
> + /* loop through free list adding unreported pages to sg list */
> + list_for_each_entry_safe(page, next, list, lru) {
> + /* We are going to skip over the reported pages. */
> + if (PageReported(page))
> + continue;
> +
> + /* Attempt to pull page from list */
> + if (!__isolate_free_page(page, order))
> + break;
> +

Might want to note that you are breaking here because the only reason
isolation can fail is that the watermarks are not met, which means we are
likely under memory pressure. It's not a big issue.

However, while I think this is correct, it's hard to follow. This loop can
be broken out of with pages still on the scatter/gather list. The current
flow guarantees that err will not be set at this point, so the caller
cleans up the remainder and the list is always drained either here or in
the caller.

While I think it works, it's a bit fragile. I recommend putting a comment
above this noting why it's safe and adding a VM_WARN_ON_ONCE(err) before
the break, in case someone changes this in a year's time and does not
spot that reaching page_reporting_drain *somewhere* is critical.
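
Roughly what I have in mind (untested, just to illustrate the shape of
it; err and __isolate_free_page are as in the hunk above):

	/* Attempt to pull page from list */
	if (!__isolate_free_page(page, order)) {
		/*
		 * Isolation can only fail because the watermarks are
		 * not met, i.e. we are likely under memory pressure.
		 * Any pages already gathered in sgl are still drained
		 * by the caller, which relies on err being 0 when we
		 * break out here.
		 */
		VM_WARN_ON_ONCE(err);
		break;
	}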

> + /* Add page to scatter list */
> + --(*offset);
> + sg_set_page(&sgl[*offset], page, page_len, 0);
> +
> + /* If scatterlist isn't full grab more pages */
> + if (*offset)
> + continue;
> +
> + /* release lock before waiting on report processing */
> + spin_unlock_irq(&zone->lock);
> +
> + /* begin processing pages in local list */
> + err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
> +
> + /* reset offset since the full list was reported */
> + *offset = PAGE_REPORTING_CAPACITY;
> +
> + /* reacquire zone lock and resume processing */
> + spin_lock_irq(&zone->lock);
> +
> + /* flush reported pages from the sg list */
> + page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);
> +
> + /*
> + * Reset next to first entry, the old next isn't valid
> + * since we dropped the lock to report the pages
> + */
> + next = list_first_entry(list, struct page, lru);
> +
> + /* exit on error */
> + if (err)
> + break;
> + }
> +
> + spin_unlock_irq(&zone->lock);
> +
> + return err;
> +}

I complained about the use of the zone lock before, but in this version I
think I'm ok with it. The lock is held for the free list manipulations,
which is what it's for. The state management with atomics seems
reasonable.

Otherwise I think this is ok and I think the implementation is right. Of
great importance to me were the allocator fast paths, but they seem to be
adequately protected by a static branch so
Acked-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>

The ack applies regardless of whether you decide to document and
defensively protect page_reporting_cycle against losing pages on the
scatter/gather list but I do recommend it.
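
As an aside, for anyone reading this without the rest of the series in
front of them: the fast-path protection I'm referring to is the static
key the series adds so that the free path only pays a patched-out branch
until a reporting device registers. From memory (the names may not match
the patch exactly), the hook looks something like:

	DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);

	static inline void page_reporting_notify_free(unsigned int order)
	{
		/* Patched-out branch unless a device has registered */
		if (!static_branch_unlikely(&page_reporting_enabled))
			return;

		/* Only orders large enough to be worth reporting */
		if (order < PAGE_REPORTING_MIN_ORDER)
			return;

		__page_reporting_notify();
	}

which keeps the cost to the allocator fast paths at a single no-op branch
while reporting is disabled.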

--
Mel Gorman
SUSE Labs