Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

From: Minchan Kim
Date: Tue Sep 25 2012 - 20:46:15 EST


On Tue, Sep 25, 2012 at 10:12:07AM +0100, Mel Gorman wrote:
> On Mon, Sep 24, 2012 at 02:26:44PM -0700, Andrew Morton wrote:
> > On Mon, 24 Sep 2012 10:39:38 +0100
> > Mel Gorman <mgorman@xxxxxxx> wrote:
> >
> > > On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
> > >
> > > > Also, what has to be done to avoid the polling altogether? eg/ie, zap
> > > > a pageblock's PB_migrate_skip synchronously, when something was done to
> > > > that pageblock which justifies repolling it?
> > > >
> > >
> > > The "something" event you are looking for is pages being freed or
> > > allocated in the page allocator. A movable page being allocated in a
> > > block or a page being freed should clear the PB_migrate_skip bit if it's set.
> > > Unfortunately this would impact the fast path of the alloc and free paths
> > > of the page allocator. I felt that that was too high a price to pay.
> >
> > We already do a similar thing in the page allocator: clearing of
> > ->all_unreclaimable and ->pages_scanned.
>
> That is true but that is a simple write (shared cache line but still) to
> a struct zone. Worse, now that you point it out, that's pretty stupid. It
> should be checking if the value is non-zero before writing to it to avoid
> a cache line bounce.
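> 
> i.e. something like this in the free path (just a sketch):
> 
>         /* only dirty the shared cache line when the fields are actually set */
>         if (zone->all_unreclaimable)
>                 zone->all_unreclaimable = 0;
>         if (zone->pages_scanned)
>                 zone->pages_scanned = 0;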
>
> Clearing PG_migrate_skip in this path to avoid the need to ever poll is
> not as cheap as it needs to be:
>
> set_pageblock_skip
>   -> set_pageblock_flags_group
>        -> page_zone
>        -> page_to_pfn
>        -> get_pageblock_bitmap
>        -> pfn_to_bitidx
>        -> __set_bit
>
> > But that isn't on the "fast
> > path" really - it happens once per pcp unload.
>
> That's still an important enough path that I'm wary of making it fatter
> and that only covers the free path. To avoid the polling, the allocation
> side needs to be handled too. It could be shoved down into rmqueue() to
> put it into a slightly colder path but still, it's a price to pay to keep
> compaction happy.
>
> > Can we do something
> > like that? Drop some hint into the zone without having to visit each
> > page?
> >
>
> Not without incurring a cost, but yes, it is possible to give a hint on when
> PG_migrate_skip should be cleared and move away from that time-based hammer.
>
> First, we'd introduce a variant of get_pageblock_migratetype() that returns
> all the bits for the pageblock flags and then helpers to extract either the
> migratetype or the PG_migrate_skip. We already are incurring the cost of
> get_pageblock_migratetype() so it will not be much more expensive than what
> is already there. If there is an allocation or free within a pageblock that
> has the PG_migrate_skip bit set then we increment a counter. When the counter
> reaches some to-be-decided "threshold" then compaction may clear all the
> bits. This would match the criteria of the clearing being based on activity.
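> 
> A rough sketch, with illustrative helper names:
> 
> /* Fetch the migratetype and the skip bit with a single bitmap lookup */
> static inline unsigned long get_pageblock_flags_all(struct page *page)
> {
>         return get_pageblock_flags_group(page, PB_migrate, PB_migrate_skip);
> }
> 
> static inline int pbflags_to_migratetype(unsigned long pbflags)
> {
>         /* migratetype occupies bits PB_migrate..PB_migrate_end */
>         return pbflags & ((1 << (PB_migrate_end - PB_migrate + 1)) - 1);
> }
> 
> static inline bool pbflags_skip(unsigned long pbflags)
> {
>         return pbflags & (1 << (PB_migrate_skip - PB_migrate));
> }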
>
> There are four potential problems with this
>
> 1. The logic to retrieve all the bits and split them up will be a little
> convoluted, but maybe it would not be that bad.
>
> 2. The counter is a shared-writable cache line but obviously it could
> be moved to vmstat and incremented with inc_zone_page_state to offset
> the cost a little.
>
> 3. The biggest weakness is that there is no way to know whether the
> counter is incremented based on activity in a small subset of blocks.
>
> 4. What should the threshold be?
>
> The first problem is minor but the other three are potentially a mess.
> Adding another vmstat counter is bad enough in itself but if the counter
> is incremented based on a small subset of pageblocks, the hint becomes
> potentially useless.

Another idea is to add two bits (PG_check_migrate/PG_check_free) to the
pageblock flags group.
The allocation path would set PG_check_migrate in a pageblock and the
free path would set PG_check_free, and compaction's scan would clear
them just as it clears PG_migrate_skip now.
That discards problems 3 and 4 at least, because the hint is
per-pageblock rather than a zone-wide counter.
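
Something like this (just a sketch; I use the PB_ prefix the existing
pageblock bits use):

enum pageblock_bits {
        PB_migrate,
        PB_migrate_end = PB_migrate + 2,        /* 3 bits for migratetype */
        PB_migrate_skip,
        PB_check_migrate,       /* an allocation happened in this block */
        PB_check_free,          /* a free happened in this block */
        NR_PAGEBLOCK_BITS       /* grows from 4 to 6 bits per pageblock */
};

/* allocation path */
set_pageblock_flags_group(page, 1, PB_check_migrate, PB_check_migrate);

/* free path */
set_pageblock_flags_group(page, 1, PB_check_free, PB_check_free);

Compaction's scan would then test and clear these two bits when deciding
whether to clear PB_migrate_skip for the block.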

Another idea is to cure it by fixing the fundamental problem:
make the zone's locks more fine-grained.
As time goes by systems use ever bigger memory, but our zone locks
don't scale with it. Reports of lru_lock and zone->lock contention
aren't rare these days, so I think it's a good time to take the next step.

How about defining a struct sub_zone per 2G or 4G of memory?
A zone would then have several sub_zones depending on its size, each
sub_zone would take over the current zone's role, and the zone would
become just a container of sub_zones.
Of course it's not easy to implement, but I think someday we should
go that way. Is it really overkill?
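
To make the idea concrete, a very rough sketch (the field choices here
are mine, purely for illustration):

struct sub_zone {
        spinlock_t              lock;           /* splits zone->lock */
        spinlock_t              lru_lock;       /* splits zone->lru_lock */
        struct free_area        free_area[MAX_ORDER];
        struct lruvec           lruvec;
        unsigned long           start_pfn;
        unsigned long           spanned_pages;  /* e.g. 2G or 4G worth */
};

Each sub_zone would carry its own locks, buddy lists and LRU lists, and
the parent zone would keep an array of them, so contention is split by
address range instead of being funneled through one pair of zone-wide locks.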

>
> However, does this match what you have in mind or am I over-complicating
> things?
>
> > > > >
> > > > > ...
> > > > >
> > > > > +static void reset_isolation_suitable(struct zone *zone)
> > > > > +{
> > > > > +        unsigned long start_pfn = zone->zone_start_pfn;
> > > > > +        unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > > > > +        unsigned long pfn;
> > > > > +
> > > > > +        /*
> > > > > +         * Do not reset more than once every five seconds. If allocations are
> > > > > +         * failing sufficiently quickly to allow this to happen then continually
> > > > > +         * scanning for compaction is not going to help. The choice of five
> > > > > +         * seconds is arbitrary but will mitigate excessive scanning.
> > > > > +         */
> > > > > +        if (time_before(jiffies, zone->compact_blockskip_expire))
> > > > > +                return;
> > > > > +        zone->compact_blockskip_expire = jiffies + (HZ * 5);
> > > > > +
> > > > > +        /* Walk the zone and mark every pageblock as suitable for isolation */
> > > > > +        for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> > > > > +                struct page *page;
> > > > > +                if (!pfn_valid(pfn))
> > > > > +                        continue;
> > > > > +
> > > > > +                page = pfn_to_page(pfn);
> > > > > +                if (zone != page_zone(page))
> > > > > +                        continue;
> > > > > +
> > > > > +                clear_pageblock_skip(page);
> > > > > +        }
> > > >
> > > > What's the worst-case loop count here?
> > > >
> > >
> > > zone->spanned_pages >> pageblock_order
> >
> > What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)
>
> Let's take an unlikely case - a 128G single-node machine. With 2M
> pageblocks on x86-64 (pageblock_order == 9 with 4K pages), that loop
> count is 128G / 2M = 65536. It'll be fast enough, particularly in this
> path.
>
> --
> Mel Gorman
> SUSE Labs
>

--
Kind regards,
Minchan Kim