Re: [PATCH 1/5] mm: kswapd: Stop high-order balancing when any suitable zone is balanced

From: Mel Gorman
Date: Mon Dec 06 2010 - 05:56:21 EST


On Mon, Dec 06, 2010 at 08:35:18AM +0900, Minchan Kim wrote:
> Hi Mel,
>
> On Fri, Dec 3, 2010 at 8:45 PM, Mel Gorman <mel@xxxxxxxxx> wrote:
> > When the allocator enters its slow path, kswapd is woken up to balance the
> > node. It continues working until all zones within the node are balanced. For
> > order-0 allocations, this makes perfect sense but for higher orders it can
> > have unintended side-effects. If the zone sizes are imbalanced, kswapd may
> > reclaim heavily within a smaller zone discarding an excessive number of
> > pages. The user-visible behaviour is that kswapd is awake and reclaiming
> > even though plenty of pages are free from a suitable zone.
> >
> > This patch alters the "balance" logic for high-order reclaim allowing kswapd
> > to stop if any suitable zone becomes balanced to reduce the number of pages
> > it reclaims from other zones. kswapd still tries to ensure that order-0
> > watermarks for all zones are met before sleeping.
> >
> > Signed-off-by: Mel Gorman <mel@xxxxxxxxx>
>
> <snip>
>
> > -       if (!all_zones_ok) {
> > +       if (!(all_zones_ok || (order && any_zone_ok))) {
> >                cond_resched();
> >
> >                try_to_freeze();
> > @@ -2361,6 +2366,31 @@ out:
> >                goto loop_again;
> >        }
> >
> > +       /*
> > +        * If kswapd was reclaiming at a higher order, it has the option of
> > +        * sleeping without all zones being balanced. Before it does, it must
> > +        * ensure that the watermarks for order-0 on *all* zones are met and
> > +        * that the congestion flags are cleared
> > +        */
> > +       if (order) {
> > +               for (i = 0; i <= end_zone; i++) {
> > +                       struct zone *zone = pgdat->node_zones + i;
> > +
> > +                       if (!populated_zone(zone))
> > +                               continue;
> > +
> > +                       if (zone->all_unreclaimable && priority != DEF_PRIORITY)
> > +                               continue;
> > +
> > +                       zone_clear_flag(zone, ZONE_CONGESTED);
>
> Why clear ZONE_CONGESTED?
> If there is a reason, please write it down in a comment.
>

It's because kswapd is the only mechanism that clears the congestion
flag. If it's not cleared and kswapd goes to sleep, the flag could be
left set causing hard-to-diagnose stalls. I'll add a comment.
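
For illustration (paraphrasing from memory rather than quoting the current
code, so treat the exact call site and timeout as approximate), the
throttling side looks something like this, and nothing other than kswapd
clears the flag it tests:

	/*
	 * Illustrative only: wait_iff_congested() sleeps only if the zone
	 * is flagged ZONE_CONGESTED and a BDI is actually congested. If
	 * kswapd went to sleep with the flag still set, callers would keep
	 * eating this timeout until the next kswapd wakeup cleared it.
	 */
	wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);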

> <snip>
>
> My first impression of this patch is that it changes scanning behavior as
> well as reclaim behavior for high-order reclaim.

It does affect scanning behaviour for high-order reclaim. Specifically,
it may stop scanning once a zone is balanced within the node. Previously
it would continue scanning until all zones were balanced. Is this what
you are thinking of or something else?
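
To be explicit about the check, it is roughly the following (a paraphrase of
the patch rather than a quote, so the exact placement of the any_zone_ok
update may differ):

	/* while scanning zones in balance_pgdat(), note a balanced zone */
	if (order && zone_watermark_ok(zone, order, high_wmark_pages(zone),
					end_zone, 0))
		any_zone_ok = 1;

	...

	/* the old check before looping again was "if (!all_zones_ok)" */
	if (!(all_zones_ok || (order && any_zone_ok)))
		goto loop_again;

For order-0 wakeups the any_zone_ok term never fires, so that path is
unchanged; only high-order reclaim can stop early.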

> I can't say the old behavior is right, but we can't say this behavior is
> right either, even though this patch solves the problem. At the least, we
> might need some data showing this patch doesn't introduce a regression.

How do you suggest it be tested and this data be gathered? I tested a number
of workloads that keep kswapd awake but found no differences of major
significance, even though they were using high-order allocations. The problem
with identifying small regressions for high-order allocations is that the
state of the system when lumpy reclaim starts is very important, as it
determines how much work has to be done. I did not find major regressions in
performance.

For the tests I did run:

fsmark showed nothing useful. iozone showed nothing useful either as it didn't
even wake kswapd. sysbench showed minor performance gains and losses but it
is not useful as it typically does not wake kswapd unless the database is
badly configured.

I ran postmark because it was the closest benchmark to a mail simulator I
had access to. This sucks because it's no longer representative of a mail
server and is more like a crappy filesystem benchmark. To get it closer to a
real server, there was also a program running in the background that mapped
a large anonymous segment and scanned it in blocks.
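
Something along these lines (not the exact program, and the mapping and
block sizes here are arbitrary):

	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define MAPPING_SIZE	(1UL << 30)	/* 1G anonymous mapping */
	#define BLOCK_SIZE	(4UL << 20)	/* re-reference 4M at a time */

	int main(void)
	{
		char *map = mmap(NULL, MAPPING_SIZE, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		size_t off;

		if (map == MAP_FAILED)
			return 1;

		/* fault the whole mapping in once */
		memset(map, 1, MAPPING_SIZE);

		/* then keep scanning it in blocks to maintain pressure */
		for (;;) {
			for (off = 0; off < MAPPING_SIZE; off += BLOCK_SIZE) {
				memset(map + off, 1, BLOCK_SIZE);
				usleep(10000);
			}
		}

		return 0;
	}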

POSTMARK
                postmark-traceonly-v3r1-postmark   postmark-kanyzone-v2r6-postmark
traceonly-v3r1 kanyzone-v2r6
Transactions per second: 2.00 ( 0.00%) 2.00 ( 0.00%)
Data megabytes read per second: 8.14 ( 0.00%) 8.59 ( 5.24%)
Data megabytes written per second: 18.94 ( 0.00%) 19.98 ( 5.21%)
Files created alone per second: 4.00 ( 0.00%) 4.00 ( 0.00%)
Files create/transact per second: 1.00 ( 0.00%) 1.00 ( 0.00%)
Files deleted alone per second: 34.00 ( 0.00%) 30.00 (-13.33%)
Files delete/transact per second: 1.00 ( 0.00%) 1.00 ( 0.00%)

MMTests Statistics: duration
User/Sys Time Running Test (seconds) 152.4 152.92
Total Elapsed Time (seconds) 5110.96 4847.22

FTrace Reclaim Statistics: vmscan
                postmark-traceonly-v3r1-postmark   postmark-kanyzone-v2r6-postmark
traceonly-v3r1 kanyzone-v2r6
Direct reclaims 0 0
Direct reclaim pages scanned 0 0
Direct reclaim pages reclaimed 0 0
Direct reclaim write file async I/O 0 0
Direct reclaim write anon async I/O 0 0
Direct reclaim write file sync I/O 0 0
Direct reclaim write anon sync I/O 0 0
Wake kswapd requests 0 0
Kswapd wakeups 2177 2174
Kswapd pages scanned 34690766 34691473
Kswapd pages reclaimed 34511965 34513478
Kswapd reclaim write file async I/O 32 0
Kswapd reclaim write anon async I/O 2357 2561
Kswapd reclaim write file sync I/O 0 0
Kswapd reclaim write anon sync I/O 0 0
Time stalled direct reclaim (seconds) 0.00 0.00
Time kswapd awake (seconds) 632.10 683.34

Total pages scanned 34690766 34691473
Total pages reclaimed 34511965 34513478
%age total pages scanned/reclaimed 99.48% 99.49%
%age total pages scanned/written 0.01% 0.01%
%age file pages scanned/written 0.00% 0.00%
Percentage Time Spent Direct Reclaim 0.00% 0.00%
Percentage Time kswapd Awake 12.37% 14.10%

proc vmstat: Faults
                postmark-traceonly-v3r1-postmark   postmark-kanyzone-v2r6-postmark
traceonly-v3r1 kanyzone-v2r6
Major Faults 1979 1741
Minor Faults 13660834 13587939
Page ins 89060 74704
Page outs 69800 58884
Swap ins 1193 1499
Swap outs 2403 2562

Still, IO performance was improved (higher rates of read and write) and the
test completed significantly faster with this patch series applied. kswapd
was awake for longer and reclaimed marginally more pages, with more swap-ins
and swap-outs, which is unfortunate, but that is somewhat balanced by fewer
faults and fewer page-ins. Basically, in terms of reclaim the figures are so
close that the differences are within the variation lumpy reclaim shows
depending on the exact state of the system when reclaim starts.

> It's not easy, but I believe you can do it well, as you have done until
> now. I didn't see the whole series, so I might have missed something.
>

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab