Re: [PATCH 1/5] mm: kswapd: Stop high-order balancing when any suitable zone is balanced

From: Mel Gorman
Date: Tue Dec 07 2010 - 04:49:29 EST


On Tue, Dec 07, 2010 at 10:32:45AM +0900, Minchan Kim wrote:
> On Mon, Dec 6, 2010 at 7:55 PM, Mel Gorman <mel@xxxxxxxxx> wrote:
> > On Mon, Dec 06, 2010 at 08:35:18AM +0900, Minchan Kim wrote:
> >> Hi Mel,
> >>
> >> On Fri, Dec 3, 2010 at 8:45 PM, Mel Gorman <mel@xxxxxxxxx> wrote:
> >> > When the allocator enters its slow path, kswapd is woken up to balance the
> >> > node. It continues working until all zones within the node are balanced. For
> >> > order-0 allocations, this makes perfect sense but for higher orders it can
> >> > have unintended side-effects. If the zone sizes are imbalanced, kswapd may
> >> > reclaim heavily within a smaller zone discarding an excessive number of
> >> > pages. The user-visible behaviour is that kswapd is awake and reclaiming
> >> > even though plenty of pages are free from a suitable zone.
> >> >
> >> > This patch alters the "balance" logic for high-order reclaim allowing kswapd
> >> > to stop if any suitable zone becomes balanced to reduce the number of pages
> >> > it reclaims from other zones. kswapd still tries to ensure that order-0
> >> > watermarks for all zones are met before sleeping.
> >> >
> >> > Signed-off-by: Mel Gorman <mel@xxxxxxxxx>
> >>
> >> <snip>
> >>
> >> > -       if (!all_zones_ok) {
> >> > +       if (!(all_zones_ok || (order && any_zone_ok))) {
> >> >                cond_resched();
> >> >
> >> >                try_to_freeze();
> >> > @@ -2361,6 +2366,31 @@ out:
> >> >                goto loop_again;
> >> >        }
> >> >
> >> > +       /*
> >> > +        * If kswapd was reclaiming at a higher order, it has the option of
> >> > +        * sleeping without all zones being balanced. Before it does, it must
> >> > +        * ensure that the watermarks for order-0 on *all* zones are met and
> >> > +        * that the congestion flags are cleared
> >> > +        */
> >> > +       if (order) {
> >> > +               for (i = 0; i <= end_zone; i++) {
> >> > +                       struct zone *zone = pgdat->node_zones + i;
> >> > +
> >> > +                       if (!populated_zone(zone))
> >> > +                               continue;
> >> > +
> >> > +                       if (zone->all_unreclaimable && priority != DEF_PRIORITY)
> >> > +                               continue;
> >> > +
> >> > +                       zone_clear_flag(zone, ZONE_CONGESTED);
> >>
> >> Why clear ZONE_CONGESTED?
> >> If you have a cause, please, write down the comment.
> >>
> >
> > It's because kswapd is the only mechanism that clears the congestion
> > flag. If it's not cleared and kswapd goes to sleep, the flag could be
> > left set causing hard-to-diagnose stalls. I'll add a comment.
>
> Seems good.
>

Ok.

> >
> >> <snip>
> >>
> >> First impression on this patch is that it changes scanning behavior as
> >> well as reclaiming on high order reclaim.
> >
> > It does affect scanning behaviour for high-order reclaim. Specifically,
> > it may stop scanning once a zone is balanced within the node. Previously
> > it would continue scanning until all zones were balanced. Is this what
> > you are thinking of or something else?
>
> Yes. I mean page aging of high zones.
>

When high-order node balancing is finished (aging zones as before), a
check is made to ensure that all zones are balanced for order-0. If not,
kswapd stays awake, continuing to age zones as before. Zones will not age
as aggressively now that high-order balancing finishes earlier but, as part
of the bug report is that too many pages are being freed by kswapd, this is
a good thing.
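
As a rough illustration of the two-step check described above, here is a
minimal userspace sketch. struct zone_model and both function names are
made up for the example and are not kernel code; it also leaves out the
all_unreclaimable and ZONE_CONGESTED handling from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct zone. */
struct zone_model {
    bool populated;      /* zone has any pages at all */
    bool order0_ok;      /* order-0 watermark is met */
    bool high_order_ok;  /* watermark is met at the requested order */
};

/* Is this zone balanced for an allocation of the given order? */
static bool zone_ok(const struct zone_model *z, int order)
{
    return order ? z->high_order_ok : z->order0_ok;
}

/*
 * Step 1: may the balancing loop stop? This is the inverse of the patched
 * test !(all_zones_ok || (order && any_zone_ok)).
 */
bool balancing_may_stop(const struct zone_model *zones, size_t n, int order)
{
    bool all_zones_ok = true;
    bool any_zone_ok = false;

    assert(zones != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!zones[i].populated)
            continue;
        if (zone_ok(&zones[i], order))
            any_zone_ok = true;
        else
            all_zones_ok = false;
    }
    return all_zones_ok || (order && any_zone_ok);
}

/*
 * Step 2: even if high-order balancing stopped early, kswapd only sleeps
 * when every populated zone meets its order-0 watermark.
 */
bool kswapd_may_sleep(const struct zone_model *zones, size_t n, int order)
{
    if (!balancing_may_stop(zones, n, order))
        return false;
    for (size_t i = 0; i < n; i++) {
        if (zones[i].populated && !zones[i].order0_ok)
            return false;
    }
    return true;
}
```

With one zone balanced at the high order and another only at order-0,
balancing_may_stop() returns true for order > 0, and kswapd_may_sleep()
still refuses if any populated zone misses its order-0 watermark.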

> >
> >> I can't say old behavior is right but we can't say this behavior is
> >> right, too although this patch solves the problem. At least, we might
> >> need some data that shows this patch doesn't have a regression.
> >
> > How do you suggest it be tested and this data be gathered? I tested a number of
> > workloads that keep kswapd awake but found no differences of major significant
> > even though it was using high-order allocations. The  problem with identifying
> > small regressions for high-order allocations is that the state of the system
> > when lumpy reclaim starts is very important as it determines how much work
> > has to be done. I did not find major regressions in performance.
> >
> > For the tests I did run;
> >
> > fsmark showed nothing useful. iozone showed nothing useful either as it didn't
> > even wake kswapd. sysbench showed minor performance gains and losses but it
> > is not useful as it typically does not wake kswapd unless the database is
> > badly configured.
> >
> > I ran postmark because it was the closest benchmark to a mail simulator I
> > had access to. This sucks because it's no longer representative of a mail
> > server and is more like a crappy filesystem benchmark. To get it closer to a
> > real server, there was also a program running in the background that mapped
> > a large anonymous segment and scanned it in blocks.
> >
> > POSTMARK
> >            postmark-traceonly-v3r1-postmarkpostmark-kanyzone-v2r6-postmark
> >                traceonly-v3r1     kanyzone-v2r6
> > Transactions per second:                2.00 ( 0.00%)     2.00 ( 0.00%)
> > Data megabytes read per second:         8.14 ( 0.00%)     8.59 ( 5.24%)
> > Data megabytes written per second:     18.94 ( 0.00%)    19.98 ( 5.21%)
> > Files created alone per second:         4.00 ( 0.00%)     4.00 ( 0.00%)
> > Files create/transact per second:       1.00 ( 0.00%)     1.00 ( 0.00%)
> > Files deleted alone per second:        34.00 ( 0.00%)    30.00 (-13.33%)
>
> Do you know the reason only file deletion has a big regression?
>

I'm guessing bad luck because it's not stable. There is a large memory
consumer running in the background. If the timing of when it got swapped
out changed, it could have regressed. It's not very stable between runs.
Sometimes the files deleted figure is not affected at all, but every time
the reads/writes per second are higher and the total time to completion is
lower.

> > Files delete/transact per second:       1.00 ( 0.00%)     1.00 ( 0.00%)
> >
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds)         152.4    152.92
> > Total Elapsed Time (seconds)               5110.96   4847.22
> >
> > FTrace Reclaim Statistics: vmscan
> >            postmark-traceonly-v3r1-postmarkpostmark-kanyzone-v2r6-postmark
> >                traceonly-v3r1     kanyzone-v2r6
> > Direct reclaims                                  0          0
> > Direct reclaim pages scanned                     0          0
> > Direct reclaim pages reclaimed                   0          0
> > Direct reclaim write file async I/O              0          0
> > Direct reclaim write anon async I/O              0          0
> > Direct reclaim write file sync I/O               0          0
> > Direct reclaim write anon sync I/O               0          0
> > Wake kswapd requests                             0          0
> > Kswapd wakeups                                2177       2174
> > Kswapd pages scanned                      34690766   34691473
>
> Perhaps, in your workload, any_zone is highest zone.
> If any_zone became low zone, kswapd pages scanned would have a big
> difference because old behavior try to balance all zones.

It'll still balance the zones for order-0, the size we care most about.

> Could we evaluate this situation? but I have no idea how we set up the
> situation. :(
>

See the rest of the series. The main consequence of any_zone being a low
zone is that balancing can stop because ZONE_DMA is balanced even though it
is unusable for allocations. Patch 3 takes the classzone_idx into account
when deciding if kswapd should go to sleep. The final patch in
the series replaces "any zone" with "at least 25% of the pages making up
the node must be balanced". The situation could be forced artificially by
preventing pages ever being allocated from ZONE_DMA but we wouldn't be able
to draw any sensible conclusion from it as patch 5 in the series handles it.
This is why I'm depending on Simon's reports to see if his corner case is fixed
while running other stress tests to see if anything else is noticeably worse.
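
To make the "at least 25% of the pages" rule concrete, here is a small
userspace sketch. It is not the code from the series: struct zone_pages,
node_balanced_enough() and the choice to count only zones up to
classzone_idx are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone summary; field names are made up for this sketch. */
struct zone_pages {
    unsigned long present_pages;  /* pages making up the zone */
    bool balanced;                /* zone meets its watermark */
};

/*
 * Treat the node as balanced when the balanced zones hold at least 25% of
 * the pages in zones 0..classzone_idx. Assumption: zones above
 * classzone_idx are ignored because the allocation cannot use them.
 */
bool node_balanced_enough(const struct zone_pages *zones, size_t n,
                          size_t classzone_idx)
{
    unsigned long present = 0;
    unsigned long balanced = 0;

    assert(zones != NULL || n == 0);
    for (size_t i = 0; i < n && i <= classzone_idx; i++) {
        present += zones[i].present_pages;
        if (zones[i].balanced)
            balanced += zones[i].present_pages;
    }
    if (present == 0)
        return false;
    /* balanced * 4 >= present, written without risking overflow */
    return balanced >= present / 4 + (present % 4 != 0);
}
```

With a 4000-page ZONE_DMA balanced and a 1000000-page higher zone not
balanced, the function returns false, so a tiny balanced low zone alone no
longer lets kswapd stop.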

> > Kswapd pages reclaimed                    34511965   34513478
> > Kswapd reclaim write file async I/O             32          0
> > Kswapd reclaim write anon async I/O           2357       2561
> > Kswapd reclaim write file sync I/O               0          0
> > Kswapd reclaim write anon sync I/O               0          0
> > Time stalled direct reclaim (seconds)         0.00       0.00
> > Time kswapd awake (seconds)                 632.10     683.34
> >
> > Total pages scanned                       34690766  34691473
> > Total pages reclaimed                     34511965  34513478
> > %age total pages scanned/reclaimed          99.48%    99.49%
> > %age total pages scanned/written             0.01%     0.01%
> > %age  file pages scanned/written             0.00%     0.00%
> > Percentage Time Spent Direct Reclaim         0.00%     0.00%
> > Percentage Time kswapd Awake                12.37%    14.10%
>
> Is "kswapd Awake" correct?
> AFAIR, In your implementation, you seems to account kswapd time even
> though kswapd are schedule out.
> I mean, for example,
>
> kswapd
> -> time stamp start
> -> balance_pgdat
> -> cond_resched(kswapd schedule out)
> -> app 1 start
> -> app 2 start
> -> kswapd schedule in
> -> time stamp end.
>
> If it's right, kswapd awake doesn't have a big meaning.
>

"Time kswapd awake" is the time between when

  Trace event mm_vmscan_kswapd_wake is recorded while kswapd is asleep, and
  Trace event mm_vmscan_kswapd_sleep is recorded just before kswapd calls
  schedule() to properly go to sleep.

It's possible to receive mm_vmscan_kswapd_wake multiple times while kswapd
is asleep but it is ignored.
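
The accounting above can be sketched as follows. This is not the MMTests
post-processing script, just a minimal model of it: struct trace_rec, the
enum and kswapd_awake_seconds() are hypothetical names, and timestamps are
assumed to be in seconds and in increasing order. It measures wall-clock
intervals only, so it has no way of excluding time kswapd spends scheduled
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the two trace events of interest. */
enum kswapd_event { KSWAPD_WAKE, KSWAPD_SLEEP };

struct trace_rec {
    double ts;              /* timestamp in seconds */
    enum kswapd_event ev;   /* which event was recorded */
};

/*
 * Sum the intervals that start at the first wake event seen while kswapd
 * is considered asleep and end at the next sleep event. Wake events that
 * arrive before that sleep event are ignored.
 */
double kswapd_awake_seconds(const struct trace_rec *recs, size_t n)
{
    bool awake = false;
    double start = 0.0;
    double total = 0.0;

    assert(recs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (recs[i].ev == KSWAPD_WAKE) {
            if (!awake) {
                awake = true;
                start = recs[i].ts;
            }
        } else if (awake) {
            total += recs[i].ts - start;
            awake = false;
        }
    }
    return total;
}
```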

If kswapd schedules out normally or is stalled on direct writeback, this
time is included in the above value. Maybe a better name for this is
"kswapd active".

> >
> > proc vmstat: Faults
> >            postmark-traceonly-v3r1-postmarkpostmark-kanyzone-v2r6-postmark
> >                traceonly-v3r1     kanyzone-v2r6
> > Major Faults                                  1979      1741
> > Minor Faults                              13660834  13587939
> > Page ins                                     89060     74704
> > Page outs                                    69800     58884
> > Swap ins                                      1193      1499
> > Swap outs                                     2403      2562
> >
> > Still, IO performance was improved (higher rates of read/write) and the test
> > completed significantly faster with this patch series applied.  kswapd was
> > awake for longer and reclaimed marginally more pages with more swap-ins and
>
> Longer wake may be due to wrong gathering of time as I said.
>

Possibly, but I don't think so. I'm more inclined to blame the
effectively random interaction between postmark and the memory consumer
running in the background.

> > swap-outs which is unfortunate but it's somewhat balanced by fewer faults
> > and fewer page-ins. Basically, in terms of reclaim the figures are so close
> > that it is within the performance variations lumpy reclaim has depending on
> > the exact state of the system when reclaim starts.
>
> What I wanted to see is that when if zones above any_zone isn't aging
> how it affect system performance.

The only test I ran that would be affected is a streaming IO test but
it's only one aspect of memory reclaim behaviour (albeit one that
people tend to complain about when it's broken).

> This patch is changing balancing mechanism of kswapd so I think the
> experiment is valuable.
> I don't want to make contributors to be tired by bad reviewer.
> What do you think about that?
>

About all I can report on is the streaming IO benchmark results, which
look like:

MICRO
                                       traceonly   kanyzone
MMTests Statistics: duration
User/Sys Time Running Test (seconds)       24.23      23.93
Total Elapsed Time (seconds)              916.18     916.69

FTrace Reclaim Statistics: vmscan
                                       traceonly   kanyzone
Direct reclaims                             2437       2565
Direct reclaim pages scanned             1688201    1801142
Direct reclaim write file async I/O            0          0
Direct reclaim write anon async I/O           14          0
Direct reclaim write file sync I/O             0          0
Direct reclaim write anon sync I/O             0          0
Wake kswapd requests                     1333358    1417622
Kswapd wakeups                               107        116
Kswapd pages scanned                    15801484   15706394
Kswapd reclaim write file async I/O           44         24
Kswapd reclaim write anon async I/O           25          0
Kswapd reclaim write file sync I/O             0          0
Kswapd reclaim write anon sync I/O             0          0
Time stalled direct reclaim (seconds)       1.79       0.98
Time kswapd awake (seconds)               387.60     410.26

Total pages scanned                     17489685   17507536
%age total pages scanned/reclaimed         0.00%      0.00%
%age total pages scanned/written           0.00%      0.00%
%age file pages scanned/written            0.00%      0.00%
Percentage Time Spent Direct Reclaim       6.88%      3.93%
Percentage Time kswapd Awake              42.31%     44.75%

proc vmstat: Faults
            micro-traceonly-v3r1-micro  micro-kanyzone-v3r1-micro
                          traceonly-v3r1   kanyzone-v3r1
Major Faults                        1943            1808
Minor Faults                    55488625        55441993
Page ins                          134044          126640
Page outs                          73884           69248
Swap ins                            2322            1972
Swap outs                           7291            6521

Total pages scanned differs by 0.1%, which is not much. Time to completion
is more or less the same. Faults, paging activity and swap activity are
all slightly reduced.
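
For anyone wanting to check the percentages against the tables, the
arithmetic is just a relative change against the traceonly baseline. A
minimal sketch follows; pct_change() is a name made up for this example and
is not part of MMTests.

```c
#include <assert.h>

/*
 * Relative change of the patched value against the baseline, in percent.
 * Positive means the patched kernel's figure is higher.
 */
double pct_change(double base, double patched)
{
    assert(base != 0.0);
    return (patched - base) / base * 100.0;
}
```

Plugging in the figures from the MICRO tables, pct_change(17489685,
17507536) is roughly +0.10 for total pages scanned and pct_change(1943,
1808) is roughly -6.9 for major faults.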

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab