Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8

From: Mel Gorman
Date: Mon Jul 04 2016 - 00:34:18 EST


On Mon, Jul 04, 2016 at 10:37:03AM +0900, Minchan Kim wrote:
> > The reason we have zone-based reclaim is that we used to have
> > large highmem zones in common configurations and it was necessary
> > to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
> > less of a concern as machines with lots of memory will (or should) use
> > 64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
> > rare. Machines that do use highmem should have lower highmem:lowmem
> > ratios than we worried about in the past.
>
> Hello Mel,
>
> I absolutely agree with the direction. However, I have a concern
> about highmem systems, as you already mentioned.
>
> Embedded products still use highmem:lowmem ratios of around 2:1 ~ 3:1.
> On such systems, the LRU churn caused by frequently skipping pages
> from other zones might be significant for performance.
>
> How big a highmem:lowmem ratio do you think would be a problem?
>

That's a "how long is a piece of string" type question. The ratio does
not matter as much as whether the workload is both under memory pressure
and requires large amounts of lowmem pages. Even on systems with very high
ratios, it may not be a problem if HIGHPTE is enabled.
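
The reason HIGHPTE helps is that user page tables are then allocated
from highmem instead of pinning lowmem, removing one of the main
per-process consumers of lowmem. On x86 it comes down to a GFP flag on
the PTE page allocation, roughly like this (a simplified sketch along
the lines of arch/x86/mm/pgtable.c, not the verbatim source):

	/*
	 * Sketch: with CONFIG_HIGHPTE, user PTE pages may come from
	 * highmem; without it, every process's page tables consume
	 * lowmem, which is what generates lowmem pressure.
	 */
	#ifdef CONFIG_HIGHPTE
	#define PGALLOC_USER_GFP	__GFP_HIGHMEM
	#else
	#define PGALLOC_USER_GFP	0
	#endif

	pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
	{
		return alloc_pages(GFP_KERNEL | __GFP_ZERO | PGALLOC_USER_GFP, 0);
	}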

> >
> > Conceptually, moving to node LRUs should be easier to understand. The
> > page allocator plays fewer tricks to game reclaim and reclaim behaves
> > similarly on all nodes.
> >
> > The series has been tested on a 16 core UMA machine and a 2-socket 48
> > core NUMA machine. The UMA results are presented in most cases as the NUMA
> > machine behaved similarly.
>
> I guess you have already tested the below on various highmem systems
> (e.g., 2:1, 3:1, 4:1 and so on). If you have, would you mind sharing it?
>

I don't have that data; the baseline distribution used doesn't even have
32-bit support. Even if it did, the results may not be that interesting.
The workloads used were not necessarily going to trigger lowmem pressure
as HIGHPTE was set on the 32-bit configs.

The skip logic has been checked and it does work. This was done during
development by forcing the use of the "wrong" reclaim index. The effect
was noticeable in system CPU usage and in the "skip" stats, but I didn't
preserve this data.
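
For reference, the skip itself is only a few lines in
isolate_lru_pages(); pages belonging to a zone above the reclaim index
are set aside and accounted rather than isolated. Paraphrasing the code
in the series:

	/*
	 * Pages from zones higher than the one being reclaimed for
	 * are not isolated. They are moved to a skip list, spliced
	 * back onto the LRU afterwards and counted in the per-zone
	 * PGSCAN_SKIP stats -- the "skip" stats mentioned above.
	 */
	if (page_zonenum(page) > sc->reclaim_idx) {
		list_move(&page->lru, &pages_skipped);
		nr_skipped[page_zonenum(page)]++;
		continue;
	}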

> >                                    4.7.0-rc4      4.7.0-rc4
> >                               mmotm-20160623     nodelru-v8
> > Minor Faults                          645838         644036
> > Major Faults                             573            593
> > Swap Ins                                   0              0
> > Swap Outs                                  0              0
> > Allocation stalls                         24              0
> > DMA allocs                                 0              0
> > DMA32 allocs                        46041453       44154171
> > Normal allocs                       78053072       79865782
> > Movable allocs                             0              0
> > Direct pages scanned                   10969          54504
> > Kswapd pages scanned                93375144       93250583
> > Kswapd pages reclaimed              93372243       93247714
> > Direct pages reclaimed                 10969          54504
> > Kswapd efficiency                        99%            99%
> > Kswapd velocity                    13741.015      13711.950
> > Direct efficiency                       100%           100%
> > Direct velocity                        1.614          8.014
> > Percentage direct scans                   0%             0%
> > Zone normal velocity                8641.875      13719.964
> > Zone dma32 velocity                 5100.754          0.000
> > Zone dma velocity                      0.000          0.000
> > Page writes by reclaim                 0.000          0.000
> > Page writes file                           0              0
> > Page writes anon                           0              0
> > Page reclaim immediate                    37             54
> >
> > kswapd activity was roughly comparable. There were differences in direct
> > reclaim activity but negligible in the context of the overall workload
> > (velocity of 8 pages per second with the patches applied, 1.6 pages per
> > second in the baseline kernel).
>
> Hmm, nodelru's allocation stalls are zero above, so how do direct
> pages get scanned/reclaimed?
>

Good spot. It's because I used the wrong comparison script -- one that
doesn't understand the different skip and allocation stats -- and I was
looking primarily at the scanning activity. This is the correct version:

                                   4.7.0-rc4      4.7.0-rc4
                              mmotm-20160623  nodelru-v8r26
Minor Faults                          645838         643815
Major Faults                             573            493
Swap Ins                                   0              0
Swap Outs                                  0              0
DMA allocs                                 0              0
DMA32 allocs                        46041453       44174923
Normal allocs                       78053072       79816443
Movable allocs                             0              0
Allocation stalls                         24             31
Stall zone DMA                             0              0
Stall zone DMA32                           0              0
Stall zone Normal                          0              1
Stall zone HighMem                         0              0
Stall zone Movable                         0             30
Direct pages scanned                   10969          14198
Kswapd pages scanned                93375144       93252534
Kswapd pages reclaimed              93372243       93249856
Direct pages reclaimed                 10969          14198
Kswapd efficiency                        99%            99%
Kswapd velocity                    13741.015      13742.771
Direct efficiency                       100%           100%
Direct velocity                        1.614          2.092
Percentage direct scans                   0%             0%
Page writes by reclaim                     0              0
Page writes file                           0              0
Page writes anon                           0              0
Page reclaim immediate                    37             29

The points about kswapd and direct reclaim activity still hold.
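
(For anyone unfamiliar with the report format: velocity is pages scanned
per second of elapsed time, so 93252534 kswapd pages scanned at
13742.771 pages/sec works out to roughly 6786 seconds of runtime, which
is essentially identical between the two kernels.)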

> Above, the DMA32 allocs in nodelru are almost the same but the zone
> dma32 velocity is zero. What does that mean?
>

It's a consequence of using the wrong script when cutting and pasting
the final data. With node-lru, "zone dma32 velocity" is meaningless and
the reporting script no longer includes it.

--
Mel Gorman
SUSE Labs