Re: [PATCH v2] perf script python: integrate page reclaim analyze script

From: Mel Gorman
Date: Tue Oct 01 2019 - 10:45:29 EST


On Mon, Sep 30, 2019 at 11:19:44PM -0400, Yafang Shao wrote:
> A new perf script page-reclaim is introduced in this patch. This new script
> is used to report the page reclaim details. The possible usage of this
> script is as below,
> - identify latency spike caused by direct reclaim
> - whether the latency spike is related to pageout
> - why is page reclaim requested, i.e. whether it is because of memory
> fragmentation
> - page reclaim efficiency
> etc
> In the future we may also enhance it to analyze the memcg reclaim.
>

Hi,

I ended up not reviewing this patch in detail simply because I would
approach the same class of problem in an entirely different way today.
There is value in accumulating the stats in a report like this:

> $ perf script report page-reclaim
> Direct reclaims: 4924
> Direct latency (ms)        total        max      avg    min
>                       177823.211   6378.977   36.114  0.051
> Direct file reclaimed 22920
> Direct file scanned 28306
> Direct file sync write I/O 0
> Direct file async write I/O 0
> Direct anon reclaimed 212567
> Direct anon scanned 1446854
> Direct anon sync write I/O 0
> Direct anon async write I/O 278325
> Direct order       0     1     3
>                 4870    23    31
> Wake kswapd requests 716
> Wake order      0     1
>               715     1
>
> Kswapd reclaims: 9

However, what I would prefer as the basic option is the raw latency
data behind "Direct latency" in a form that can be parsed externally by R
or any other statistical tool. Knowing the max latency is not enough; I'd
want to know the spread of latencies and whether they were clustered at a
point in time or spread out over long periods of time. I would then build
the higher-level reports on top if necessary.
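
As a rough illustration of that kind of external post-processing, here is
a minimal Python sketch. It assumes the raw data has already been reduced
to (timestamp in seconds, latency in ms) pairs; the function name, the
window size and the "slow" threshold are all hypothetical and not part of
the patch or of perf.

```python
from collections import Counter
from statistics import quantiles


def summarize(samples, window_s=1.0, slow_ms=100.0):
    """Summarize direct reclaim latencies.

    samples: iterable of (timestamp_s, latency_ms) tuples (assumed format).
    Returns (spread, slow_per_window).
    """
    samples = sorted(samples)
    latencies = [lat for _, lat in samples]

    # quantiles(n=100) returns 99 cut points; index 49 is the median.
    # Needs at least two samples.
    cuts = quantiles(latencies, n=100, method="inclusive")
    spread = {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}

    # Count slow events per time window. A few windows with high counts
    # suggests clustering; many windows with low counts suggests the
    # stalls are spread out over the whole run.
    slow_per_window = Counter(
        int(ts // window_s) for ts, lat in samples if lat >= slow_ms
    )
    return spread, dict(slow_per_window)
```

The percentiles describe the spread, while the per-window counts show
whether the slow reclaims are bunched together in time.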

Today, I would also have considered getting the latency figures using eBPF
or systemtap instead, although having perf do it may be useful too. eBPF
and systemtap are not universally popular though, so at minimum I would have:

  perf script record page-reclaim -- capture all page-reclaim tracepoints
  perf script report page-reclaim -- For reclaim entry/exit, merge the two
                                     tracepoints into one that reports
                                     latency. Dump the rest out verbatim
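
The merge step for the report side could be sketched roughly as below.
This is not the patch's code: it assumes events arrive in time order as
(timestamp in seconds, pid, tracepoint name) tuples, which is a
hypothetical simplification of what a perf script handler receives. The
two vmscan tracepoint names are the kernel's direct reclaim begin/end
events; the merged "direct_reclaim" record name is made up for the
example.

```python
BEGIN = "mm_vmscan_direct_reclaim_begin"
END = "mm_vmscan_direct_reclaim_end"


def merge_reclaim_events(events):
    """Pair direct reclaim begin/end events per pid.

    events: iterable of (timestamp_s, pid, name) in time order (assumed).
    Yields merged (start_s, pid, "direct_reclaim", latency_ms) records for
    matched pairs and passes every other event through unchanged.
    """
    pending = {}  # pid -> timestamp of the outstanding begin event
    for ts, pid, name in events:
        if name == BEGIN:
            pending[pid] = ts
        elif name == END:
            start = pending.pop(pid, None)
            if start is None:
                # End without a begin, e.g. tracing started mid-reclaim:
                # nothing to merge, so dump it verbatim.
                yield (ts, pid, name)
            else:
                yield (start, pid, "direct_reclaim", (ts - start) * 1000.0)
        else:
            # Any other tracepoint is dumped out verbatim.
            yield (ts, pid, name)
```

Keying on pid works here because direct reclaim begins and ends in the
context of the same task.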

For latencies, I would externally post-process them until such time as I
found a common class of bug that needed a high-level report and then
build the perf script support for it.

Please note that I did not spot anything wrong with your script; it's
just that I would not use it myself in its current format for debugging
a reclaim-related problem.

--
Mel Gorman
SUSE Labs