On Tue, May 14, 2019 at 01:44:35PM -0700, Yang Shi wrote:
> On 5/13/19 11:20 PM, Michal Hocko wrote:
> > On Mon 13-05-19 21:36:59, Yang Shi wrote:
> > > On Mon, May 13, 2019 at 2:45 PM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > > > On Mon 13-05-19 14:09:59, Yang Shi wrote:
> > > > [...]
> > > > > I think we can just account 512 base pages for nr_scanned for
> > > > > isolate_lru_pages() to make the counters sane, since PGSCAN_KSWAPD/DIRECT
> > > > > just use it.
> > > > >
> > > > > And, sc->nr_scanned should be accounted as 512 base pages too, otherwise
> > > > > we may have nr_scanned < nr_to_reclaim all the time, resulting in false
> > > > > negatives for the priority raise and other things going wrong (e.g. wrong
> > > > > vmpressure).
> > > >
> > > > Be careful. nr_scanned is used as a pressure indicator for slab shrinking,
> > > > AFAIR. Maybe this is ok, but it really begs for much more explanation.
> > >
> > > I don't know why my company mailbox didn't receive this email, so I
> > > replied with my personal email.
> > >
> > > It is not used to double slab pressure any more since commit
> > > 9092c71bb724 ("mm: use sc->priority for slab shrink targets"). It uses
> > > sc->priority to determine the pressure for slab shrinking now.
> > >
> > > So, I think we can just remove that "double slab pressure" code. It is
> > > not actually used and looks confusing now. In fact, the "double slab
> > > pressure" code does the opposite: the extra increment of sc->nr_scanned
> > > just prevents sc->priority from being raised.
> >
> > I have to get in sync with the recent changes. I am aware there were
> > some patches floating around, but I didn't get to review them. I was
> > trying to point out that nr_scanned used to have a side effect to be
> > careful about. If it doesn't have one anymore, then this is getting
> > much easier, of course. Please document everything in the changelog.
>
> Thanks for reminding me. Yes, I remembered that nr_scanned would double
> slab pressure. But when I inspected the code yesterday, it turned out
> that is no longer true. I will run some tests to make sure this doesn't
> introduce a regression.

Yeah, sc->nr_scanned is used for three things right now:

1. vmpressure - this looks at the scanned/reclaimed ratio, so it won't
change semantics as long as scanned & reclaimed are fixed in parallel.

2. compaction/reclaim - this is broken. Compaction wants a certain
number of physical pages freed up before going back to compacting.
Without Yang Shi's fix, we can overreclaim by a factor of 512.

3. kswapd priority raising - this is broken. kswapd raises priority if
we scan fewer pages than the reclaim target (which itself is obviously
expressed in order-0 pages). As a result, kswapd can falsely raise its
aggressiveness even when it's making great progress; see the sketch
below.
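
To illustrate 3, this is roughly the mechanism, paraphrased from memory
of the 5.1-era kswapd_shrink_node()/balance_pgdat() rather than quoted
verbatim:

	/* kswapd_shrink_node(): did we scan enough at this priority? */
	return sc->nr_scanned >= sc->nr_to_reclaim;

	/* balance_pgdat(): if not, kswapd gets more aggressive */
	if (raise_priority || !nr_reclaimed)
		sc.priority--;

With a THP counted as a single scan event, sc->nr_scanned can lag
sc->nr_to_reclaim by a factor of 512, so the priority keeps dropping
even while reclaim is freeing plenty of memory.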
Both sc->nr_scanned & sc->nr_reclaimed should be fixed.

> BTW, I noticed the counter of memory reclaim is not correct with THP
> swap on the vanilla kernel, please see the below:
>
> pgsteal_kswapd 21435
> pgsteal_direct 26573329
> pgscan_kswapd 3514
> pgscan_direct 14417775
>
> pgsteal is always greater than pgscan; my patch could fix the problem.

Ouch, how is that possible with the current code?
I think it happens when isolate_lru_pages() counts 1 nr_scanned for a
THP, then shrink_page_list() splits the THP and we reclaim tail pages
one by one. This goes all the way back to the initial THP patch!
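
As a purely illustrative trace of the accounting (not actual kernel
code):

	/*
	 * One 2MB THP on the LRU:
	 *
	 *   isolate_lru_pages():  (*nr_scanned)++;    -> pgscan  += 1
	 *   shrink_page_list():   split_huge_page(),
	 *   then each subpage is reclaimed separately -> pgsteal += 512
	 *
	 * so pgsteal can run ahead of pgscan by up to 512x.
	 */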
isolate_lru_pages() needs to be fixed. Its return value, nr_taken, is
correct, but its *nr_scanned parameter is wrong, which causes issues:
1. The trace point, as Yang Shi pointed out, will underreport the
number of pages scanned, since it reports it alongside nr_to_scan (base
pages) and nr_taken (base pages).

2. vmstat and memory.stat count 'struct page' operations rather than
base pages, which makes sense to neither users nor kernel developers
(I routinely multiply these counters by 4096 to get a sense of the
work performed).
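
(For example, taking pgsteal_direct from the numbers above at face
value: 26573329 * 4096 bytes is roughly 101 GiB supposedly reclaimed by
direct reclaim - a figure that only means something if the counter is
really in base pages.)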
All of isolate_lru_pages()'s accounting should be in base pages, which
includes nr_scanned and PGSCAN_SKIPPED.
That should also simplify the code, e.g.:

	for (total_scan = 0;
	     scan < nr_to_scan && nr_taken < nr_to_scan && !list_empty(src);
	     total_scan++) {
Exiting with scan < nr_to_scan because nr_taken >= nr_to_scan is a
weird condition that does not make sense in page reclaim, IMO. Reclaim
cares about physical memory - freeing one THP is as much progress for
reclaim as freeing 512 order-0 pages.
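
With everything in base pages, a rough and untested sketch of where
that loop could go (hpage_nr_pages() is the helper I mean for the
base-page count; the surrounding code is paraphrased):

	total_scan = 0;
	while (scan < nr_to_scan && !list_empty(src)) {
		struct page *page = lru_to_page(src);
		int nr_pages = hpage_nr_pages(page);	/* 1 or 512 */

		total_scan += nr_pages;
		...
		scan += nr_pages;	/* same units as nr_taken */
		...
	}

and the separate nr_taken < nr_to_scan clause disappears, because scan
and nr_taken now move in the same units.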
IMO *all* '++' in vmscan.c are suspicious and should be reviewed:
nr_scanned, nr_reclaimed, nr_dirty, nr_unqueued_dirty, nr_congested,
nr_immediate, nr_writeback, nr_ref_keep, nr_unmap_fail, pgactivate,
total_scan & scan, nr_skipped.
Yang Shi, it would be nice if you could convert all of these to base
page accounting in one patch, as it's a single logical fix for the
initial introduction of THP that let huge pages show up on the LRUs.
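
The conversion pattern would be the same everywhere, e.g. for
nr_reclaimed (illustrative only, assuming a local nr_pages holding
hpage_nr_pages(page)):

	-	nr_reclaimed++;
	+	nr_reclaimed += nr_pages;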
[ check_move_unevictable_pages() seems weird. It gets a pagevec from
find_get_entries(), which, if I understand the THP page cache code
correctly, might contain the same compound page over and over. It'll
be !unevictable after the first iteration, so will only run once. So
it produces incorrect numbers now, but it is probably best to ignore
it until we figure out THP cache. Maybe add an XXX comment. ]
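
Something along these lines, as a suggestion for that comment (the
wording is mine):

	/*
	 * XXX: the pagevec from find_get_entries() may contain the same
	 * compound page repeatedly, and the counters below are per
	 * 'struct page' operation rather than per base page, so the
	 * numbers are off for THP. Revisit with THP page cache.
	 */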