Re: [PATCH v2 00/16] Multigenerational LRU Framework

From: Yu Zhao
Date: Sat Apr 24 2021 - 17:21:50 EST

Next message: Emmanuel Gil Peyrot: "[PATCH v2 0/3] eeprom-93xx46: Add support for Atmel AT93C56 and AT93C66"
Previous message: Ansuel Smith: "Re: [PATCH 11/14] drivers: net: dsa: qca8k: apply switch revision fix"
In reply to: Yu Zhao: "Re: [PATCH v2 00/16] Multigenerational LRU Framework"
Next in thread: Yu Zhao: "Re: [PATCH v2 00/16] Multigenerational LRU Framework"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, Apr 14, 2021 at 7:36 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Wed, Apr 14, 2021 at 01:16:52AM -0600, Yu Zhao wrote:
> > On Tue, Apr 13, 2021 at 10:50 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote:
> > > > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > > Profiles would be interesting, because it sounds to me like reclaim
> > > > > *might* be batching page cache removal better (e.g. fewer, larger
> > > > > batches) and so spending less time contending on the mapping tree
> > > > > lock...
> > > > >
> > > > > IOWs, I suspect this result might actually be a result of less lock
> > > > > contention due to a change in batch processing characteristics of
> > > > > the new algorithm rather than it being a "better" algorithm...
> > > >
> > > > I appreciate the profile. But there is no batching in
> > > > __remove_mapping() -- it locks the mapping for each page, and
> > > > therefore the lock contention penalizes the mainline and this patchset
> > > > equally. It looks worse on your system because the four kswapd threads
> > > > from different nodes were working on the same file.
> > >
> > > I think you misunderstand exactly what I mean by "batching" here.
> > > I'm not talking about doing multiple pieces of work under a single
> > > lock. What I mean is that the overall amount of work done in a
> > > single reclaim scan (i.e a "reclaim batch") is packaged differently.
> > >
> > > We already batch up page reclaim via building a page list and then
> > > passing it to shrink_page_list() to process the batch of pages in a
> > > single pass. Each page in this page list batch then calls
> > > remove_mapping() to pull the page form the LRU, we have a run of
> > > contention between the foreground read() thread and the background
> > > kswapd.
> > >
> > > If the size or nature of the pages in the batch passed to
> > > shrink_page_list() changes, then the amount of time a reclaim batch
> > > is going to put pressure on the mapping tree lock will also change.
> > > That's the "change in batching behaviour" I'm referring to here. I
> > > haven't read through the patchset to determine if you change the
> > > shrink_page_list() algorithm, but it likely changes what is passed
> > > to be reclaimed and that in turn changes the locking patterns that
> > > fall out of shrink_page_list...
> >
> > Ok, if we are talking about the size of the batch passed to
> > shrink_page_list(), both the mainline and this patchset cap it at
> > SWAP_CLUSTER_MAX, which is 32. There are corner cases, but when
> > running fio/io_uring, it's safe to say both use 32.
>
> You're still looking at micro-scale behaviour, not the larger-scale
> batching effects. Are we passing SWAP_CLUSTER_MAX groups of pages to
> shrinker_page_list() at a different rate?
>
> When I say "batch of work" when talking about the page cache cycling
> *500 thousand pages a second* through the cache, I'm not talking
> about batches of 32 pages. I'm talking about the entire batch of
> work kswapd does in an invocation cycle.
>
> Is it scanning 100k pages 10 times a second? or 10k pages a hundred
> times a second? How long does a batch take to run? how long does is
> sleep between processing batches? Is there any change in these
> metrics as a result of the multi-gen LRU patches?

Hi Dave,

Sorry for the late reply. Still catching up on emails.

Well, it doesn't really work that way. Yes, I agree that batching
theoretically can have effects on the performance but the patchset
doesn't change anything in this respect. The number of pages to
reclaim is determined by a common code path shared between the
existing implementation and this patchset. Specifically, ksawpd sets
"sc->nr_to_reclaim" based on the high watermark, and passes "sc" down
to both code path:

static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
<snipped>
+ if (lru_gen_enabled()) {
+ shrink_lru_gens(lruvec, sc);
+ return;
+ }
<snipped>

And there isn't really any new algorithm. It's just the old plain LRU.
The improvement is purely from a feedback loop that helps avoid
unnecessary activations and deactivations. By activations, I mean the
following work done in the buffered io read path:
generic_file_read_iter()
filemap_read()
mark_page_accessed()
activate_page()
Simply put, on the second accessed, the current implementation
unconditionally moves a page from the inactive lru to the active lru,
if it's not already there.

And by deactivations, I mean the following work done in kswapd:
kswapd_shrink_node()
shrink_node()
shrink_node_memcgs()
shrink_lruvec()
shrink_active_list()
i.e., kswapd moves activated pages back to the inactive list before it
can evict them.

For random accesses, every page is equal and none is active. But how
can we tell? The refault rates, i.e., evicted and then accessed again,
for all pages are the same, no matter how many times they have been
accessed. This is exactly how the feedback loop works.

> Basically, we're looking at how access to the mapping lock is
> changing the contention profile, and whether that is signficant or
> not. I suspect it is, because when you have highly contended locks
> and you do something external that reduces unrelated lock
> contention, it's because that external thing is taking more time to
> do and so there's less time to spend hitting locks hard...
>
> As such, I don't think this test is a good measure of the multi-gen
> LRU patches at all - performance is dominated by the severity of
> lock contention external to the LRU scanning algorithm, and it's
> hard to infer anything through suck lock contention....

The lock contention you saw earlier is because of the four nodes
system you used -- each node has a kswapd thread but there is only one
file. The four kswapd threads keep banging on the same mapping lock.
This can be easily fixed if we just run the same test with a single
node *or* with multiple files.

Here is what I got (it's basically what I've attached in the previous email).

Before the patchset:

kswapd
# Children Self Symbol
# ........ ........ .......................................
100.00% 0.00% kswapd
99.90% 0.01% balance_pgdat
99.86% 0.05% shrink_node
98.00% 0.19% shrink_lruvec
91.13% 0.10% shrink_inactive_list
75.67% 5.97% shrink_page_list
57.65% 2.59% __remove_mapping
45.99% 0.27% __delete_from_page_cache
42.88% 0.88% page_cache_delete
39.05% 1.04% xas_store
37.71% 37.67% xas_create
12.62% 11.79% isolate_lru_pages
8.52% 0.98% free_unref_page_list
7.38% 0.61% free_unref_page_commit
6.68% 1.78% free_pcppages_bulk
6.53% 0.83% ***shrink_active_list***
6.45% 3.21% _raw_spin_lock_irqsave
4.82% 4.60% __free_one_page
4.58% 4.21% unlock_page
3.62% 3.55% native_queued_spin_lock_slowpath
2.49% 0.71% unaccount_page_cache_page
2.46% 0.81% workingset_eviction
2.14% 0.33% __mod_lruvec_state
1.97% 1.88% xas_clear_mark
1.73% 0.26% __mod_lruvec_page_state
1.71% 1.06% move_pages_to_lru
1.66% 1.62% workingset_age_nonresident
1.60% 0.85% __mod_memcg_lruvec_state
1.58% 0.02% shrink_slab
1.49% 0.13% mem_cgroup_uncharge_list
1.45% 0.06% do_shrink_slab
1.37% 1.32% page_mapping
1.06% 0.76% count_shadow_nodes

io_uring
# Children Self Symbol
# ........ ........ ........................................
99.22% 0.00% do_syscall_64
94.48% 0.22% __io_queue_sqe
94.09% 0.31% io_issue_sqe
93.33% 0.62% io_read
88.57% 0.93% blkdev_read_iter
87.80% 0.15% io_iter_do_read
87.57% 0.16% generic_file_read_iter
87.35% 1.08% filemap_read
82.44% 0.01% __x64_sys_io_uring_enter
82.34% 0.01% __do_sys_io_uring_enter
82.09% 0.47% io_submit_sqes
79.50% 0.08% io_queue_sqe
71.47% 1.08% filemap_get_pages
53.08% 0.35% ondemand_readahead
51.74% 1.49% page_cache_ra_unbounded
49.92% 0.13% page_cache_sync_ra
21.26% 0.63% add_to_page_cache_lru
17.08% 0.02% task_work_run
16.98% 0.03% exit_to_user_mode_prepare
16.94% 10.52% __add_to_page_cache_locked
16.94% 0.06% tctx_task_work
16.79% 0.19% syscall_exit_to_user_mode
16.41% 2.61% filemap_get_read_batch
16.08% 15.72% xas_load
15.99% 0.15% read_pages
15.87% 0.13% blkdev_readahead
15.72% 0.56% mpage_readahead
15.58% 0.09% io_req_task_submit
15.37% 0.07% __io_req_task_submit
12.14% 0.15% copy_page_to_iter
12.03% 0.03% ***__page_cache_alloc***
11.98% 0.14% alloc_pages_current
11.66% 0.30% __alloc_pages_nodemask
11.28% 11.05% _copy_to_iter
11.06% 3.16% get_page_from_freelist
10.53% 0.10% submit_bio
10.27% 0.21% submit_bio_noacct
8.30% 0.51% blk_mq_submit_bio
7.73% 7.62% clear_page_erms
3.68% 1.02% do_mpage_readpage
3.33% 0.01% page_cache_async_ra
3.25% 0.01% blk_flush_plug_list
3.24% 0.06% blk_mq_flush_plug_list
3.18% 0.01% blk_mq_sched_insert_requests
3.15% 0.13% blk_mq_try_issue_list_directly
2.84% 0.15% __blk_mq_try_issue_directly
2.69% 0.58% nvme_queue_rq
2.54% 1.05% blk_attempt_plug_merge
2.53% 0.44% ***mark_page_accessed***
2.36% 0.12% rw_verify_area
2.16% 0.09% mpage_alloc
2.10% 0.27% lru_cache_add
1.81% 0.27% security_file_permission
1.75% 0.87% __pagevec_lru_add
1.70% 0.63% _raw_spin_lock_irq
1.53% 0.40% xa_get_order
1.52% 0.15% __blk_mq_alloc_request
1.50% 0.65% workingset_refault
1.50% 0.07% activate_page
1.48% 0.29% io_submit_flush_completions
1.46% 0.32% bio_alloc_bioset
1.42% 0.15% xa_load
1.40% 0.16% pagevec_lru_move_fn
1.39% 0.06% blk_finish_plug
1.35% 0.28% submit_bio_checks
1.26% 0.02% asm_common_interrupt
1.25% 0.01% common_interrupt
1.21% 1.21% native_queued_spin_lock_slowpath
1.18% 0.10% mempool_alloc
1.17% 0.00% __common_interrupt
1.14% 0.03% handle_edge_irq
1.08% 0.87% apparmor_file_permission
1.03% 0.19% __mod_lruvec_state
1.02% 0.65% blk_rq_merge_ok
1.01% 0.94% xas_start

After the patchset:

kswapd
# Children Self Symbol
# ........ ........ ........................................
100.00% 0.00% kswapd
99.92% 0.11% balance_pgdat
99.32% 0.03% shrink_node
97.25% 0.32% shrink_lruvec
96.80% 0.09% evict_lru_gen_pages
77.82% 6.28% shrink_page_list
61.61% 2.76% __remove_mapping
50.28% 0.33% __delete_from_page_cache
46.63% 1.08% page_cache_delete
42.20% 1.16% xas_store
40.71% 40.67% xas_create
12.54% 7.76% isolate_lru_gen_pages
6.42% 3.19% _raw_spin_lock_irqsave
6.15% 0.91% free_unref_page_list
5.62% 5.45% unlock_page
5.05% 0.59% free_unref_page_commit
4.35% 2.04% lru_gen_update_size
4.31% 1.41% free_pcppages_bulk
3.43% 3.36% native_queued_spin_lock_slowpath
3.38% 0.59% __mod_lruvec_state
2.97% 0.78% unaccount_page_cache_page
2.82% 2.52% __free_one_page
2.33% 1.18% __mod_memcg_lruvec_state
2.28% 2.17% xas_clear_mark
2.13% 0.30% __mod_lruvec_page_state
1.88% 0.04% shrink_slab
1.82% 1.78% workingset_eviction
1.74% 0.06% do_shrink_slab
1.70% 0.15% mem_cgroup_uncharge_list
1.39% 1.01% count_shadow_nodes
1.22% 1.18% __mod_memcg_state.part.0
1.16% 1.11% page_mapping
1.02% 0.98% xas_init_marks

io_uring
# Children Self Symbol
# ........ ........ ........................................
99.19% 0.01% entry_SYSCALL_64_after_hwframe
99.16% 0.00% do_syscall_64
94.78% 0.18% __io_queue_sqe
94.41% 0.25% io_issue_sqe
93.60% 0.48% io_read
89.35% 0.96% blkdev_read_iter
88.44% 0.12% io_iter_do_read
88.25% 0.16% generic_file_read_iter
88.00% 1.20% filemap_read
84.01% 0.01% __x64_sys_io_uring_enter
83.91% 0.01% __do_sys_io_uring_enter
83.74% 0.37% io_submit_sqes
81.28% 0.07% io_queue_sqe
74.65% 0.96% filemap_get_pages
55.92% 0.35% ondemand_readahead
54.57% 1.34% page_cache_ra_unbounded
51.57% 0.12% page_cache_sync_ra
24.14% 0.51% add_to_page_cache_lru
19.04% 11.51% __add_to_page_cache_locked
18.48% 0.13% read_pages
18.42% 0.18% blkdev_readahead
18.20% 0.55% mpage_readahead
16.81% 2.31% filemap_get_read_batch
16.37% 14.83% xas_load
15.40% 0.02% task_work_run
15.38% 0.03% exit_to_user_mode_prepare
15.31% 0.05% tctx_task_work
15.14% 0.15% syscall_exit_to_user_mode
14.05% 0.04% io_req_task_submit
13.86% 0.05% __io_req_task_submit
12.92% 0.12% submit_bio
11.40% 0.13% copy_page_to_iter
10.65% 9.61% _copy_to_iter
9.45% 0.03% ***__page_cache_alloc***
9.42% 0.16% submit_bio_noacct
9.40% 0.11% alloc_pages_current
9.11% 0.30% __alloc_pages_nodemask
8.53% 1.81% get_page_from_freelist
8.38% 0.10% asm_common_interrupt
8.26% 0.06% common_interrupt
7.75% 0.05% __common_interrupt
7.62% 0.44% blk_mq_submit_bio
7.56% 0.20% handle_edge_irq
6.45% 5.90% clear_page_erms
5.25% 0.10% handle_irq_event
4.88% 0.19% nvme_irq
4.83% 0.07% __handle_irq_event_percpu
4.73% 0.52% nvme_process_cq
4.52% 0.01% page_cache_async_ra
4.00% 0.04% nvme_pci_complete_rq
3.82% 0.04% nvme_complete_rq
3.76% 1.11% do_mpage_readpage
3.74% 0.06% blk_mq_end_request
3.03% 0.01% blk_flush_plug_list
3.02% 0.06% blk_mq_flush_plug_list
2.96% 0.01% blk_mq_sched_insert_requests
2.94% 0.10% blk_mq_try_issue_list_directly
2.89% 0.00% __irqentry_text_start
2.71% 0.41% psi_task_change
2.67% 0.21% lru_cache_add
2.65% 0.14% __blk_mq_try_issue_directly
2.53% 0.54% nvme_queue_rq
2.43% 0.17% blk_update_request
2.42% 0.58% __pagevec_lru_add
2.29% 1.42% psi_group_change
2.22% 0.85% blk_attempt_plug_merge
2.14% 0.04% bio_endio
2.13% 0.11% rw_verify_area
2.08% 0.18% mpage_end_io
2.01% 0.08% mpage_alloc
1.71% 0.56% _raw_spin_lock_irq
1.65% 0.98% workingset_refault
1.65% 0.09% psi_memstall_leave
1.64% 0.20% security_file_permission
1.61% 1.59% _raw_spin_lock
1.58% 0.08% psi_memstall_enter
1.44% 0.37% xa_get_order
1.43% 0.13% __blk_mq_alloc_request
1.37% 0.26% io_submit_flush_completions
1.36% 0.14% xa_load
1.34% 0.31% bio_alloc_bioset
1.31% 0.26% submit_bio_checks
1.29% 0.04% blk_finish_plug
1.28% 1.27% native_queued_spin_lock_slowpath
1.24% 0.19% page_endio
1.13% 0.10% unlock_page
1.09% 0.99% read_tsc
1.07% 0.74% lru_gen_addition
1.03% 0.09% mempool_alloc
1.02% 0.13% wake_up_page_bit
1.02% 0.92% xas_start
1.01% 0.78% apparmor_file_permission

By comparing the two sets, we can clearly see what's changed:

Before the patchset:
6.53% 0.83% ***shrink_active_list***
12.03% 0.03% ***__page_cache_alloc***
2.53% 0.44% ***mark_page_accessed***

After the patchset:
9.45% 0.03% ***__page_cache_alloc***
(There are shrink_active_list() or mark_page_accessed() since we don't
activate and deactivate pages anymore, for this test case.)

Hopefully this is clear enough. But I do see where your skepticism
comes from and I don't want to dismiss it out of hand. So if you have
any other benchmarks, I'd be happy to try them. What do you think?

Thanks.

Next message: Emmanuel Gil Peyrot: "[PATCH v2 0/3] eeprom-93xx46: Add support for Atmel AT93C56 and AT93C66"
Previous message: Ansuel Smith: "Re: [PATCH 11/14] drivers: net: dsa: qca8k: apply switch revision fix"
In reply to: Yu Zhao: "Re: [PATCH v2 00/16] Multigenerational LRU Framework"
Next in thread: Yu Zhao: "Re: [PATCH v2 00/16] Multigenerational LRU Framework"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]