Re: [RFC PATCH] mm: move xa forward when run across zombie page

From: Dave Chinner
Date: Wed Oct 19 2022 - 19:43:09 EST


On Thu, Oct 20, 2022 at 09:46:16AM +1100, Dave Chinner wrote:
> On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote:
> > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote:
> > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > > > This is reading and writing the same amount of file data at the
> > > > application level, but once the data has been written and kicked out
> > > > of the page cache it seems to require an awful lot more read IO to
> > > > get it back to the application. i.e. this looks like mmap() is
> > > > readahead thrashing severely, and eventually it livelocks with this
> > > > sort of report:
> > > >
> > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > > > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > > > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > > > [175901.995614] Call Trace:
> > > > [175901.996090] <TASK>
> > > > [175901.996594] ? __schedule+0x301/0xa30
> > > > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > [175902.000714] ? xas_start+0x53/0xc0
> > > > [175902.001484] ? xas_load+0x24/0xa0
> > > > [175902.002208] ? xas_load+0x5/0xa0
> > > > [175902.002878] ? __filemap_get_folio+0x87/0x340
> > > > [175902.003823] ? filemap_fault+0x139/0x8d0
> > > > [175902.004693] ? __do_fault+0x31/0x1d0
> > > > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0
> > > > [175902.006213] ? handle_mm_fault+0xd0/0x2a0
> > > > [175902.006998] ? exc_page_fault+0x1d9/0x810
> > > > [175902.007789] ? asm_exc_page_fault+0x22/0x30
> > > > [175902.008613] </TASK>
> > > >
> > > > Given that filemap_fault on XFS is probably trying to map large
> > > > folios, I do wonder if this is a result of some kind of race with
> > > > teardown of a large folio...
> > >
> > > It doesn't matter whether we're trying to map a large folio; it
> > > matters whether a large folio was previously created in the cache.
> > > Through the magic of readahead, it may well have been. I suspect
> > > it's not teardown of a large folio, but splitting. Removing a
> > > page from the page cache stores to the pointer in the XArray
> > > first (either NULL or a shadow entry), then decrements the refcount.
> > >
> > > We must be observing a frozen folio. There are a number of places
> > > in the MM which freeze a folio, but the obvious one is splitting.
> > > That looks like this:
> > >
> > > local_irq_disable();
> > > if (mapping) {
> > > xas_lock(&xas);
> > > (...)
> > > if (folio_ref_freeze(folio, 1 + extra_pins)) {
> >
> > But the lookup is not doing anything to prevent the split on the
> > frozen page from making progress, right? It's not holding any folio
> > references, and it's not holding the mapping tree lock, either. So
> > how does the lookup in progress prevent the page split from making
> > progress?
> >
> >
> > > So one way to solve this might be to try to take the xa_lock on
> > > failure to get the refcount. Otherwise a high-priority task
> > > might spin forever without a low-priority task getting the chance
> > > to finish the work being done while the folio is frozen.
> >
> > IIUC, then you are saying that there is a scheduling priority
> > inversion because the lookup failure looping path doesn't yeild the
> > CPU?
> >
> > If so, how does taking the mapping tree spin lock on failure cause
> > the looping task to yield the CPU and hence allow the folio split to
> > make progress?
> >
> > Also, AFAICT, the page split has disabled local interrupts, so it
> > should effectively be running with preemption disabled as it has
> > turned off the mechanism the scheduler can use to preempt it. The
> > page split can't sleep, either, because it holds the mapping tree
> > lock. Hence I can't see how a split-in-progress can be preempted in
> > teh first place to cause a priority inversion livelock like this...
> >
> > > ie this:
> > >
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 08341616ae7a..ca0eed80580f 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -1860,8 +1860,13 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
> > > if (!folio || xa_is_value(folio))
> > > goto out;
> > >
> > > - if (!folio_try_get_rcu(folio))
> > > + if (!folio_try_get_rcu(folio)) {
> > > + unsigned long flags;
> > > +
> > > + xas_lock_irqsave(&xas, flags);
> > > + xas_unlock_irqrestore(&xas, flags);
> > > goto repeat;
> > > + }
>
> As I suspected, this change did not prevent the livelock. It
> reproduced after just 650 test iterations (a few minutes) with this
> change in place.
>
> > I would have thought:
> >
> > if (!folio_try_get_rcu(folio)) {
> > rcu_read_unlock();
> > cond_resched();
> > rcu_read_lock();
> > goto repeat;
> > }
> >
> > Would be the right way to yeild the CPU to avoid priority
> > inversion related livelocks here...
>
> I'm now trying this just to provide a data point that actually
> yeilding the CPU avoids the downstream effects of the livelock (i.e.

This causes a different problems to occur. First, a task hangs
evicting the page cache on unlink like so:

[ 1352.990246] INFO: task test_write:3572 blocked for more than 122 seconds.
[ 1352.991418] Not tainted 6.1.0-rc1-dgc+ #1618
[ 1352.992378] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1352.994702] task:test_write state:D stack:13088 pid:3572 ppid:3567 flags:0x00004002
[ 1352.996257] Call Trace:
[ 1352.996905] <TASK>
[ 1352.997452] __schedule+0x2f9/0xa20
[ 1352.998380] schedule+0x5a/0xc0
[ 1352.999011] io_schedule+0x42/0x70
[ 1352.999565] folio_wait_bit_common+0x137/0x370
[ 1353.000553] ? dio_warn_stale_pagecache.part.0+0x50/0x50
[ 1353.001425] truncate_inode_pages_range+0x41a/0x470
[ 1353.002443] evict+0x1ad/0x1c0
[ 1353.003199] do_unlinkat+0x1db/0x2e0
[ 1353.003907] __x64_sys_unlinkat+0x33/0x60
[ 1353.004569] do_syscall_64+0x35/0x80
[ 1353.005354] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1353.006315] RIP: 0033:0x7f03d6a42157
[ 1353.007088] RSP: 002b:00007ffde8d98388 EFLAGS: 00000206 ORIG_RAX: 0000000000000107
[ 1353.008367] RAX: ffffffffffffffda RBX: 0000000000000df4 RCX: 00007f03d6a42157
[ 1353.009748] RDX: 0000000000000000 RSI: 000055f8660415a0 RDI: 0000000000000007
[ 1353.011115] RBP: 0000000000000003 R08: 0000000000000000 R09: 00007ffde8d96105
[ 1353.012422] R10: 0000000000000158 R11: 0000000000000206 R12: 0000000000000158
[ 1353.013556] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1353.014958] </TASK>

And soon after:

[ 1607.746983] ------------[ cut here ]------------
[ 1607.748141] WARNING: CPU: 27 PID: 51385 at fs/iomap/buffered-io.c:80 iomap_page_release+0xaf/0xc0
[ 1607.751098] Modules linked in:
[ 1607.751619] CPU: 27 PID: 51385 Comm: test_write Not tainted 6.1.0-rc1-dgc+ #1618
[ 1607.753307] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 1607.755443] RIP: 0010:iomap_page_release+0xaf/0xc0
[ 1607.756496] Code: 48 c1 e8 02 83 e0 01 75 10 38 d0 75 21 5b 4c 89 e7 5d 41 5c e9 d2 be ee ff eb ee e8 bb e0 ec ff eb 88 48 8b 07 5b 5d 41 5c c3 <0f> 0b eb b1 0f 0b eb db 0f 0b eb b2 0f 1f 44 00 00 0f 1f 44 00 00
[ 1607.759643] RSP: 0018:ffffc900087cb5b0 EFLAGS: 00010206
[ 1607.760775] RAX: 0000000000200000 RBX: ffffea0004378000 RCX: 000000000000000c
[ 1607.762181] RDX: 0017ffffc0050081 RSI: ffff888145cb0178 RDI: ffffea0004378000
[ 1607.764408] RBP: 0000000000000200 R08: 0000000000200000 R09: ffffea0004378008
[ 1607.765560] R10: ffff888141988800 R11: 0000000000033900 R12: ffff88810200ee40
[ 1607.766988] R13: 0000000000000200 R14: ffffc900087cb8c0 R15: ffffea0004378008
[ 1607.768347] FS: 00007f03d5c00640(0000) GS:ffff88833bdc0000(0000) knlGS:0000000000000000
[ 1607.769756] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1607.771421] CR2: 00007f03cae00000 CR3: 0000000246226006 CR4: 0000000000770ee0
[ 1607.773298] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1607.775026] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1607.777341] PKRU: 55555554
[ 1607.777995] Call Trace:
[ 1607.778637] <TASK>
[ 1607.779372] iomap_release_folio+0x4d/0xa0
[ 1607.780510] shrink_folio_list+0x9dd/0xb10
[ 1607.781931] shrink_lruvec+0x558/0xb80
[ 1607.782922] shrink_node+0x2ce/0x6f0
[ 1607.783993] do_try_to_free_pages+0xd5/0x570
[ 1607.785031] try_to_free_mem_cgroup_pages+0x105/0x220
[ 1607.786728] reclaim_high+0xa6/0xf0
[ 1607.787672] mem_cgroup_handle_over_high+0x97/0x290
[ 1607.789179] try_charge_memcg+0x697/0x7f0
[ 1607.790100] charge_memcg+0x35/0xd0
[ 1607.790930] __mem_cgroup_charge+0x29/0x80
[ 1607.792884] __filemap_add_folio+0x21a/0x460
[ 1607.794342] ? scan_shadow_nodes+0x30/0x30
[ 1607.796052] filemap_add_folio+0x37/0xa0
[ 1607.797036] __filemap_get_folio+0x221/0x360
[ 1607.798468] ? xas_load+0x5/0xa0
[ 1607.799417] iomap_write_begin+0x103/0x580
[ 1607.800908] ? filemap_dirty_folio+0x5c/0x80
[ 1607.801940] ? iomap_write_end+0x101/0x250
[ 1607.803305] iomap_file_buffered_write+0x17f/0x330
[ 1607.804700] xfs_file_buffered_write+0xb1/0x2e0
[ 1607.806240] ? xfs_file_buffered_write+0x2b2/0x2e0
[ 1607.807743] vfs_write+0x2ca/0x3d0
[ 1607.808622] __x64_sys_pwrite64+0x8c/0xc0
[ 1607.810502] do_syscall_64+0x35/0x80
[ 1607.811511] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1607.813116] RIP: 0033:0x7f03d6b40a87
....
[ 1607.925434] kernel BUG at mm/vmscan.c:1306!
[ 1607.926977] invalid opcode: 0000 [#1] PREEMPT SMP
[ 1607.928427] CPU: 8 PID: 51385 Comm: test_write Tainted: G W 6.1.0-rc1-dgc+ #1618
[ 1607.930742] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 1607.932716] RIP: 0010:__remove_mapping+0x229/0x240
[ 1607.933648] Code: a9 00 00 08 00 75 27 48 8b 3b 48 81 c7 88 00 00 00 e8 db 64 c1 00 31 c0 e9 d6 fe ff ff 48 8b 3b e8 7c 29 0d 00 e9 27 ff ff ff <0f> 0b 0f 0b 48 8b 45 00 f6 c4 04 74 d0 31 c0 e9 b5 fe ff ff 0f 1f
[ 1607.937079] RSP: 0018:ffffc900087cb5d8 EFLAGS: 00010246
[ 1607.938050] RAX: 0017ffffc0010004 RBX: ffffc900087cb6e8 RCX: ffff888244bc5000
[ 1607.939443] RDX: 0000000000000001 RSI: ffffea0004378000 RDI: ffff888145cb0318
[ 1607.940980] RBP: ffffc900087cb648 R08: ffff888145cb0318 R09: ffffffff813be400
[ 1607.942300] R10: ffff88810200ee40 R11: 0000000000033900 R12: ffffea0004378000
[ 1607.943666] R13: 0000000000000200 R14: ffffc900087cb8c0 R15: ffffea0004378008
[ 1607.945004] FS: 00007f03d5c00640(0000) GS:ffff88823bc00000(0000) knlGS:0000000000000000
[ 1607.946492] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1607.947552] CR2: 00007efd1ae33da8 CR3: 0000000246226001 CR4: 0000000000770ee0
[ 1607.948866] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1607.950250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1607.951738] PKRU: 55555554
[ 1607.952289] Call Trace:
[ 1607.952775] <TASK>
[ 1607.953190] shrink_folio_list+0x5cd/0xb10
[ 1607.954140] shrink_lruvec+0x558/0xb80
[ 1607.954850] shrink_node+0x2ce/0x6f0
[ 1607.955536] do_try_to_free_pages+0xd5/0x570
[ 1607.956397] try_to_free_mem_cgroup_pages+0x105/0x220
[ 1607.957503] reclaim_high+0xa6/0xf0
[ 1607.958205] mem_cgroup_handle_over_high+0x97/0x290
[ 1607.959149] try_charge_memcg+0x697/0x7f0
[ 1607.959932] charge_memcg+0x35/0xd0
[ 1607.960549] __mem_cgroup_charge+0x29/0x80
[ 1607.961257] __filemap_add_folio+0x21a/0x460
[ 1607.961912] ? scan_shadow_nodes+0x30/0x30
[ 1607.962705] filemap_add_folio+0x37/0xa0
[ 1607.963354] __filemap_get_folio+0x221/0x360
[ 1607.964076] ? xas_load+0x5/0xa0
[ 1607.964761] iomap_write_begin+0x103/0x580
[ 1607.965510] ? filemap_dirty_folio+0x5c/0x80
[ 1607.966349] ? iomap_write_end+0x101/0x250
[ 1607.967185] iomap_file_buffered_write+0x17f/0x330
[ 1607.968133] xfs_file_buffered_write+0xb1/0x2e0
[ 1607.968948] ? xfs_file_buffered_write+0x2b2/0x2e0
[ 1607.969783] vfs_write+0x2ca/0x3d0
[ 1607.970326] __x64_sys_pwrite64+0x8c/0xc0
[ 1607.971073] do_syscall_64+0x35/0x80
[ 1607.971732] entry_SYSCALL_64_after_hwframe+0x63/0xcd

which is:

1297 * Same as remove_mapping, but if the folio is removed from the mapping, it
1298 * gets returned with a refcount of 0.
1299 */
1300 static int __remove_mapping(struct address_space *mapping, struct folio *folio,
1301 bool reclaimed, struct mem_cgroup *target_memcg)
1302 {
1303 int refcount;
1304 void *shadow = NULL;
1305
1306 BUG_ON(!folio_test_locked(folio));
1307 BUG_ON(mapping != folio_mapping(folio));
1308

tripping over an unlocked folio in shrink_folio_list()....

And, finally, there's also a process spinning on a cpu in
truncate_inode_pages_range():

[ 3468.057406] CPU: 1 PID: 3579 Comm: test_write Tainted: G D W 6.1.0-rc1-dgc+ #1618
[ 3468.057407] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 3468.057408] RIP: 0010:truncate_inode_pages_range+0x3c0/0x470
[ 3468.057411] Code: 8b 7c e5 00 41 f6 c6 01 0f 85 62 ff ff ff be a9 03 00 00 48 c7 c7 b8 62 7b 82 e8 0b 46 e9 ff 2e 2e 2e 31 c0 f0 49 0f ba 7
[ 3468.057412] RSP: 0018:ffffc90002cffd40 EFLAGS: 00000246
[ 3468.057414] RAX: 0000000000000000 RBX: ffffc90002cffdd8 RCX: ffff8882461b8000
[ 3468.057415] RDX: 0000000000000000 RSI: 00000000000003a9 RDI: ffffffff827b62b8
[ 3468.057415] RBP: ffff888246eee718 R08: ffffffffffffffc0 R09: fffffffffffffffe
[ 3468.057416] R10: 0000000000002838 R11: 0000000000002834 R12: 0000000000000004
[ 3468.057417] R13: ffffc90002cffd60 R14: ffffea0004378000 R15: 0000000000002810
[ 3468.057418] FS: 00007f03d6953740(0000) GS:ffff88813bc40000(0000) knlGS:0000000000000000
[ 3468.057420] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3468.057421] CR2: 00007f03d0100000 CR3: 0000000245e2c006 CR4: 0000000000770ee0
root@test4:~# [ 3468.057422] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3468.057422] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3468.057423] PKRU: 55555554
[ 3468.057424] Call Trace:
[ 3468.057424] <TASK>
[ 3468.057426] evict+0x1ad/0x1c0
[ 3468.057429] do_unlinkat+0x1db/0x2e0
[ 3468.057432] __x64_sys_unlinkat+0x33/0x60
[ 3468.057434] do_syscall_64+0x35/0x80
[ 3468.057437] entry_SYSCALL_64_after_hwframe+0x63/0xcd

So that's not the problem, either...

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx