Re: Hard and soft lockups with FIO and LTP runs on a large system

From: Bharata B Rao
Date: Mon Jul 15 2024 - 01:20:16 EST


On 11-Jul-24 11:13 AM, Bharata B Rao wrote:
On 09-Jul-24 11:28 AM, Yu Zhao wrote:
On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@xxxxxxx> wrote:

On 08-Jul-24 9:47 PM, Yu Zhao wrote:
On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@xxxxxxx> wrote:

Hi Yu Zhao,

Thanks for your patches. See below...

On 07-Jul-24 4:12 AM, Yu Zhao wrote:
Hi Bharata,

On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@xxxxxxx> wrote:

<snip>

Some experiments tried
======================
1) When MGLRU was enabled, many soft lockups were observed; no hard
lockups were seen during a 48-hour run. Below is one such soft lockup.

This is not really an MGLRU issue -- can you please try one of the
attached patches? It (truncate.patch) should help with or without
MGLRU.

With truncate.patch and the default LRU scheme, a few hard lockups are seen.

Thanks.

In your original report, you said:

    Most of the times the two contended locks are lruvec and
    inode->i_lock spinlocks.
    ...
    Often times, the perf output at the time of the problem shows
    heavy contention on lruvec spin lock. Similar contention is
    also observed with inode i_lock (in clear_shadow_entry path)

Based on this new report, does it mean the i_lock is not as contended
for the same path (truncation) you tested? If so, I'll post
truncate.patch and add Reported-by and Tested-by tags for you, unless
you have objections.

truncate.patch has been tested on two systems with the default LRU scheme,
and the lockup due to inode->i_lock hasn't been seen so far after a 24-hour run.

Thanks.


The two paths below were contended on the LRU lock, but they already
batch their operations. So I don't know what else we can do surgically
to improve them.
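
For context, "batch" here means the lruvec lock is taken once per batch of
folios rather than once per folio. As an illustration of that pattern, a
simplified sketch loosely based on the folio_batch handling in mm/swap.c
(paraphrased, not the exact kernel code):

#include <linux/mm.h>
#include <linux/memcontrol.h>
#include <linux/pagevec.h>

/* as in mm/swap.c: per-folio move callback invoked under lruvec->lru_lock */
typedef void (*move_fn_t)(struct lruvec *lruvec, struct folio *folio);

/*
 * Sketch only: move a batch of folios on their LRU lists while taking
 * lruvec->lru_lock once per batch (re-taking it only when a folio belongs
 * to a different lruvec), instead of locking and unlocking per folio.
 */
static void batch_move_lru_sketch(struct folio_batch *fbatch, move_fn_t move_fn)
{
	int i;
	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;

	for (i = 0; i < folio_batch_count(fbatch); i++) {
		struct folio *folio = fbatch->folios[i];

		/* lock (or re-lock) only when the lruvec changes */
		lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
		move_fn(lruvec, folio);
	}
	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);
	folios_put(fbatch);
}

Each lru_lock acquisition in such a path already covers up to a full
folio_batch (PAGEVEC_SIZE folios), so there isn't much left to batch further.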

What has been seen with this workload is that the lruvec spinlock is
held for a long time in the shrink_[active/inactive]_list path. In this
path, there is a case in isolate_lru_folios() where scanning of the LRU
lists can become unbounded. To isolate a page from ZONE_DMA, scanning/skipping
of more than 150 million folios was sometimes seen. There is already a
comment in there which explains why nr_skipped shouldn't be counted, but is
there any possibility of re-examining this condition?
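
For reference, the skip path in question looks roughly like this before the
change suggested below (paraphrased from isolate_lru_folios() in mm/vmscan.c,
not the exact code):

	/*
	 * Sketch of the pre-change skip path: folios from a zone above
	 * sc->reclaim_idx (or CMA folios being skipped) are moved to
	 * folios_skipped and do NOT advance 'scan', so when the tail of the
	 * LRU is dominated by such folios the loop keeps walking -- here,
	 * 150+ million folios -- with lruvec->lru_lock held the whole time.
	 */
	while (scan < nr_to_scan && !list_empty(src)) {
		struct list_head *move_to = src;
		struct folio *folio = lru_to_folio(src);
		unsigned long nr_pages = folio_nr_pages(folio);

		total_scan += nr_pages;

		if (folio_zonenum(folio) > sc->reclaim_idx ||
				skip_cma(folio, sc)) {
			nr_skipped[folio_zonenum(folio)] += nr_pages;
			move_to = &folios_skipped;
			goto move;	/* skipped folios don't count toward 'scan' */
		}

		/* only folios that pass the check are charged to 'scan' */
		scan += nr_pages;
		/* ... try to isolate the folio onto 'dst', then: */
move:
		list_move(&folio->lru, move_to);
	}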

For this specific case, probably this can help:

@@ -1659,8 +1659,15 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
                 if (folio_zonenum(folio) > sc->reclaim_idx ||
                                 skip_cma(folio, sc)) {
                         nr_skipped[folio_zonenum(folio)] += nr_pages;
-                       move_to = &folios_skipped;
-                       goto move;
+                       list_move(&folio->lru, &folios_skipped);
+                       if (spin_is_contended(&lruvec->lru_lock)) {
+                               if (!list_empty(dst))
+                                       break;
+                               spin_unlock_irq(&lruvec->lru_lock);
+                               cond_resched();
+                               spin_lock_irq(&lruvec->lru_lock);
+                       }
+                       continue;
                 }

Thanks, this helped. With this fix, the test ran for 24 hours without any
lockups attributable to the lruvec spinlock. As noted earlier in this thread,
isolate_lru_folios() used to scan millions of folios and spend a lot of time
with the spinlock held, but with this fix such a scenario is no longer seen.

However, during the weekend MGLRU-enabled run (with the above fix to
isolate_lru_folios(), the previous two patches truncate.patch and mglru.patch,
and the inode fix provided by Mateusz), another hard lockup related to the
lruvec spinlock was observed.

Here is the hard lockup:

watchdog: Watchdog detected hard LOCKUP on cpu 466
CPU: 466 PID: 3103929 Comm: fio Not tainted 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
Call Trace:
<NMI>
? show_regs+0x69/0x80
? watchdog_hardlockup_check+0x1b4/0x3a0
<SNIP>
? native_queued_spin_lock_slowpath+0x2b4/0x300
</NMI>
<IRQ>
_raw_spin_lock_irqsave+0x5b/0x70
folio_lruvec_lock_irqsave+0x62/0x90
folio_batch_move_lru+0x9d/0x160
folio_rotate_reclaimable+0xab/0xf0
folio_end_writeback+0x60/0x90
end_buffer_async_write+0xaa/0xe0
end_bio_bh_io_sync+0x2c/0x50
bio_endio+0x108/0x180
blk_mq_end_request_batch+0x11f/0x5e0
nvme_pci_complete_batch+0xb5/0xd0 [nvme]
nvme_irq+0x92/0xe0 [nvme]
__handle_irq_event_percpu+0x6e/0x1e0
handle_irq_event+0x39/0x80
handle_edge_irq+0x8c/0x240
__common_interrupt+0x4e/0xf0
common_interrupt+0x49/0xc0
asm_common_interrupt+0x27/0x40

Here are the lock holder details captured by all-cpu-backtrace:

NMI backtrace for cpu 75
CPU: 75 PID: 3095650 Comm: fio Not tainted 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
RIP: 0010:folio_inc_gen+0x142/0x430
Call Trace:
<NMI>
? show_regs+0x69/0x80
? nmi_cpu_backtrace+0xc5/0x130
? nmi_cpu_backtrace_handler+0x11/0x20
? nmi_handle+0x64/0x180
? default_do_nmi+0x45/0x130
? exc_nmi+0x128/0x1a0
? end_repeat_nmi+0xf/0x53
? folio_inc_gen+0x142/0x430
? folio_inc_gen+0x142/0x430
? folio_inc_gen+0x142/0x430
</NMI>
<TASK>
isolate_folios+0x954/0x1630
evict_folios+0xa5/0x8c0
try_to_shrink_lruvec+0x1be/0x320
shrink_one+0x10f/0x1d0
shrink_node+0xa4c/0xc90
do_try_to_free_pages+0xc0/0x590
try_to_free_pages+0xde/0x210
__alloc_pages_noprof+0x6ae/0x12c0
alloc_pages_mpol_noprof+0xd9/0x220
folio_alloc_noprof+0x63/0xe0
filemap_alloc_folio_noprof+0xf4/0x100
page_cache_ra_unbounded+0xb9/0x1a0
page_cache_ra_order+0x26e/0x310
ondemand_readahead+0x1a3/0x360
page_cache_sync_ra+0x83/0x90
filemap_get_pages+0xf0/0x6a0
filemap_read+0xe7/0x3d0
blkdev_read_iter+0x6f/0x140
vfs_read+0x25b/0x340
ksys_read+0x67/0xf0
__x64_sys_read+0x19/0x20
x64_sys_call+0x1771/0x20d0
do_syscall_64+0x7e/0x130

Regards,
Bharata.