Re: general protection fault in wb_timer_fn

From: Yang Shi
Date: Thu Sep 16 2021 - 16:18:22 EST


On Thu, Sep 16, 2021 at 3:47 AM Hao Sun <sunhao.th@xxxxxxxxx> wrote:
>
> > The BUG is triggered if it tries to invalidate across pages. But it
> > hardcoded PAGE_SIZE. The offset passed in by truncate_cleanup_page()
> > is 0, but the length might be > PAGE_SIZE if it is a compound page. It
> > might be caused by READ_ONLY_THP_FOR_FS.
> >
> > Could you please try the below debug patch to dump page details? I saw
> > your kernel config has DEBUG_VM enabled.
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index ab7573d72dd7..ed7256112c2b 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -1507,7 +1507,8 @@ void block_invalidatepage(struct page *page,
> > unsigned int offset,
> > /*
> > * Check for overflow
> > */
> > - BUG_ON(stop > PAGE_SIZE || stop < length);
> > + VM_BUG_ON_PAGE((stop > PAGE_SIZE), page);
> > + VM_BUG_ON_PAGE((stop < length), page);
> >
> > head = page_buffers(page);
> > bh = head;
> >
>
> Just patched it.
> The following log was printed after executing the C reproducer.
>
> IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
> IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
> Bluetooth: hci0: command 0x0409 tx timeout
> Bluetooth: hci0: command 0x041b tx timeout
> Bluetooth: hci0: command 0x040f tx timeout
> Bluetooth: hci0: command 0x0419 tx timeout
> page:ffffea00009c0000 refcount:514 mapcount:0 mapping:ffff8881060db7b0
> index:0x0 pfn:0x27000
> head:ffffea00009c0000 order:9 compound_mapcount:0 compound_pincount:0
> memcg:ffff8880118bc000
> aops:def_blk_aops ino:fa00000
> flags: 0xfff00000012037(locked|referenced|uptodate|lru|active|private|head|node=0|zone=1|lastcpupid=0x7ff)
> raw: 00fff00000012037 ffffea00009491c8 ffff888010c7d030 ffff8881060db7b0
> raw: 0000000000000000 ffff888025d10658 00000202ffffffff ffff8880118bc000
> page dumped because: VM_BUG_ON_PAGE((stop > ((1UL) << 12)))
> page_owner tracks the page as allocated
> page last allocated via order 9, migratetype Movable, gfp_mask
> 0x13c24ca(GFP_TRANSHUGE|__GFP_THISNODE), pid 35, ts 579054356699,
> free_ts 543915753195
> set_page_owner include/linux/page_owner.h:31 [inline]
> post_alloc_hook mm/page_alloc.c:2418 [inline]
> prep_new_page+0x1a5/0x240 mm/page_alloc.c:2424
> get_page_from_freelist+0x1f10/0x3b70 mm/page_alloc.c:4153
> __alloc_pages+0x306/0x6e0 mm/page_alloc.c:5375
> __alloc_pages_node include/linux/gfp.h:570 [inline]
> khugepaged_alloc_page+0xa0/0x170 mm/khugepaged.c:881
> collapse_file+0x20a/0x45f0 mm/khugepaged.c:1655
> khugepaged_scan_file mm/khugepaged.c:2051 [inline]
> khugepaged_scan_mm_slot mm/khugepaged.c:2146 [inline]
> khugepaged_do_scan mm/khugepaged.c:2230 [inline]
> khugepaged+0x2e65/0x5c50 mm/khugepaged.c:2275
> kthread+0x3e5/0x4d0 kernel/kthread.c:319
> ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
> page last free stack trace:
> reset_page_owner include/linux/page_owner.h:24 [inline]
> free_pages_prepare mm/page_alloc.c:1338 [inline]
> free_pcp_prepare+0x412/0x900 mm/page_alloc.c:1389
> free_unref_page_prepare mm/page_alloc.c:3315 [inline]
> free_unref_page+0x19/0x580 mm/page_alloc.c:3394
> release_pages+0x87f/0x2920 mm/swap.c:926
> tlb_batch_pages_flush mm/mmu_gather.c:49 [inline]
> tlb_flush_mmu_free mm/mmu_gather.c:242 [inline]
> tlb_flush_mmu+0x8d/0x610 mm/mmu_gather.c:249
> tlb_finish_mmu+0x93/0x3c0 mm/mmu_gather.c:340
> unmap_region+0x27f/0x350 mm/mmap.c:2653
> __do_munmap+0xabc/0x11e0 mm/mmap.c:2884
> do_munmap mm/mmap.c:2895 [inline]
> munmap_vma_range mm/mmap.c:603 [inline]
> mmap_region+0x2c4/0x1340 mm/mmap.c:1742
> do_mmap+0x7f5/0xe60 mm/mmap.c:1575
> vm_mmap_pgoff+0x1b7/0x290 mm/util.c:519
> ksys_mmap_pgoff+0x49f/0x620 mm/mmap.c:1624
> do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> entry_SYSCALL_64_after_hwframe+0x44/0xae
> ------------[ cut here ]------------
> kernel BUG at fs/buffer.c:1511!
> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> CPU: 1 PID: 2954 Comm: kworker/1:2 Not tainted 5.15.0-rc1+ #4
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.13.0-1ubuntu1.1 04/01/2014
> Workqueue: events delayed_fput
> RIP: 0010:block_invalidatepage+0x599/0x680 fs/buffer.c:1511
> Code: 0f 0b e8 9a 6b 9c ff 31 f6 4c 89 e7 e8 10 d7 bf ff e9 df fe ff
> ff e8 86 6b 9c ff 48 c7 c6 00 4e 9a 89 4c 89 e7 e8 c7 16 d0 ff <0f> 0b
> e8 70 6b 9c ff 48 c7 c6 60 4e 9a 89 4c 89 e7 e8 b1 16 d0 ff
> RSP: 0018:ffffc9000e9078b8 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff888100bd9c80
> RDX: 0000000000000000 RSI: ffff888100bd9c80 RDI: 0000000000000002
> RBP: 0000000000000000 R08: ffffffff81d9e369 R09: 000000000000ffff
> R10: 0000000000000003 R11: ffffed1026b83f53 R12: ffffea00009c0000
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000200000
> FS: 0000000000000000(0000) GS:ffff888135c00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007ffde41e8960 CR3: 0000000019cef000 CR4: 0000000000350ee0
> Call Trace:
> do_invalidatepage mm/truncate.c:157 [inline]
> truncate_cleanup_page+0x3e4/0x620 mm/truncate.c:176
> truncate_inode_pages_range+0x26c/0x1910 mm/truncate.c:325
> kill_bdev.isra.0+0x5f/0x80 block/bdev.c:77
> blkdev_flush_mapping+0xdf/0x2e0 block/bdev.c:658
> blkdev_put_whole+0xe8/0x110 block/bdev.c:689
> blkdev_put+0x23c/0x6f0 block/bdev.c:953
> blkdev_close+0x8d/0xb0 block/fops.c:459
> __fput+0x288/0x9f0 fs/file_table.c:280
> delayed_fput+0x56/0x70 fs/file_table.c:308
> process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
> worker_thread+0x90/0xed0 kernel/workqueue.c:2444
> kthread+0x3e5/0x4d0 kernel/kthread.c:319
> ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
> Modules linked in:
> Dumping ftrace buffer:
> (ftrace buffer empty)
> ---[ end trace 77cfca54575c5255 ]---
> RIP: 0010:block_invalidatepage+0x599/0x680 fs/buffer.c:1511
> Code: 0f 0b e8 9a 6b 9c ff 31 f6 4c 89 e7 e8 10 d7 bf ff e9 df fe ff
> ff e8 86 6b 9c ff 48 c7 c6 00 4e 9a 89 4c 89 e7 e8 c7 16 d0 ff <0f> 0b
> e8 70 6b 9c ff 48 c7 c6 60 4e 9a 89 4c 89 e7 e8 b1 16 d0 ff
> RSP: 0018:ffffc9000e9078b8 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff888100bd9c80
> RDX: 0000000000000000 RSI: ffff888100bd9c80 RDI: 0000000000000002
> RBP: 0000000000000000 R08: ffffffff81d9e369 R09: 000000000000ffff
> R10: 0000000000000003 R11: ffffed1026b83f53 R12: ffffea00009c0000
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000200000
> FS: 0000000000000000(0000) GS:ffff888135c00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f78dffb7ef8 CR3: 000000001f2ef000 CR4: 0000000000350ee0
>
>
> > If my speculation is correct, I think the below patch should be able
> > to fix this issue.
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index ab7573d72dd7..18428cee59af 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -1507,7 +1507,7 @@ void block_invalidatepage(struct page *page,
> > unsigned int offset,
> > /*
> > * Check for overflow
> > */
> > - BUG_ON(stop > PAGE_SIZE || stop < length);
> > + BUG_ON(stop > thp_size(page) || stop < length);
> >
> > head = page_buffers(page);
> > bh = head;
> >
>")
> Yes, the C reproducer can not crash the kernel anymore after patching
> the above code.

Thank you for running the test. This does prove my speculation. It
seems commit eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint
on file-backed THPs") opens much more cases for file THPs.

It seems your test case opens null block device and mmaps with
PROT_EXEC. This is why the THP is collapsed.

The above fix is kind of ad-hoc. The further investigation shows
bigger problem in invalidatepage(). All the implementations are *NOT*
THP aware and hardcoded PAGE_SIZE. Some triggers BUG(), like
block_invalidatepage(), some just returns error if length is >
PAGE_SIZE.

We could convert PAGE_SIZE to thp_size(), but it seems not enough
since the current implementations just invalidate one subpage
(typically head page), but it is not enough since other subpages may
have private too because PG_private is per subpage so there may be
multiple subpages have private IIUC. This may cause the THP not
splittable and reclaimable since the extra refcount pins from private
of subpages prevent this.

I could submit a patch to close the BUG() for now since more work
definitely needs to be done to make all the things right. However, how
to fix this may have conflicts with Willy's page folio work, so this
may not happen at any time soon.