Re: [Question] Is there a race window between swapoff vs synchronous swap_readpage

From: Yu Zhao
Date: Tue Mar 30 2021 - 01:48:21 EST


On Mon, Mar 29, 2021 at 9:44 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Miaohe Lin <linmiaohe@xxxxxxxxxx> writes:
>
> > On 2021/3/30 9:57, Huang, Ying wrote:
> >> Hi, Miaohe,
> >>
> >> Miaohe Lin <linmiaohe@xxxxxxxxxx> writes:
> >>
> >>> Hi all,
> >>> I am investigating the swap code, and I found the below possible race window:
> >>>
> >>> CPU 1 CPU 2
> >>> ----- -----
> >>> do_swap_page
> >>> skip swapcache case (synchronous swap_readpage)
> >>> alloc_page_vma
> >>> swapoff
> >>> release swap_file, bdev, or ...
> >>> swap_readpage
> >>> check sis->flags is ok
> >>> access swap_file, bdev or ...[oops!]
> >>> si->flags = 0
> >>>
> >>> The swapcache case is ok because swapoff will wait on the page_lock of swapcache page.
> >>> Is this will really happen or Am I miss something ?
> >>> Any reply would be really grateful. Thanks! :)
> >>
> >> This appears possible. Even for swapcache case, we can't guarantee the
> >
> > Many thanks for reply!
> >
> >> swap entry gotten from the page table is always valid too. The
> >
> > The page table may change at any time. And we may thus do some useless work.
> > But the pte_same() check could handle these races correctly if these do not
> > result in oops.
> >
> >> underlying swap device can be swapped off at the same time. So we use
> >> get/put_swap_device() for that. Maybe we need similar stuff here.
> >
> > Using get/put_swap_device() to guard against swapoff for swap_readpage() sounds
> > really bad as swap_readpage() may take really long time. Also such race may not be
> > really hurtful because swapoff is usually done when system shutdown only.
> > I can not figure some simple and stable stuff out to fix this. Any suggestions or
> > could anyone help get rid of such race?
>
> Some reference counting on the swap device can prevent swap device from
> swapping-off. To reduce the performance overhead on the hot-path as
> much as possible, it appears we can use the percpu_ref.

Hi,

I've been seeing crashes when testing the latest kernels with
stress-ng --class vm -a 20 -t 600s --temp-path /tmp

I haven't had time to look into them yet:

DEBUG_VM:
BUG: unable to handle page fault for address: ffff905c33c9a000
Call Trace:
get_swap_pages+0x278/0x590
get_swap_page+0x1ab/0x280
add_to_swap+0x7d/0x130
shrink_page_list+0xf84/0x25f0
reclaim_pages+0x313/0x430
madvise_cold_or_pageout_pte_range+0x95c/0xaa0

KASAN:
==================================================================
BUG: KASAN: slab-out-of-bounds in __frontswap_store+0xc9/0x2e0
Read of size 8 at addr ffff88901f646f18 by task stress-ng-mrema/31329
CPU: 2 PID: 31329 Comm: stress-ng-mrema Tainted: G S I L
5.12.0-smp-DEV #2
Call Trace:
dump_stack+0xff/0x165
print_address_description+0x81/0x390
__kasan_report+0x154/0x1b0
? __frontswap_store+0xc9/0x2e0
? __frontswap_store+0xc9/0x2e0
kasan_report+0x47/0x60
kasan_check_range+0x2f3/0x340
__kasan_check_read+0x11/0x20
__frontswap_store+0xc9/0x2e0
swap_writepage+0x52/0x80
pageout+0x489/0x7f0
shrink_page_list+0x1b11/0x2c90
reclaim_pages+0x6ca/0x930
madvise_cold_or_pageout_pte_range+0x1260/0x13a0

Allocated by task 16813:
____kasan_kmalloc+0xb0/0xe0
__kasan_kmalloc+0x9/0x10
__kmalloc_node+0x52/0x70
kvmalloc_node+0x50/0x90
__se_sys_swapon+0x353a/0x4860
__x64_sys_swapon+0x5b/0x70

The buggy address belongs to the object at ffff88901f640000
which belongs to the cache kmalloc-32k of size 32768
The buggy address is located 28440 bytes inside of
32768-byte region [ffff88901f640000, ffff88901f648000)
The buggy address belongs to the page:
page:0000000032d23e33 refcount:1 mapcount:0 mapping:0000000000000000
index:0x0 pfn:0x101f640
head:0000000032d23e33 order:4 compound_mapcount:0 compound_pincount:0
flags: 0x400000000010200(slab|head)
raw: 0400000000010200 ffffea00062b8408 ffffea000a6e9008 ffff888100040300
raw: 0000000000000000 ffff88901f640000 0000000100000001 000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff88901f646e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff88901f646e80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88901f646f00: 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff88901f646f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88901f647000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================

Relevant config options I could think of:

CONFIG_MEMCG_SWAP=y
CONFIG_THP_SWAP=y
CONFIG_ZSWAP=y