Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

From: Miaohe Lin
Date: Tue Apr 13 2021 - 23:27:39 EST


On 2021/4/14 11:07, Huang, Ying wrote:
> Miaohe Lin <linmiaohe@xxxxxxxxxx> writes:
>
>> On 2021/4/13 9:27, Huang, Ying wrote:
>>> Miaohe Lin <linmiaohe@xxxxxxxxxx> writes:
>>>
>>>> When I was investigating the swap code, I found the below possible race
>>>> window:
>>>>
>>>> CPU 1 CPU 2
>>>> ----- -----
>>>> do_swap_page
>>>> synchronous swap_readpage
>>>> alloc_page_vma
>>>> swapoff
>>>> release swap_file, bdev, or ...
>>>> swap_readpage
>>>> check sis->flags is ok
>>>> access swap_file, bdev...[oops!]
>>>> si->flags = 0
>>>>
>>>> Using current get/put_swap_device() to guard against concurrent swapoff for
>>>> swap_readpage() looks terrible because swap_readpage() may take really long
>>>> time. And this race may not be really pernicious because swapoff is usually
>>>> done when system shutdown only. To reduce the performance overhead on the
>>>> hot-path as much as possible, it appears we can use the percpu_ref to close
>>>> this race window(as suggested by Huang, Ying).
>>>>
>>>> Fixes: 235b62176712 ("mm/swap: add cluster lock")
>>>
>>> This isn't the commit that introduces the race. You can use `git blame`
>>> find out the correct commit. For this it's commit 0bcac06f27d7 "mm,
>>> swap: skip swapcache for swapin of synchronous device".
>>>
>>
>> Sorry about it! What I refer to is commit eb085574a752 ("mm, swap: fix race between
>> swapoff and some swap operations"). And I think this commit does not fix the race
>> condition completely, so I reuse the Fixes tag inside it.
>>
>>> And I suggest to merge 1/5 and 2/5 to make it easy to get the full
>>> picture.
>>>
>>>> Signed-off-by: Miaohe Lin <linmiaohe@xxxxxxxxxx>
>>>> ---
>>>> include/linux/swap.h | 2 +-
>>>> mm/memory.c | 10 ++++++++++
>>>> mm/swapfile.c | 28 +++++++++++-----------------
>>>> 3 files changed, 22 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index 849ba5265c11..9066addb57fd 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>>>>
>>>> static inline void put_swap_device(struct swap_info_struct *si)
>>>> {
>>>> - rcu_read_unlock();
>>>> + percpu_ref_put(&si->users);
>>>> }
>>>>
>>>> #else /* CONFIG_SWAP */
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index cc71a445c76c..8543c47b955c 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>> {
>>>> struct vm_area_struct *vma = vmf->vma;
>>>> struct page *page = NULL, *swapcache;
>>>> + struct swap_info_struct *si = NULL;
>>>> swp_entry_t entry;
>>>> pte_t pte;
>>>> int locked;
>>>> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>> }
>>>>
>>>>
>>>
>>> I suggest to add comments here as follows (words copy from Matthew Wilcox)
>>>
>>> /* Prevent swapoff from happening to us */
>>
>> Ok.
>>
>>>
>>>> + si = get_swap_device(entry);
>>>> + /* In case we raced with swapoff. */
>>>> + if (unlikely(!si))
>>>> + goto out;
>>>> +
>>>
>>> Because we wrap the whole do_swap_page() with get/put_swap_device()
>>> now. We can remove several get/put_swap_device() for function called by
>>> do_swap_page(). That can be another optimization patch.
>>
>> I tried to remove several get/put_swap_device() for function called
>> by do_swap_page() only before I send this series. But it seems they have
>> other callers without proper get/put_swap_device().
>
> Then we need to revise these callers instead. Anyway, can be another
> series.

Yes. can be another series.
Thanks.

>
> Best Regards,
> Huang, Ying
>
> .
>