Re: [PATCH] mm,shmem: Fix a typo in shmem_swapin_page()

From: Huang, Ying
Date: Wed Aug 04 2021 - 04:59:53 EST


Hugh Dickins <hughd@xxxxxxxxxx> writes:

> On Wed, 28 Jul 2021, huang ying wrote:
>> On Sat, Jul 24, 2021 at 4:23 AM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
>> >
>> > I was wary because, if the (never observed) race to be fixed is in
>> > swap_cluster_readahead(), why was shmem_swapin_page() being patched?
>>
>> When we get a swap entry from the page table or shmem xarray, and no
>> necessary lock is held to prevent the swap device to be swapoff (e.g.
>> page table lock, page lock, etc.), it's possible that the swap device
>> has been swapoff when we operate on the swap entry (e.g. swapin).
>
> Yes. And even without any swapoff, the swap entry may no longer be
> right by the time we go to swap it in, or after it has been swapped in.

Yes.

> Both mm/memory.c and mm/shmem.c have done an unlocked lookup to get
> the swap entry, and have to do a pte_same() or shmem_confirm_swap()
> later, to ensure that what they've got is still right. That lockless
> lookup and raciness is intentional, and has been working for years.

Yes.

>> So we need to find a way to prevent the swap device to be swapoff,
>> get_swap_device() based on percpu_ref is used for that.
>
> To prevent swapoff, no, it would be bad if swapoff could be prevented
> indefinitely. But I think you're saying to prevent swapoff from
> completing - reaching the point of freeing structures which might
> still be being examined.

Yes.

> Yes, though I thought that was already prevented. But I may well have
> missed the inode_read_congested() case which came in two years ago
> (affecting shmem swapin only). And I'll admit now to never(?) having
> studied or tested the SWP_SYNCHRONOUS_IO case in do_swap_page() (with
> suspiciously no equivalent in shmem swapin): I'd better start.
>
> I do dislike the way SWP_SYNCHRONOUS_IO imported low-level swap business
> into do_swap_page(): I'd love to try to move that into swap_state.c, and
> in doing so hopefully get to appreciate it better (and the suggested
> swapoff race: I presume it comes from skipping swapcache_prepare()).

Yes. I think so too.

> But I have no time for that at present.
>
>> To avoid to
>> call get_swap_device() here and there (e.g. now it is called in many
>> different places), I think it's better to call get_swap_device() when
>> we just get a swap entry without holding the necessary lock, that is,
>> in do_swap_page() and shmem_swapin_page(), etc. So that we can delete
>> the get_swap_device() call in lookup_swap_cache(),
>> __read_swap_cache_async(), etc. This will make it easier to
>> understand when to use get_swap_device() and clean up the code. Do
>> you agree?
>
> Offhand I'd say that I do not agree: but I can easily imagine coming to
> agree with you, once I have tried to do better and discovered I cannot.
>
> I dislike the way swap internals are being pushed out to the higher
> levels. Particularly since those higher levels already have to deal
> with similar races which are not protected by get_swap_device().
>
> I'd forgotten how earlier you found you had to add get_swap_device()
> into lookup_swap_cache(), and friends: and can see the attraction of
> doing it once instead of here and there. But it is still there in
> lookup_swap_cache() etc, so that's a poor argument for these commits.

I have a patch in hand to delete get_swap_device() in
lookup_swap_cache() and other places. In fact, the bug fixed in this
patch is found when developing that patch.

An untested initial version of that cleanup patch is as below, I am
still working on it. In that patch, 6 get_swap_device() are deleted,
while 4 get_swap_device() are added. What matters more is that it's
more clear about when to call get_swap_device(). That is, when you get
a swap entry without necessary lock and want to operate on it.

>> > Not explained in its commit message, probably a misunderstanding of
>> > how mm/shmem.c already manages races (and prefers not to be involved
>> > in swap_info_struct stuff).
>>
>> Yes. The commit message isn't clean enough about why we do that.
>
> Thanks for clearing that up. In intervening days I did read about 50
> of the ~100 mails on these commits, April and later (yes, I was Cc'ed
> throughout, thanks, but that didn't give me the time). I've not yet
> reached any that explain the get_swap_device() placement, perhaps
> I'll change my mind when eventually I read those.

I will write about that in the patch to cleanup get_swap_device() calling.

>>
>> > But why do I now say it's bad? Because even if you correct the EINVAL
>> > to -EINVAL, that's an unexpected error: -EEXIST is common, -ENOMEM is
>> > not surprising, -ENOSPC can need consideration, but -EIO and anything
>> > else just end up as SIGBUS when faulting (or as error from syscall).
>>
>> Yes. -EINVAL isn't a good choice. If it's the swapoff race, then
>> retrying can fix the race, so -EAGAIN may be a choice. But if the
>> swap entry is really invalid (almost impossible in theory), we may
>> need something else, for example, WARN_ON_ONCE() and SIGBUS? This
>> reminds me that we may need to distinguish the two possibilities in
>> get_swap_device()?
>
> Ah, I guess in that last sentence you're thinking of what I realized
> in writing previous mail, behaviour when given a corrupted swap entry.

Yes. But because that should be impossible even in theory, I think we
can ignore it for now to avoid to make the code even more complicated?
Now, some corrupted swap entry can be identified and a error message can
be printed in get_swap_device().

Best Regards,
Huang, Ying

----------------------------------8<--------------------------------------