Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders

From: Ryan Roberts
Date: Fri Mar 22 2024 - 05:39:47 EST


On 22/03/2024 02:39, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>
>> On 21/03/2024 04:39, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>>
>>>> Hi Huang, Ying,
>>>>
>>>>
>>>> On 12/03/2024 07:51, Huang, Ying wrote:
>>>>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>>> [...]
>>>>
>>>>>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>>>> }
>>>>>>
>>>>>> if (si->swap_map[offset]) {
>>>>>> + VM_WARN_ON(order > 0);
>>>>>> unlock_cluster(ci);
>>>>>> if (!n_ret)
>>>>>> goto scan;
>>>>>> else
>>>>>> goto done;
>>>>>> }
>>>>>> - WRITE_ONCE(si->swap_map[offset], usage);
>>>>>> - inc_cluster_info_page(si, si->cluster_info, offset);
>>>>>> + memset(si->swap_map + offset, usage, nr_pages);
>>>>>
>>>>> Add barrier() here corresponds to original WRITE_ONCE()?
>>>>> unlock_cluster(ci) may be NOP for some swap devices.
>>>>
>>>> Looking at this a bit more closely, I'm not sure this is needed. Even if there
>>>> is no cluster, the swap_info is still locked, so unlocking that will act as a
>>>> barrier. There are a number of other callsites that memset(si->swap_map) without
>>>> an explicit barrier and with the swap_info locked.
>>>>
>>>> Looking at the original commit that added the WRITE_ONCE() it was worried about
>>>> a race with reading swap_map in _swap_info_get(). But that site is now annotated
>>>> with a data_race(), which will suppress the warning. And I don't believe there
>>>> are any places that read swap_map locklessly and depend upon observing ordering
>>>> between it and other state? So I think the si unlock is sufficient?
>>>>
>>>> I'm not planning to add barrier() here. Let me know if you disagree.
>>>
>>> swap_map[] may be read locklessly in swap_offset_available_and_locked()
>>> in parallel. IIUC, WRITE_ONCE() here is to make the writing take effect
>>> as early as possible there.
>>
>> Afraid I'm not convinced by that argument; if it's racing, it's racing - the
>
> It's not a race.
>
>> lockless side needs to be robust (it is). Adding the compiler barrier limits the
>> compiler's options which could lead to slower code in this path. If your
>> argument is that you want to reduce the window where
>> swap_offset_available_and_locked() could observe a free swap slot but then see
>> that its taken after it gets the si lock, that seems like a micro-optimization
>> to me, which we should avoid if we can.
>
> Yes. I think that it is a micro-optimization too. I had thought that
> it is a common practice to use WRITE_ONCE()/READ_ONCE() or barrier() in
> intentional racy data accessing to make the change available as soon as
> possible. But I may be wrong here.

My understanding is that WRITE_ONCE() forces the compiler to emit a store at
that point in the program; it can't just rely on knowing that it has previously
written the same value to that location, it can't reorder the load to later in
the program and it must store the whole word atomically so that no tearing can
be observed. But given that swap_map is only ever written with the si lock held,
I don't believe we require the first two of those semantics. It should be enough
to know that the compiler has emitted all the stores (if it determines they are
required) prior to the spin_unlock(). I'm not sure about the anti-tearing
guarrantee.

Happy to be told I'm wrong here!

>
>> By remnoving the WRITE_ONCE() and using memset, the lockless reader could
>> observe tearing though. I don't think that should cause a problem (because
>> everything is rechecked with under the lock), but if we want to avoid it, then
>> perhaps we just need to loop over WRITE_ONCE() here instead of using memset?
>
> IIUC, in practice that isn't necessary, because type of si->swap_map[]
> is "unsigned char". It isn't possible to tear "unsigned char".

In practice, perhaps. But I guess the compiler is free to update the char
bit-by-bit if it wants to, if the store is not marked WRITE_ONCE()?

> In
> theory, it may be better to use WRITE_ONCE() because we may change the
> type of si->swap_map[] at some time (who knows). I don't have a strong
> opinion here.

The way I see it, the precedent is already set; there are a number of places
that already use memset to update swap_map. They are all under the si lock, and
none use barrier(). If its wrong here, then its wrong in those places too, I
believe.

>
>>>>>> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>>>>> unlock_cluster(ci);
>
> --
> Best Regards,
> Huang, Ying