Re: [PATCH v6 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()

From: Ryan Roberts
Date: Mon Apr 08 2024 - 06:40:02 EST


On 08/04/2024 11:24, David Hildenbrand wrote:
> On 08.04.24 12:07, Ryan Roberts wrote:
>> On 08/04/2024 10:43, David Hildenbrand wrote:
>>>
>>>>>> +
>>>>>> +/**
>>>>>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>>>>>> + * @start_ptep: Page table pointer for the first entry.
>>>>>> + * @max_nr: The maximum number of table entries to consider.
>>>>>> + * @entry: Swap entry recovered from the first table entry.
>>>>>> + *
>>>>>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>>>>>> + * containing swap entries all with consecutive offsets and targeting the
>>>>>> + * same swap type.
>>>>>> + *
>>>>>
>>>>> Likely you should document that any swp pte bits are ignored?
>>>>
>>>> Sorry I don't understand this comment. I thought any non-none, non-present PTE
>>>> was always considered to contain only a "swap entry" and a swap entry consists
>>>> of a "type" and an "offset" only (and it's a special "non-swap" swap entry if
>>>> type > SOME_CONSTANT). Are you saying there are additional fields in the PTE
>>>> that are not part of the swap entry?
>>>
>>>
>>> pte_swp_soft_dirty()
>>> pte_swp_clear_exclusive()
>>> pte_swp_uffd_wp()
>>>
>>> These are PTE bits used for swap PTEs.
>>
>> Ahh wow, mind blown. Looks like a massive hack... why not store them in the
>> arch-independent swap entry, rather than have them squat independently in the
>> PTE?
>
> I think that was discussed at some point, but not only would it require quite
> some churn to change (all that swp entry code is a mess), these bits are also
> conceptually per-PTE and not something you would want to pass into the actual
> swap handling code, which couldn't care less about any of them.
>
> I looked at this when I added SWP exclusive, but accidentally losing the SWP
> exclusive marker when converting back and forth made me go the PTE route instead.
>
> Then, the available PTE bits are a bit scattered on some architectures, and
> converting entry<->PTE gets even uglier if we don't want to "lose" available bits.
>
> Probably the whole "unsigned long swp_entry" stuff should be replaced by a
> proper struct where we could more easily add flags and have the arch code handle
> the conversion to the PTE just once. The arch-specific swp_entry stuff is
> another nasty thing IMHO.

Yep, understood. I'll file this under "there be dragons". Thanks for the explanation.
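
For the record, one way to fix this in the re-spin would be to carry those per-PTE
swp bits forward when constructing the "expected" next PTE for the batch
comparison, rather than ignoring them. Rough, untested sketch only; the helper
name below is just a placeholder:

static inline pte_t __next_swp_pte(pte_t pte)
{
	swp_entry_t entry = pte_to_swp_entry(pte);
	pte_t new = swp_entry_to_pte(swp_entry(swp_type(entry),
					       swp_offset(entry) + 1));

	/* Carry the per-PTE swp bits listed above across the batch. */
	if (pte_swp_soft_dirty(pte))
		new = pte_swp_mksoft_dirty(new);
	if (pte_swp_exclusive(pte))
		new = pte_swp_mkexclusive(new);
	if (pte_swp_uffd_wp(pte))
		new = pte_swp_mkuffd_wp(new);

	return new;
}

With something like that, the batch detection can simply use pte_same(), and a
detected batch implies the soft-dirty/uffd-wp/exclusive bits are uniform across
all of its entries.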

>
>>
>> OK, my implementation is buggy. I'll re-spin to fix this.
>>
>>
>>>
>>> There are also dirty/young bits for migration entries, but that's not a
>>> concern here, because we stop for non_swap_entry().
>>
>> Looks like these are part of the offset field in the arch-independent swap entry
>> - much cleaner ;-).
>
> Note that it only applies to migration entries, and only when we have some spare
> bits due to PFN < offset.

Yep got it. Thanks!
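
To spell out why that's enough for the batching path: the helper is only called
for a regular (non-migration) swap entry, and any migration entry further along
in the range would fail a pte_same() comparison against the expected swap PTE
anyway (different swp type), so whatever is stashed in a migration entry's offset
never comes into play. Roughly, and again only an untested sketch reusing the
placeholder helper from above (note it takes the original pte rather than the
recovered swp_entry_t, so the per-PTE bits aren't lost):

static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
{
	pte_t expected = __next_swp_pte(pte);
	pte_t *ptep = start_ptep + 1;
	int nr = 1;

	VM_WARN_ON(!is_swap_pte(pte));
	VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));

	while (nr < max_nr) {
		pte = ptep_get(ptep);

		/*
		 * Stops on none/present PTEs, on a swp type/offset mismatch
		 * (which covers migration and other non-swap entries), and on
		 * any difference in the per-PTE swp bits.
		 */
		if (!pte_same(pte, expected))
			break;

		expected = __next_swp_pte(expected);
		ptep++;
		nr++;
	}

	return nr;
}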