Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
From: Zi Yan
Date: Fri Feb 21 2025 - 18:47:35 EST
On 21 Feb 2025, at 1:17, Baolin Wang wrote:
> On 2025/2/21 10:38, Zi Yan wrote:
>> On 20 Feb 2025, at 21:33, Zi Yan wrote:
>>
>>> On 20 Feb 2025, at 8:06, Zi Yan wrote:
>>>
>>>> On 20 Feb 2025, at 4:27, Baolin Wang wrote:
>>>>
>>>>> On 2025/2/20 17:07, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2025/2/20 00:10, Zi Yan wrote:
>>>>>>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>>>>>>
>>>>>>>> Hi Zi,
>>>>>>>>
>>>>>>>> Sorry for the late reply due to being busy with other things:)
>>>>>>>
>>>>>>> Thank you for taking a look at the patches. :)
>>>>>>>
>>>>>>>>
>>>>>>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>>>>>>> During shmem_split_large_entry(), large swap entries are covering n slots
>>>>>>>>> and an order-0 folio needs to be inserted.
>>>>>>>>>
>>>>>>>>> Instead of splitting all n slots, only the 1 slot covered by the folio
>>>>>>>>> need to be split and the remaining n-1 shadow entries can be retained with
>>>>>>>>> orders ranging from 0 to n-1. This method only requires
>>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
>>>>>>>>> (n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
>>>>>>>>> xas_split_alloc() + xas_split() one.
>>>>>>>>>
>>>>>>>>> For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
>>>>>>>>> is 6), 1 xa_node is needed instead of 8.
>>>>>>>>>
>>>>>>>>> xas_try_split_min_order() is used to reduce the number of calls to
>>>>>>>>> xas_try_split() during split.
>>>>>>>>
>>>>>>>> For shmem swapin, if we cannot swap in the whole large folio by skipping the swap cache, we will split the large swap entry stored in the shmem mapping into order-0 swap entries, rather than splitting it into other orders of swap entries. This is because the next time we swap in a shmem folio through shmem_swapin_cluster(), it will still be an order 0 folio.
>>>>>>>
>>>>>>> Right. But the swapin is one folio at a time, right? shmem_split_large_entry()
>>>>>>
>>>>>> Yes, now we always swapin an order-0 folio from the async swap device at a time. However, for sync swap device, we will skip the swapcache and swapin the whole large folio by commit 1dd44c0af4fa, so it will not call shmem_split_large_entry() in this case.
>>>>
>>>> Got it. I will check the commit.
>>>>
>>>>>>
>>>>>>> should split the large swap entry and give you a slot to store the order-0 folio.
>>>>>>> For example, with an order-9 large swap entry, to swap in first order-0 folio,
>>>>>>> the large swap entry will become order-0, order-0, order-1, order-2,… order-8,
>>>>>>> after the split. Then the first order-0 swap entry can be used.
>>>>>>> Then, when a second order-0 is swapped in, the second order-0 can be used.
>>>>>>> When the last order-0 is swapped in, the order-8 would be split to
>>>>>>> order-7,order-6,…,order-1,order-0, order-0, and the last order-0 will be used.
>>>>>>
>>>>>> Yes, understood. However, for the sequential swapin scenarios, where originally only one split operation is needed. However, your approach increases the number of split operations. Of course, I understand that in non-sequential swapin scenarios, your patch will save some xarray memory. It might be necessary to evaluate whether the increased split operations will have a significant impact on the performance of sequential swapin?
>>>>
>>>> Is there a shmem swapin test I can run to measure this? xas_try_split() should
>>>> performance similar operations as existing xas_split_alloc()+xas_split().
>
> I think a simple sequential swapin case is enough? Anyway I can help to evaluate the performance impact with your new patch.
>
>>>>>>> Maybe the swapin assumes after shmem_split_large_entry(), all swap entries
>>>>>>> are order-0, which can lead to issues. There should be some check like
>>>>>>> if the swap entry order > folio_order, shmem_split_large_entry() should
>>>>>>> be used.
>>>>>>>>
>>>>>>>> Moreover I did a quick test with swapping in order 6 shmem folios, however, my test hung, and the console was continuously filled with the following information. It seems there are some issues with shmem swapin handling. Anyway, I need more time to debug and test.
>>>>>>> To swap in order-6 folios, shmem_split_large_entry() does not allocate
>>>>>>> any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see OOM
>>>>>>> error below. Let me know if there is anything I can help.
>>>>>>
>>>>>> I encountered some issues while testing order 4 and order 6 swapin with your patches. And I roughly reviewed the patch, and it seems that the new swap entry stored in the shmem mapping was not correctly updated after the split.
>>>>>>
>>>>>> The following logic is to reset the swap entry after split, and I assume that the large swap entry is always split to order 0 before. As your patch suggests, if a non-uniform split is used, then the logic for resetting the swap entry needs to be changed? Please correct me if I missed something.
>>>>>>
>>>>>> /*
>>>>>> * Re-set the swap entry after splitting, and the swap
>>>>>> * offset of the original large entry must be continuous.
>>>>>> */
>>>>>> for (i = 0; i < 1 << order; i++) {
>>>>>> pgoff_t aligned_index = round_down(index, 1 << order);
>>>>>> swp_entry_t tmp;
>>>>>>
>>>>>> tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>>>> __xa_store(&mapping->i_pages, aligned_index + i,
>>>>>> swp_to_radix_entry(tmp), 0);
>>>>>> }
>>>>
>>>> Right. I will need to adjust swp_entry_t. Thanks for pointing this out.
>>>>
>>>>>
>>>>> In addition, after your patch, the shmem_split_large_entry() seems always return 0 even though it splits a large swap entry, but we still need re-calculate the swap entry value after splitting, otherwise it may return errors due to shmem_confirm_swap() validation failure.
>>>>>
>>>>> /*
>>>>> * If the large swap entry has already been split, it is
>>>>> * necessary to recalculate the new swap entry based on
>>>>> * the old order alignment.
>>>>> */
>>>>> if (split_order > 0) {
>>>>> pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>
>>>>> swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>> }
>>>>
>>>> Got it. I will fix it.
>>>>
>>>> BTW, do you mind sharing your swapin tests so that I can test my new version
>>>> properly?
>>>
>>> The diff below adjusts the swp_entry_t and returns the right order after
>>> shmem_split_large_entry(). Let me know if it fixes your issue.
>>
>> Fixed the compilation error. It will be great if you can share a swapin test, so that
>> I can test locally. Thanks.
>
> Sure. I've attached 3 test shmem swapin cases to see if they can help you with testing. I will also find time next week to review and test your patch.
>
> Additionally, you can use zram as a swap device and disable the skipping swapcache feature to test the split logic quickly:
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 745f130bfb4c..7374d5c1cdde 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2274,7 +2274,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> folio = swap_cache_get_folio(swap, NULL, 0);
> if (!folio) {
> int order = xa_get_order(&mapping->i_pages, index);
> - bool fallback_order0 = false;
> + bool fallback_order0 = true;
> int split_order;
>
> /* Or update major stats only when swapin succeeds?? */
Thank you for the testing programs and the patch above. With zswap enabled,
I do not see any crash. I also tried to mount a tmpfs, dd a file that
is larger than total memory, and read the file out. The system crashed
with my original patch but no longer crashes with my fix.
In terms of performance, I used your shmem_aligned_swapin.c and increased
the shmem size from 1GB to 10GB and measured the time of memset at the
end, which swaps in memory and triggers split large entry. I see no difference
between with and without my patch.
I will wait for your results to confirm my fix. Really appreciate your help.
BTW, without zswap, it seems that madvise(MADV_PAGEOUT) does not write
shmem to swapfile and during swapin, swap_cache_get_folio() always gets
a folio. I wonder what is the difference between zswap and a swapfile.
--
Best Regards,
Yan, Zi