Re: [PATCH v3 01/15] mm: Batch-copy PTE ranges during fork()

From: Ryan Roberts
Date: Tue Dec 05 2023 - 06:31:11 EST

Next message: Stephan Gerhold: "Re: [PATCH] arm64: dts: qcom: msm8916/39-samsung-a2015: Add PMIC and charger"
Previous message: Christian Brauner: "Re: [PATCH 1/2] Allow a kthread to declare that it calls task_work_run()"
In reply to: David Hildenbrand: "Re: [PATCH v3 01/15] mm: Batch-copy PTE ranges during fork()"
Next in thread: David Hildenbrand: "Re: [PATCH v3 01/15] mm: Batch-copy PTE ranges during fork()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 04/12/2023 17:27, David Hildenbrand wrote:
>>
>> With rmap batching from [1] -- rebased+changed on top of that -- we could turn
>> that into an effective (untested):
>>
>>           if (page && folio_test_anon(folio)) {
>> +               nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr, end,
>> +                                               pte, enforce_uffd_wp, &nr_dirty,
>> +                                               &nr_writable);
>>                   /*
>>                    * If this page may have been pinned by the parent process,
>>                    * copy the page immediately for the child so that we'll always
>>                    * guarantee the pinned page won't be randomly replaced in the
>>                    * future.
>>                    */
>> -               folio_get(folio);
>> -               if (unlikely(folio_try_dup_anon_rmap_pte(folio, page,
>> src_vma))) {
>> +               folio_ref_add(folio, nr);
>> +               if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page, nr,
>> src_vma))) {
>>                           /* Page may be pinned, we have to copy. */
>> -                       folio_put(folio);
>> -                       return copy_present_page(dst_vma, src_vma, dst_pte,
>> src_pte,
>> -                                                addr, rss, prealloc, page);
>> +                       folio_ref_sub(folio, nr);
>> +                       ret = copy_present_page(dst_vma, src_vma, dst_pte,
>> +                                               src_pte, addr, rss, prealloc,
>> +                                               page);
>> +                       return ret == 0 ? 1 : ret;
>>                   }
>> -               rss[MM_ANONPAGES]++;
>> +               rss[MM_ANONPAGES] += nr;
>>           } else if (page) {
>> -               folio_get(folio);
>> -               folio_dup_file_rmap_pte(folio, page);
>> -               rss[mm_counter_file(page)]++;
>> +               nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr, end,
>> +                                               pte, enforce_uffd_wp, &nr_dirty,
>> +                                               &nr_writable);
>> +               folio_ref_add(folio, nr);
>> +               folio_dup_file_rmap_ptes(folio, page, nr);
>> +               rss[mm_counter_file(page)] += nr;
>>           }
>>
>>
>> We'll have to test performance, but it could be that we want to specialize
>> more on !folio_test_large(). That code is very performance-sensitive.
>>
>>
>> [1] https://lkml.kernel.org/r/20231204142146.91437-1-david@xxxxxxxxxx
>
> So, on top of [1] without rmap batching but with a slightly modified version of

Can you clarify what you mean by "without rmap batching"? I thought [1]
implicitly adds rmap batching? (e.g. folio_dup_file_rmap_ptes(), which you've
added in the code snippet above).

> yours (that keeps the existing code structure as pointed out and e.g., updates
> counter updates), running my fork() microbenchmark with a 1 GiB of memory:
>
> Compared to [1], with all order-0 pages it gets 13--14% _slower_ and with all
> PTE-mapped THP (order-9) it gets ~29--30% _faster_.

What test are you running - I'd like to reproduce if possible, since it sounds
like I've got some work to do to remove the order-0 regression.

>
> So looks like we really want to have a completely seprate code path for
> "!folio_test_large()" to keep that case as fast as possible. And "Likely" we
> want to use "likely(!folio_test_large()". ;)

Yuk, but fair enough. If I can repro the perf numbers, I'll have a go a
reworking this.

I think you're also implicitly suggesting that this change needs to depend on
[1]? Which is a shame...

I guess I should also go through a similar exercise for patch 2 in this series.

>
> Performing rmap batching on top of that code only slightly (another 1% or so)
> improves performance in the PTE-mapped THP (order-9) case right now, in contrast
> to other rmap batching. Reason is as all rmap code gets inlined here and we're
> only doing subpage mapcount updates + PAE handling.
>

Next message: Stephan Gerhold: "Re: [PATCH] arm64: dts: qcom: msm8916/39-samsung-a2015: Add PMIC and charger"
Previous message: Christian Brauner: "Re: [PATCH 1/2] Allow a kthread to declare that it calls task_work_run()"
In reply to: David Hildenbrand: "Re: [PATCH v3 01/15] mm: Batch-copy PTE ranges during fork()"
Next in thread: David Hildenbrand: "Re: [PATCH v3 01/15] mm: Batch-copy PTE ranges during fork()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]