Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

From: Ryan Roberts
Date: Mon Nov 27 2023 - 04:24:40 EST


On 27/11/2023 05:54, Barry Song wrote:
>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> + pte_t *dst_pte, pte_t *src_pte,
>> + unsigned long addr, unsigned long end,
>> + int *rss, struct folio **prealloc)
>> {
>> struct mm_struct *src_mm = src_vma->vm_mm;
>> unsigned long vm_flags = src_vma->vm_flags;
>> pte_t pte = ptep_get(src_pte);
>> struct page *page;
>> struct folio *folio;
>> + int nr = 1;
>> + bool anon;
>> + bool any_dirty = pte_dirty(pte);
>> + int i;
>>
>> page = vm_normal_page(src_vma, addr, pte);
>> - if (page)
>> + if (page) {
>> folio = page_folio(page);
>> - if (page && folio_test_anon(folio)) {
>> - /*
>> - * If this page may have been pinned by the parent process,
>> - * copy the page immediately for the child so that we'll always
>> - * guarantee the pinned page won't be randomly replaced in the
>> - * future.
>> - */
>> - folio_get(folio);
>> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>> - /* Page may be pinned, we have to copy. */
>> - folio_put(folio);
>> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>> - addr, rss, prealloc, page);
>> + anon = folio_test_anon(folio);
>> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>> + end, pte, &any_dirty);
>
> in case we have a large folio with 16 CONTPTE basepages, and userspace
> do madvise(addr + 4KB * 5, DONTNEED);

nit: if you are offsetting by 5 pages from addr, then below I think you mean
page0~page4 and page6~15?

>
> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> will return 15. in this case, we should copy page0~page3 and page5~page15.

No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
not how its intended to work. The function is scanning forwards from the current
pte until it finds the first pte that does not fit in the batch - either because
it maps a PFN that is not contiguous, or because the permissions are different
(although this is being relaxed a bit; see conversation with DavidH against this
same patch).

So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
(page0~page4) then the next time through the loop we will go through the
!present path and process the single swap marker. Then the 3rd time through the
loop folio_nr_pages_cont_mapped() will return 10.

Thanks,
Ryan

>
> but the current code is copying page0~page14, right? unless we are immediatly
> split_folio to basepages in zap_pte_range(), we will have problems?
>
>> +
>> + for (i = 0; i < nr; i++, page++) {
>> + if (anon) {
>> + /*
>> + * If this page may have been pinned by the
>> + * parent process, copy the page immediately for
>> + * the child so that we'll always guarantee the
>> + * pinned page won't be randomly replaced in the
>> + * future.
>> + */
>> + if (unlikely(page_try_dup_anon_rmap(
>> + page, false, src_vma))) {
>> + if (i != 0)
>> + break;
>> + /* Page may be pinned, we have to copy. */
>> + return copy_present_page(
>> + dst_vma, src_vma, dst_pte,
>> + src_pte, addr, rss, prealloc,
>> + page);
>> + }
>> + rss[MM_ANONPAGES]++;
>> + VM_BUG_ON(PageAnonExclusive(page));
>> + } else {
>> + page_dup_file_rmap(page, false);
>> + rss[mm_counter_file(page)]++;
>> + }
>
> Thanks
> Barry
>