Re: [PATCH v6 15/26] huge_memory: Add vmf_insert_folio_pud()

From: Alistair Popple
Date: Wed Jan 15 2025 - 01:38:38 EST


On Tue, Jan 14, 2025 at 05:22:15PM +0100, David Hildenbrand wrote:
> On 10.01.25 07:00, Alistair Popple wrote:
> > Currently DAX folio/page reference counts are managed differently to
> > normal pages. To allow these to be managed the same as normal pages
> > introduce vmf_insert_folio_pud. This will map the entire PUD-sized folio
> > and take references as it would for a normally mapped page.
> >
> > This is distinct from the current mechanism, vmf_insert_pfn_pud, which
> > simply inserts a special devmap PUD entry into the page table without
> > holding a reference to the page for the mapping.
> >
> > Signed-off-by: Alistair Popple <apopple@xxxxxxxxxx>
>
> [...]
>
> > +/**
> > + * vmf_insert_folio_pud - insert a pud size folio mapped by a pud entry
> > + * @vmf: Structure describing the fault
> > + * @folio: folio to insert
> > + * @write: whether it's a write fault
> > + *
> > + * Return: vm_fault_t value.
> > + */
> > +vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, bool write)
> > +{
> > + struct vm_area_struct *vma = vmf->vma;
> > + unsigned long addr = vmf->address & PUD_MASK;
> > + pud_t *pud = vmf->pud;
> > + struct mm_struct *mm = vma->vm_mm;
> > + spinlock_t *ptl;
> > +
> > + if (addr < vma->vm_start || addr >= vma->vm_end)
> > + return VM_FAULT_SIGBUS;
> > +
> > + if (WARN_ON_ONCE(folio_order(folio) != PUD_ORDER))
> > + return VM_FAULT_SIGBUS;
> > +
> > + ptl = pud_lock(mm, pud);
> > + if (pud_none(*vmf->pud)) {
> > + folio_get(folio);
> > + folio_add_file_rmap_pud(folio, &folio->page, vma);
> > + add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
> > + }
> > + insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)), write);
>
> This looks scary at first (inserting something when not taking a reference),
> but insert_pfn_pud() seems to handle that. A comment here would have been
> nice.

Indeed, I will add one.

> It's weird, though, that if there is already something else, that we only
> WARN but don't actually return an error. So ...

Note we only WARN when there is already a mapping there and we're trying to
upgrade it to writeable. This just mimics the logic which currently exists in
insert_pfn() and insert_pfn_pmd().

The comment in insert_pfn() sheds more light:

/*
* For read faults on private mappings the PFN passed
* in may not match the PFN we have mapped if the
* mapped PFN is a writeable COW page. In the mkwrite
* case we are creating a writable PTE for a shared
* mapping and we expect the PFNs to match. If they
* don't match, we are likely racing with block
* allocation and mapping invalidation so just skip the
* update.
*/

> > + spin_unlock(ptl);
> > +
> > + return VM_FAULT_NOPAGE;
>
> I assume always returning VM_FAULT_NOPAGE, even when something went wrong,
> is the right thing to do?

Yes, I think so. I guess in the WARN case we could return something like
VM_FAULT_SIGBUS to kill the application, but the existing vmf_insert_*()
functions don't currently do that so I think that would be a separate clean-up.

> Apart from that LGTM.
>
>
> --
> Cheers,
>
> David / dhildenb
>