Re: [RFC PATCH 5/5] mm proc/task_mmu.c: add hugetlb specific routine for clear_refs

From: Mike Kravetz
Date: Thu Feb 18 2021 - 19:15:40 EST


On 2/17/21 12:25 PM, Peter Xu wrote:
> On Wed, Feb 10, 2021 at 04:03:22PM -0800, Mike Kravetz wrote:
>> There was is no hugetlb specific routine for clearing soft dirty and
>> other referrences. The 'default' routines would only clear the
>> VM_SOFTDIRTY flag in the vma.
>>
>> Add new routine specifically for hugetlb vmas.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>> ---
>> fs/proc/task_mmu.c | 110 +++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 110 insertions(+)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 829b35016aaa..f06cf9b131a8 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -1116,6 +1116,115 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
>> }
>> #endif
>>
>> +#ifdef CONFIG_HUGETLB_PAGE
>> +static inline bool huge_pte_is_pinned(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t pte)
>> +{
>> + struct page *page;
>> +
>> + if (likely(!atomic_read(&vma->vm_mm->has_pinned)))
>> + return false;
>> + page = pte_page(pte);
>> + if (!page)
>> + return false;
>> + return page_maybe_dma_pinned(page);
>> +}
>> +
>> +static int clear_refs_hugetlb_range(pte_t *ptep, unsigned long hmask,
>> + unsigned long addr, unsigned long end,
>> + struct mm_walk *walk)
>> +{
>> + struct clear_refs_private *cp = walk->private;
>> + struct vm_area_struct *vma = walk->vma;
>> + struct hstate *h = hstate_vma(walk->vma);
>> + unsigned long adj_start = addr, adj_end = end;
>> + spinlock_t *ptl;
>> + pte_t old_pte, pte;
>> +
>> + /*
>> + * clear_refs should only operate on complete vmas. Therefore,
>> + * values passed here should be huge page aligned and huge page
>> + * size in length. Quick validation before taking any action in
>> + * case upstream code is changed.
>> + */
>> + if ((addr & hmask) != addr || end - addr != huge_page_size(h)) {
>> + WARN_ONCE(1, "%s passed unaligned address\n", __func__);
>> + return 1;
>> + }
>
> I wouldn't worry too much on the interface change - The one who will change the
> interface should guarantee all existing hooks will still work, isn't it? :)

Yeah, I can drop this.

> It's slightly confusing to me on why "clear_refs should only operate on
> complete vmas" is related to the check, though.

Mostly me thinking that since it is operating on complete (hugetlb) vma,
then we know vms is huge page aligned and a multiple of huge page in size.
So, all passed addressed should be huge page aligned as well.

>
>> +
>> + ptl = huge_pte_lock(hstate_vma(vma), walk->mm, ptep);
>> +
>> + /* Soft dirty and pmd sharing do not mix */
>
> Right, this seems to be a placeholder for unsharing code.

Sorry, comment was left over from earlier code. Unsharing is actually
done below, I forgot to remove comment.

> Though maybe we can do that earlier in pre_vma() hook? That should be per-vma
> rather than handling one specific huge page here, hence more efficient imho.

Yes, let me look into that. The code below is certianly not the most
efficient.

> this reminded me that I should also better move hugetlb_unshare_all_pmds() of
> my other patch into hugetlb.c, so that this code can call it. Currently it's a
> static function in userfaultfd.c.
>
>> +
>> + pte = huge_ptep_get(ptep);
>> + if (!pte_present(pte))
>> + goto out;
>> + if (unlikely(is_hugetlb_entry_hwpoisoned(pte)))
>> + goto out;
>> +
>> + if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
>
> Maybe move this check into clear_refs_test_walk()? We can bail out earlier if:
>
> (is_vm_hugetlb_page(vma) && (type != CLEAR_REFS_SOFT_DIRTY))
>

Yes, we can do that. I was patterning this after the other 'clear_refs'
routines. But, they can clear things besides soft dirty. Since soft
dirty is the only thing handled for hugetlb, we can bail earlier.

>> + if (huge_pte_is_pinned(vma, addr, pte))
>> + goto out;
>
> Out of topic of this patchset, but it's definitely a pity that we can't track
> soft dirty for pinned pages. Currently the assumption of the pte code path is:
> "if this page can be DMA written then we won't know whether data changed after
> all, then tracking dirty is meaningless", however that's prone to change when
> new hardwares coming, say, IOMMU could start to trap DMA writes already.
>
> But again that's another story.. and we should just follow what we do with
> non-hugetlbfs for sure here, until some day if we'd like to revive soft dirty
> tracking with pinned pages.
>
>> +
>> + /*
>> + * soft dirty and pmd sharing do not work together as
>> + * per-process is tracked in ptes, and pmd sharing allows
>> + * processed to share ptes. We unshare any pmds here.
>> + */
>> + adjust_range_if_pmd_sharing_possible(vma, &adj_start, &adj_end);
>
> Ideally when reach here, huge pmd sharing won't ever exist, right? Then do we
> still need to adjust the range at all?

Right, we should be able to do it earlier.

Thanks again for taking a look at this.
--
Mike Kravetz

>
> Thanks,
>