Re: [PATCH v1] mm/contpte: Optimize loop to reduce redundant operations

From: Lance Yang
Date: Tue Apr 08 2025 - 05:19:12 EST


On Tue, Apr 8, 2025 at 12:19 AM Dev Jain <dev.jain@xxxxxxx> wrote:
>
>
> Hi Xavier,
>
> On 07/04/25 7:01 pm, Lance Yang wrote:
> > On Mon, Apr 7, 2025 at 8:56 PM Xavier <xavier_qy@xxxxxxx> wrote:
> >>
> >>
> >>
> >> Hi Lance,
> >>
> >> Thanks for your feedback, my response is as follows.
> >>
> >> --
> >> Thanks,
> >> Xavier
> >>
> >>
> >>
> >>
> >>
> >> At 2025-04-07 19:29:22, "Lance Yang" <ioworker0@xxxxxxxxx> wrote:
> >>> Thanks for the patch. Would the following change be better?
> >>>
> >>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> >>> index 55107d27d3f8..64eb3b2fbf06 100644
> >>> --- a/arch/arm64/mm/contpte.c
> >>> +++ b/arch/arm64/mm/contpte.c
> >>> @@ -174,6 +174,9 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> >>>
> >>> if (pte_young(pte))
> >>> orig_pte = pte_mkyoung(orig_pte);
> >>> +
> >>> + if (pte_young(orig_pte) && pte_dirty(orig_pte))
> >>> + break;
> >>> }
>
> Quite the coincidence, I was thinking of doing exactly this some days
> back and testing it out : ) Can you do a microanalysis whether this gets
> us a benefit or not? This looks like an optimization on paper but may
> not be one after all because CONT_PTES is only 16 and a simple loop
> without extra if-conditions may just be faster.

Yeah, I was thinking the same ;)

This change seems reasonable in theory. But with CONT_PTES=16, keeping
it simple might actually be faster, IMO.

Thanks,
Lance

>
> >>>
> >>> return orig_pte;
> >>> --
> >>>
> >>> We can check the orig_pte flags directly instead of using extra boolean
> >>> variables, which gives us an early-exit when both dirty and young flags
> >>> are set.
> >> Your way of writing the code is indeed more concise. However, I think
> >> using boolean variables might be more efficient. Although it introduces
> >> additional variables, comparing boolean values is likely to be more
> >> efficient than checking bit settings.
> >>
> >>>
> >>> Also, is this optimization really needed for the common case?
> >> This function is on a high-frequency execution path. During debugging,
> >> I found that in most cases, the first few pages are already marked as
> >> both dirty and young. But currently, the program still has to complete
> >> the entire loop of 16 ptep iterations, which seriously reduces the efficiency.
> >
> > Hmm... agreed that this patch helps when early PTEs are dirty/young, but
> > for late-ones-only cases, it only introduces overhead with no benefit, IIUC.
> >
> > So, let's wait for folks to take a look ;)
> >
> > Thanks,
> > Lance
> >
> >>>
> >>> Thanks,
> >>> Lance
> >
>