Re: [PATCH v2 1/1] mm/contpte: Optimize loop to reduce redundant operations

From: Gavin Shan
Date: Wed Apr 09 2025 - 23:07:11 EST


Hi Xavier,

On 4/10/25 12:53 PM, Xavier wrote:
At 2025-04-10 08:58:46, "Gavin Shan" <gshan@xxxxxxxxxx> wrote:

On 4/10/25 1:10 AM, Xavier wrote:

Thank you for carefully reviewing this patch and raising your questions.
I'll try to explain and answer them below.


Not a problem :)


At 2025-04-09 12:09:48, "Gavin Shan" <gshan@xxxxxxxxxx> wrote:
Hi Xavier,

On 4/8/25 6:58 PM, Xavier wrote:
This commit optimizes the contpte_ptep_get function by adding early
termination logic. It checks if the dirty and young bits of orig_pte
are already set and skips redundant bit-setting operations during
the loop. This reduces unnecessary iterations and improves performance.

Signed-off-by: Xavier <xavier_qy@xxxxxxx>
---
arch/arm64/mm/contpte.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bcac4f55f9c1..034d153d7d19 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -152,6 +152,18 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
}
EXPORT_SYMBOL_GPL(__contpte_try_unfold);

I'm wandering how it can work. More details are given below.

+#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
+ do { \
+ int _start = start; \
+ pte_t *_ptep = ptep; \
+ while (_start++ < CONT_PTES) { \
+ if (pte_##flag(__ptep_get(_ptep++))) { \
+ orig_pte = pte_mk##flag(orig_pte); \
+ break; \
+ } \
+ } \
+ } while (0)
+

nit: the variable _start and _ptep can be dropped since the caller is going to
bail after CHECK_CONTPTE_FLAG(). However, I'm wandering it's going to introduce
burden to contpte_ptep_get() for its readability, much more complex than I
thought.

Not adding these two variables in the current code has no impact. The main purpose
of adding them is to improve maintainability and prevent the external code from
continuing operations unknowingly after the macro has modified the variables.



CONT_PTES is 16 with the assumption of 4KB base page size. CHECK_CONTPTE_FLAG()
collects the flag for ptep[start, CONT_PTES].

pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
{
/*
@@ -169,11 +181,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
for (i = 0; i < CONT_PTES; i++, ptep++) {
pte = __ptep_get(ptep);
- if (pte_dirty(pte))
+ if (pte_dirty(pte)) {
orig_pte = pte_mkdirty(orig_pte);
+ CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
+ break;
+ }
- if (pte_young(pte))
+ if (pte_young(pte)) {
orig_pte = pte_mkyoung(orig_pte);
+ CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
+ break;
+ }
}
return orig_pte;

There are two issues as I can see: (1) The loop stops on any dirty or young flag. Another
flag can be missed when one flag is seen. For example, when ptep[0] has both dirty/young
flag, only the dirty flag is collected. (2) CHECK_CONTPTE_FLAG() iterates ptep[i, CONT_PTES],
conflicting to the outer loop, which iterates ptep[0, CONT_PTES].

No flags will be missed. The outer loop is used to check for the first flag,
which could be either the dirty or young flag.
Once this flag (let's assume it's the dirty flag) is found in the i-th PTE,
the dirty flag of orig_pte will be set, and the code will immediately enter
the inner loop, namely CHECK_CONTPTE_FLAG. This inner loop will continue
to check only for the young flag starting from the i-th position, and we needn't
concern about the dirty flag anymore.
If CHECK_CONTPTE_FLAG finds the young flag in the j-th PTE, the young flag
of orig_pte will be set. At this point, both the young and dirty flags of
orig_pte have been set, and there's no need for further loop judgments, so
the both the inner and outer loops can be exited directly. This approach
reduces unnecessary repeated traversals and judgments.


Thanks for the explanation. I missed that the subsequent young bit is collected
on pte_dirty(). Similarly, the subsequent dirty bit is collected on pte_young().
Now I can see all (dirty | young) bits are collected with a lost.


Besides, I also doubt how much performance can be gained by bailing early on (dirty | young).
However, it's avoided to cross the L1D cache line boundary if possible. With 4KB base page
size, 128 bytes are needed for ptep[CONT_PTES], equal to two cache lines. If we can bail
early with luck, we don't have to step into another cache line. Note that extra checks needs
more CPU cycles.

Compared to the previous function, this code doesn't add any extra checks.
Even in the worst-case scenario, where neither a dirty nor a young flag is
found among the 16 PTEs, the number of checks is the same as in the original
function. If any flag is found earlier, the optimized patch will reduce the
number of subsequent checks for that flag compared to the original code.


There are 32 checks in the original code (assuming we have 4kb base page size
and CONT_PTES == 16) and the number of checks is fixed. With your patch applied,
the number becomes 32 + (number from CHECK_CONTPTE_FLAG) if all PTEs have
pte_dirty() and none of them has pte_young(), if I don't miss anything here.


If all PTEs are dirty and none are young, the code will enter CHECK_CONTPTE_FLAG
when it encounters the first dirty PTE. Inside CHECK_CONTPTE_FLAG, it only checks
for the young bit (16 - 0) times and then exits all loops. In this case, the total number
of checks is 1 + 16, which is less than the original 32.


You're correct. The outer loop is going to stop on pte_dirty() or pte_young(). Well,
you can see the code becomes hard to be understood, at least to me :)

[...]

Thanks,
Gavin