Re: [PATCH v7 4/6] ksm: add pgoff into ksm_rmap_item

From: David Hildenbrand (Arm)

Date: Fri Jun 05 2026 - 13:25:44 EST


On 5/30/26 11:07, xu.xin16@xxxxxxxxxx wrote:
> From: xu xin <xu.xin16@xxxxxxxxxx>

You should wrap you patch description at 75 columns max.

(Documentation/process/submitting-patches.rst)

>
> The reason for adding pgoff to ksm_rmap_item has been discussed in previous
> mailing list threads [1][2]. The main purpose is to allow the KSM reverse mapping
> to obtain the original page's linear page index, so that during anon_vma_tree
> travering, it can conditionally locate the VMAs and avoid scanning the entire
> address space [0, ULONG_MAX].

I suggest

"As preparation for KSM rmap optimizations, let's track the original
linear_page_index() of a de-duplicated page in its ksm_rmap_item, so we can
efficiently search for the page in an address space, avoiding scanning the
entire address space. This was previously discussed in [1, 2]."

>
> To minimize the size impact of adding pgoff to ksm_rmap_item as much as
> possible, a trick that David suggested is to use a UNION that groups the members
> related to the unstable tree together with the newly added linear page index. The
> members that valids only when in unstable tree include oldchecksum and age information.
> However, the function should_skip_rmap_item() in the smart scanning needs slight
> modification, since this function still uses the age information even when the
> rmap_item is in a stable state (the page is not KSM), a situation that occurs
> during COW faults. After using union, the size is still 64 byte without increasing.

And here

"To avoid growing ksm_rmap_item, let's squeeze it into the existing structure by
overlying some members (oldchecksum, age, remaining_skips) that are only
relevant while on the unstable tree. The new entry will only be relevant for
entries in the stable tree.

However, as the age information is read by should_skip_rmap_item() with the
smart-scanning approach even while we have an entry in the stable tree, but the
page changes (no longer a KSM page, for example due to COW), we have to change
the handling there a bit.
"

>
> We keep the same way to store the pgoff as rmap->anon_vma which is set when the page
> is merged and become a KsmPage at try_to_merge_with_ksm_page(), and reset at
> remove_rmap_item_from_tree() and remove_node_from_stable_tree() and reset when break_cow.
>

I wonder if "pgoff" is the right term here. Hm. Can we simply shamelessly call
it "linear_page_index" ? (or just index? )

So maybe

"We'll calculate the linear page index in try_to_merge_with_ksm_page(), when
adding it to the stable tree, and reset the index (to reset overlayed data) when
removing an item from the stable tree -- in remove_rmap_item_from_tree(),
remove_node_from_stable_tree() and break_cow()."


That means that age information is reset whenever merging fails. We could try
merging more often. I assume that's nothing we care about in practice. If we
ever do, we can e.g., look into remembering it on the callpath somehow.

But I suspect we don't care.


> To be specially clarified, the reason for resetting pgoff at break_cow() is:

s/pgoff/the stored index/

>
> - When a page successfully becomes a KSM page (i.e., after stable_tree_append()
> sets STABLE_FLAG), both anon_vma and vm_pgoff are stored and remain valid.

s/vm_off/the index/

> - However, during the merging process there are several failure paths where a
> page that was temporarily treated as a KSM page must be reverted back to an

What about:

"However, during the merging process, there are several failure paths where we
already prepared an rmap item to be added to the stable tree, but must revert
that as some part of the merge process failed. Examples include:"

> anonymous page. Examples include:
> * The second call to try_to_merge_with_ksm_page() fails in
> try_to_merge_two_pages().
> * stable_tree_insert() fails in cmp_and_merge_page().
> In such cases, break_cow() is invoked to break the COW mapping and discard
> the KSM state.
>
> Currently, break_cow() already contains a put_anon_vma(rmap_item->anon_vma)
> to release the reference taken during the aborted merge. Because 'pgoff' is
> logically paired with anon_vma (both are only meaningful when the rmap_item
> is in a stable state), it must also be cleared (or reset) in break_cow() to
> avoid leaving stale pgoff values that could confuse subsequent rmap walks
> or scanning logic.
>
> [1] https://lore.kernel.org/all/adTPQSb-qSSHviJN@lucifer/
> [2] https://lore.kernel.org/all/202604091806051535BJWZ_FTtdIm3Snk24ei_@xxxxxxxxxx/
>

[...]

> Suggested-by: David Hildenbrand (Arm) <david@xxxxxxxxxx>
> Signed-off-by: xu xin <xu.xin16@xxxxxxxxxx>
> ---
> mm/ksm.c | 41 ++++++++++++++++++++++++++++++++++-------
> 1 file changed, 34 insertions(+), 7 deletions(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 7d5b76478f0b..4761ca3fa984 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -195,22 +195,28 @@ struct ksm_stable_node {
> * @node: rb node of this rmap_item in the unstable tree
> * @head: pointer to stable_node heading this list in the stable tree
> * @hlist: link into hlist of rmap_items hanging off that stable_node
> - * @age: number of scan iterations since creation
> - * @remaining_skips: how many scans to skip
> + * @age: number of scan iterations since creation (unstable node)
> + * @remaining_skips: how many scans to skip (unstable node)
> + * @pgoff: pgoff into @anon_vma where the page is mapped (stable tree)

Is it then "stable node" ?

> */
> struct ksm_rmap_item {
> struct ksm_rmap_item *rmap_list;
> union {
> - struct anon_vma *anon_vma; /* when stable */
> + struct anon_vma *anon_vma; /* for reverse mapping, when stable */
> #ifdef CONFIG_NUMA
> int nid; /* when node of unstable tree */
> #endif
> };
> struct mm_struct *mm;
> unsigned long address; /* + low bits used for flags below */
> - unsigned int oldchecksum; /* when unstable */
> - rmap_age_t age;
> - rmap_age_t remaining_skips;
> + union {
> + struct {
> + unsigned int oldchecksum;
> + rmap_age_t age;
> + rmap_age_t remaining_skips;
> + }; /* when unstable */
> + unsigned long pgoff; /* for reverse mapping, when stable */

Again, suggesting "index" or "linear_page_index".

> + };
> union {
> struct rb_node node; /* when node of unstable tree */
> struct { /* when listed from stable tree */
> @@ -776,6 +782,10 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
> return vma;
> }
>
> +/*
> + * break_cow: actively break the write-protect of the VMA. This is called when

This description is not quite correct. We unshare the KSM page (break cow ->
create an anonymous apge copy). It has nothing to do with VMA write-protection.

"actively break COW, replacing the KSM page by a fresh anonymous page."


> + * rmap_item has not yet become stable, but page has been merged.
> + */
> static void break_cow(struct ksm_rmap_item *rmap_item)
> {
> struct mm_struct *mm = rmap_item->mm;
> @@ -787,6 +797,8 @@ static void break_cow(struct ksm_rmap_item *rmap_item)
> * to undo, we also need to drop a reference to the anon_vma.
> */
> put_anon_vma(rmap_item->anon_vma);
> + /* Reset pgoff that might overlay age-related information. (still unstable) */
> + rmap_item->pgoff = 0;
>
> mmap_read_lock(mm);
> vma = find_mergeable_vma(mm, addr);
> @@ -899,6 +911,8 @@ static void remove_node_from_stable_tree(struct ksm_stable_node *stable_node)
> VM_BUG_ON(stable_node->rmap_hlist_len <= 0);
> stable_node->rmap_hlist_len--;
> put_anon_vma(rmap_item->anon_vma);
> + /* Reset pgoff that might overlay age-related information. */
> + rmap_item->pgoff = 0;
> rmap_item->address &= PAGE_MASK;
> cond_resched();
> }
> @@ -1052,6 +1066,8 @@ static void remove_rmap_item_from_tree(struct ksm_rmap_item *rmap_item)
> stable_node->rmap_hlist_len--;
>
> put_anon_vma(rmap_item->anon_vma);
> + /* Reset pgoff that might overlay age-related information. */
> + rmap_item->pgoff = 0;
> rmap_item->head = NULL;
> rmap_item->address &= PAGE_MASK;
>
> @@ -1598,8 +1614,15 @@ static int try_to_merge_with_ksm_page(struct ksm_rmap_item *rmap_item,
> /* Unstable nid is in union with stable anon_vma: remove first */
> remove_rmap_item_from_tree(rmap_item);
>
> - /* Must get reference to anon_vma while still holding mmap_lock */
> + /*
> + * Must get reference to anon_vma while still holding mmap_lock,

s/,/./

> + * We set these two members of stable node here instead of
> + * stable_tree_append(), maybe because we don't want to hold
> + * mmap_read_lock again. Here mmap_read_lock is already held to
> + * find_mergeable_vma before merging.

Is the "maybe" stuff really something we should put into a comment?

What about something like:

"Must can only reference the VMA while still holding the mmap lock, so
reference the anon_vma and calculate the linear page index early, before
stable_tree_append(). If anything goes wrong, break_cow() will clean it up."

> + */
> rmap_item->anon_vma = vma->anon_vma;
> + rmap_item->pgoff = linear_page_index(vma, rmap_item->address);
> get_anon_vma(vma->anon_vma);
> out:
> mmap_read_unlock(mm);
> @@ -2458,6 +2481,10 @@ static bool should_skip_rmap_item(struct folio *folio,
> if (folio_test_ksm(folio))
> return false;
>
> + /* There is no age information in stable-tree nodes. */

Maybe we can make it clearer here:

"There is no age information in stable-tree nodes. We might end up here without
a KSM page for example after COW."

> + if (rmap_item->address & STABLE_FLAG)
> + return false;
> +
> age = rmap_item->age;
> if (age != U8_MAX)
> rmap_item->age++;

I think with these things addressed, this LGTM.

--
Cheers,

David