Re: [PATCH v3 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()

From: Qi Zheng
Date: Thu Sep 12 2024 - 05:29:03 EST


Hi Muchun,

On 2024/9/6 15:20, Muchun Song wrote:


On 2024/9/4 16:40, Qi Zheng wrote:
Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:

1) After acquiring PTL, only read-only operations are performed on the PTE
    page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
    that the PTE page will not be freed, and there is no need to worry
    about whether the pmd entry is modified.

2) After acquiring PTL, the pte or pmd entries may be modified. At this
    time, we need to ensure that the pmd entry has not been modified
    concurrently.

To more clearly distinguish between these two cases, this commit
introduces two new helper functions to replace pte_offset_map_nolock().
For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
to changing the name to pte_offset_map_rw_nolock(), it also outputs the
pmdval when successful. It is applicable to may-write cases, where
modifications to the page table may happen after the corresponding
spinlock is taken. But users must ensure that the page table is stable,
e.g. by checking pte_same() or by checking pmd_same() against the output
pmdval, before performing any write operations.

Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be read-only/read-write protected.

Subsequent commits will convert pte_offset_map_nolock() into the above
two functions one by one, and finally completely delete it.
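To illustrate the intended may-write pattern, a sketch (not part of the patch; the surrounding function and names are hypothetical, only the new API is from the patch):

```c
/*
 * Sketch of a may-write user of pte_offset_map_rw_nolock():
 * map the PTE page locklessly, then take the PTL and revalidate
 * against the output pmdval before modifying anything.
 */
static void example_write_user(struct mm_struct *mm, pmd_t *pmd,
			       unsigned long addr)
{
	spinlock_t *ptl;
	pmd_t pmdval;
	pte_t *pte;

	pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
	if (!pte)
		return;		/* no PTE table */

	spin_lock(ptl);
	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		/* The PTE page changed under us; bail out (or retry). */
		pte_unmap_unlock(pte, ptl);
		return;
	}

	/* The page table is stable: modifications are safe here. */

	pte_unmap_unlock(pte, ptl);
}
```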

Signed-off-by: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
---
  Documentation/mm/split_page_table_lock.rst |  7 +++
  include/linux/mm.h                         |  5 +++
  mm/pgtable-generic.c                       | 50 ++++++++++++++++++++++
  3 files changed, 62 insertions(+)

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index e4f6972eb6c04..08d0e706a32db 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
   - pte_offset_map_nolock()
      maps PTE, returns pointer to PTE with pointer to its PTE table
      lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_ro_nolock()
+    maps PTE, returns pointer to PTE with pointer to its PTE table
+    lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_rw_nolock()
+    maps PTE, returns pointer to PTE with pointer to its PTE table
+    lock (not taken) and the value of its pmd entry, or returns NULL
+    if no PTE table;
   - pte_offset_map()
      maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
   - pte_unmap()
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7c74a840249a..1fde9242231c9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3006,6 +3006,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
  pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
              unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
+                unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+                unsigned long addr, pmd_t *pmdvalp,
+                spinlock_t **ptlp);
  #define pte_unmap_unlock(pte, ptl)    do {        \
      spin_unlock(ptl);                \
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711ac..262b7065a5a2e 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
      return pte;
  }
+pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
+                unsigned long addr, spinlock_t **ptlp)
+{
+    pmd_t pmdval;
+    pte_t *pte;
+
+    pte = __pte_offset_map(pmd, addr, &pmdval);
+    if (likely(pte))
+        *ptlp = pte_lockptr(mm, &pmdval);
+    return pte;
+}
+
+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+                unsigned long addr, pmd_t *pmdvalp,
+                spinlock_t **ptlp)
+{
+    pmd_t pmdval;
+    pte_t *pte;
+
+    VM_WARN_ON_ONCE(!pmdvalp);
+    pte = __pte_offset_map(pmd, addr, &pmdval);
+    if (likely(pte))
+        *ptlp = pte_lockptr(mm, &pmdval);
+    *pmdvalp = pmdval;
+    return pte;
+}
+
  /*
   * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
   * __pte_offset_map_lock() below, is usually called with the pmd pointer for
@@ -356,6 +383,29 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
   * recheck *pmd once the lock is taken; in practice, no callsite needs that -
   * either the mmap_lock for write, or pte_same() check on contents, is enough.
   *
+ * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
+ * but when successful, it also outputs a pointer to the spinlock in ptlp - as
+ * pte_offset_map_lock() does, but in this case without locking it. This helps
+ * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
+ * act on a changed *pmd: pte_offset_map_ro_nolock() provides the correct spinlock
+ * pointer for the page table that it returns. Even after grabbing the spinlock,
+ * we might be looking either at a page table that is still mapped or one that
+ * was unmapped and is about to get freed. But for R/O access this is sufficient.
+ * So it is only applicable to read-only cases, where no modification of
+ * the page table is allowed even after the corresponding spinlock is
+ * taken.
+ *
+ * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
+ * pte_offset_map_ro_nolock(); but when successful, it also outputs the
+ * pmdval. It is applicable to may-write cases, where modifications to the
+ * page table may happen after the corresponding spinlock is taken. But
+ * users must ensure that the page table is stable, e.g. by checking
+ * pte_same() or by checking pmd_same() against the output pmdval, before
+ * performing any write operations.

Now we have two options to ensure the stability of the PTE for users of
pte_offset_map_rw_nolock(). To ease this operation, how about proposing
a new helper (or two: one for pmd_same(), another for pte_same()) like
pte_lock_stability() (I am not good at naming, maybe you can do better),
which helps users 1) hold the PTL, 2) check whether the PTE is stable,
and 3) return true if the PTE is stable, false otherwise?

I've been trying to do this over the past few days, but I found it is
not very convenient.

I introduced the following helpers:

#define __PTE_STABILITY(lock)						\
bool __pte_stability_##lock(pmd_t *pmd, pmd_t *orig_pmd, pte_t *pte,	\
			    pte_t *orig_pte, spinlock_t *ptlp)		\
{									\
	pte_spin_##lock(ptlp);						\
	if (orig_pte) {							\
		VM_WARN_ON_ONCE(pte_none(*orig_pte));			\
		return pte_same(*orig_pte, ptep_get(pte));		\
	}								\
	if (orig_pmd) {							\
		VM_WARN_ON_ONCE(pmd_none(*orig_pmd));			\
		return pmd_same(*orig_pmd, pmdp_get_lockless(pmd));	\
	}								\
	VM_WARN_ON_ONCE(1);						\
	return false;							\
}
__PTE_STABILITY(lock)
__PTE_STABILITY(lock_nested)

static inline bool pte_stability_lock(pmd_t *pmd, pmd_t *orig_pmd, pte_t *pte,
				      pte_t *orig_pte, spinlock_t *ptlp)
	__acquires(ptlp)
{
	return __pte_stability_lock(pmd, orig_pmd, pte, orig_pte, ptlp);
}

#ifdef CONFIG_SPLIT_PTE_PTLOCKS
static inline bool pte_stability_lock_nested(pmd_t *pmd, pmd_t *orig_pmd,
					     pte_t *pte, pte_t *orig_pte,
					     spinlock_t *ptlp)
	__acquires(ptlp)
{
	return __pte_stability_lock_nested(pmd, orig_pmd, pte, orig_pte, ptlp);
}

static inline void pte_stability_unlock_nested(spinlock_t *ptlp)
	__releases(ptlp)
{
	spin_unlock(ptlp);
}
#else
static inline bool pte_stability_lock_nested(pmd_t *pmd, pmd_t *orig_pmd,
					     pte_t *pte, pte_t *orig_pte,
					     spinlock_t *ptlp)
{
	return true;
}

static inline void pte_stability_unlock_nested(spinlock_t *ptlp)
{
}
#endif /* CONFIG_SPLIT_PTE_PTLOCKS */

and try to use them with pte_offset_map_rw_nolock() in the following
functions:

1. collapse_pte_mapped_thp
2. handle_pte_fault
3. map_pte
4. move_pages_pte
5. walk_pte_range

For 1, 2 and 3, the conversion is relatively simple. But 2 already does
a pte_same() check, so the helper does not reduce the amount of code
much there.

For 4, the pte_same() checks have already been done, and it is not
easy to convert double_pt_lock() to use pte_stability_lock{_nested}().

For 5, it calls spin_trylock(), so we would have to introduce yet
another pte_stability_trylock() helper for it, which feels unnecessary.
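For reference, such a helper would presumably look something like the
sketch below (hypothetical, never implemented), which also shows an
awkward point: the caller could not distinguish "PTL not taken" from
"PTE unstable" without a second output:

```c
/* Hypothetical sketch only; this helper was never implemented. */
static inline bool pte_stability_trylock(pmd_t *pmd, pmd_t *orig_pmd,
					 spinlock_t *ptlp)
{
	if (!spin_trylock(ptlp))
		return false;	/* indistinguishable from "unstable" */
	if (!pmd_same(*orig_pmd, pmdp_get_lockless(pmd))) {
		spin_unlock(ptlp);
		return false;
	}
	return true;
}
```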

There are not many places where pte_offset_map_rw_nolock() is called,
and some of them already do pte_same() checks, so perhaps open-coding is
enough and there is no need to introduce more helper functions.
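To make the trade-off concrete, a conversion would only change each
callsite roughly as follows (both fragments abbreviated and
hypothetical, for the pmd_same() case):

```c
/* Open-coded (current form): */
	spin_lock(ptl);
	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
		goto abort;

/* With the proposed helper: */
	if (!pte_stability_lock(pmd, &pmdval, pte, NULL, ptl))
		goto abort;
```

That is, it saves only about one line per caller.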

Thanks,
Qi


Thanks,
Muchun

+ *
+ * Note: "RO" / "RW" expresses the intended semantics, not that the *kmap* will
+ * be read-only/read-write protected.
+ *
   * Note that free_pgtables(), used after unmapping detached vmas, or when
   * exiting the whole mm, does not take page table lock before freeing a page
   * table, and may not use RCU at all: "outsiders" like khugepaged should avoid