Re: [PATCH 1/4] mm: pagewalk: assert write mmap lock only for walking the user page tables

From: Muchun Song
Date: Sat Dec 02 2023 - 04:47:57 EST




> On Dec 2, 2023, at 17:25, Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
>
> 
>
> On 2023/12/2 16:08, Muchun Song wrote:
>>>> On Dec 1, 2023, at 19:09, Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
>>>
>>>
>>>
>>> On 2023/11/27 16:46, Muchun Song wrote:
>>>> Commit 8782fb61cc848 ("mm: pagewalk: Fix race between unmap and page walker")
>>>> introduced an assertion in walk_page_range_novma() to make sure all users
>>>> of the page table walker are safe. However, the race only exists when
>>>> walking the user page tables, and it makes no sense to hold a particular
>>>> user's mmap write lock against changes to the kernel page tables. So only
>>>> assert that at least the mmap read lock is held when walking the kernel
>>>> page tables. Users matching this case can then downgrade to an mmap read
>>>> lock to relieve the contention on the mmap lock of init_mm; this will help
>>>> hugetlb (which will only hold the mmap read lock) in the next patch.
>>>>
>>>> Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
>>>> ---
>>>> mm/pagewalk.c | 29 ++++++++++++++++++++++++++++-
>>>> 1 file changed, 28 insertions(+), 1 deletion(-)
>>>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>>>> index b7d7e4fcfad7a..f46c80b18ce4f 100644
>>>> --- a/mm/pagewalk.c
>>>> +++ b/mm/pagewalk.c
>>>> @@ -539,6 +539,11 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>>>> * not backed by VMAs. Because 'unusual' entries may be walked this function
>>>> * will also not lock the PTEs for the pte_entry() callback. This is useful for
>>>> * walking the kernel pages tables or page tables for firmware.
>>>> + *
>>>> + * Note: Be careful when walking the kernel page tables, the caller may
>>>> + * need to take other effective approaches (the mmap lock may be
>>>> + * insufficient) to prevent the intermediate kernel page tables belonging
>>>> + * to the specified address range from being freed (e.g. memory hot-remove).
>>>> */
>>>> int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>>>> unsigned long end, const struct mm_walk_ops *ops,
>>>> @@ -556,7 +561,29 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>>>> if (start >= end || !walk.mm)
>>>> return -EINVAL;
>>>> - mmap_assert_write_locked(walk.mm);
>>>> + /*
>>>> + * 1) For walking the user virtual address space:
>>>> + *
>>>> + * The mmap lock protects the page walker from changes to the page
>>>> + * tables during the walk. However, a read lock is insufficient to
>>>> + * protect those areas which don't have a VMA as munmap() detaches
>>>> + * the VMAs before downgrading to a read lock and actually tearing
>>>> + * down PTEs/page tables. In that case, the mmap write lock should
>>>> + * be held.
>>>> + *
>>>> + * 2) For walking the kernel virtual address space:
>>>> + *
>>>> + * The kernel intermediate page tables are usually not freed, so
>>>> + * the mmap read lock is sufficient. But there are some exceptions,
>>>> + * e.g. memory hot-remove, in which case the mmap lock is insufficient
>>>> + * to prevent the intermediate kernel page tables belonging to the
>>>> + * specified address range from being freed. The caller should take
>>>> + * other actions to prevent this race.
>>>> + */
>>>> + if (mm == &init_mm)
>>>> + mmap_assert_locked(walk.mm);
>>>> + else
>>>> + mmap_assert_write_locked(walk.mm);
>>>
>>> Maybe just use process_mm_walk_lock() and set the correct page_walk_lock in struct mm_walk_ops?
>> No. You also need to make sure the users do not pass the wrong
>> walk_lock, so you would also need to add something like the following:
>
> But all the other walk_page_*() variants have been converted, see commit
> 49b0638502da ("mm: enable page walking API to lock vmas during the walk");
> there's nothing special about this one, the callers must pass the right
> page_walk_lock in mm_walk_ops,

If you think this one is not special, why was it not converted by that commit at the time?
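
To be clear about the intent of this patch: a walker of the kernel page tables should only need something like the following (an illustrative sketch; start, end, ops and private stand for the caller's range, mm_walk_ops and private data):

	mmap_read_lock(&init_mm);
	/*
	 * The read lock is enough here as long as the caller also prevents
	 * the kernel page tables in [start, end) from being freed, e.g. no
	 * concurrent memory hot-remove of that range.
	 */
	walk_page_range_novma(&init_mm, start, end, &ops, NULL, private);
	mmap_read_unlock(&init_mm);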

>
>> 	if (mm == &init_mm)
>> 		VM_BUG_ON(walk_lock != PGWALK_RDLOCK);
>> 	else
>> 		VM_BUG_ON(walk_lock == PGWALK_RDLOCK);
>>
>> I do not think the code will be simpler.
>
> or add the above lock check into process_mm_walk_lock() too.

No, that is wrong. walk_page_range_novma() is special compared with the other walk_page_*() variants: the check above is only applicable to walk_page_range_novma(), not to the other variants, so it does not belong in process_mm_walk_lock().
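
Even if we reused the page_walk_lock mechanism here, walk_page_range_novma() would still need its own special case on top of it, roughly like this (an untested sketch, assuming process_mm_walk_lock() from commit 49b0638502da keeps its current semantics of asserting the corresponding mmap lock):

	/* The init_mm special case stays local to walk_page_range_novma(). */
	if (mm == &init_mm)
		VM_BUG_ON(ops->walk_lock != PGWALK_RDLOCK);
	else
		VM_BUG_ON(ops->walk_lock == PGWALK_RDLOCK);
	process_mm_walk_lock(walk.mm, ops->walk_lock);

which is no simpler than asserting the expected lock directly.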