Re: [v2] fs/proc/task_mmu: Implement IOCTL for efficient page table scanning

From: Muhammad Usama Anjum
Date: Mon Jul 24 2023 - 11:22:29 EST


On 7/24/23 7:38 PM, Michał Mirosław wrote:
> On Mon, 24 Jul 2023 at 16:04, Muhammad Usama Anjum
> <usama.anjum@xxxxxxxxxxxxx> wrote:
>>
>> Fixed found bugs. Testing it further.
>>
>> - Split and backoff in case buffer full case as well
>> - Fix the wrong breaking of loop if page isn't interesting, skip intead
>> - Untag the address and save them into struct
>> - Round off the end address to next page
>>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx>
>> ---
>> fs/proc/task_mmu.c | 54 ++++++++++++++++++++++++++--------------------
>> 1 file changed, 31 insertions(+), 23 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index add21fdf3c9a..64b326d0ec6d 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -1985,18 +1989,19 @@ static int pagemap_scan_output(unsigned long
>> categories,
>> unsigned long n_pages, total_pages;
>> int ret = 0;
>>
>> + if (!p->vec_buf)
>> + return 0;
>> +
>> if (!pagemap_scan_is_interesting_page(categories, p)) {
>> *end = addr;
>> return 0;
>> }
>>
>> - if (!p->vec_buf)
>> - return 0;
>> -
>> categories &= p->arg.return_mask;
>
> This is wrong - is_interesting() check must happen before output as
> the `*end = addr` means the range should be skipped, but return 0
> requests continuing of the walk.
Will revert.

>
>> @@ -2044,7 +2050,7 @@ static int pagemap_scan_thp_entry(pmd_t *pmd,
>> unsigned long start,
>> * Break huge page into small pages if the WP operation
>> * need to be performed is on a portion of the huge page.
>> */
>> - if (end != start + HPAGE_SIZE) {
>> + if (end != start + HPAGE_SIZE || ret == -ENOSPC) {
>
> Why is it needed? If `end == start + HPAGE_SIZE` then we're handling a
> full hugepage anyway.
If we weren't able to add the complete thp in the output buffer and we need
to perform WP on the entire page, we should split and rollback. Otherwise
we'll WP the entire thp and we'll lose the state on the remaining THP which
wasn't added to output.

Lets say max=100
only 100 pages would be added to output
we need to split and rollback otherwise other 412 pages would get WP

>
>> @@ -2066,8 +2072,8 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd,
>> unsigned long start,
>> {
>> struct pagemap_scan_private *p = walk->private;
>> struct vm_area_struct *vma = walk->vma;
>> + unsigned long addr, categories, next;
>> pte_t *pte, *start_pte;
>> - unsigned long addr;
>> bool flush = false;
>> spinlock_t *ptl;
>> int ret;
>> @@ -2088,12 +2094,14 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd,
>> unsigned long start,
>> }
>>
>> for (addr = start; addr != end; pte++, addr += PAGE_SIZE) {
>> - unsigned long categories = p->cur_vma_category |
>> - pagemap_page_category(vma, addr, ptep_get(pte));
>> - unsigned long next = addr + PAGE_SIZE;
>> + categories = p->cur_vma_category |
>> + pagemap_page_category(vma, addr, ptep_get(pte));
>> + next = addr + PAGE_SIZE;
>
> Why moving the variable declarations out of the loop?
Saving spaces inside loop. What are pros of declation of variable in loop?

>
>>
>> ret = pagemap_scan_output(categories, p, addr, &next);
>> - if (next == addr)
>> + if (ret == 0 && next == addr)
>> + continue;
>> + else if (next == addr)
>> break;
>
> Ah, this indeed was a bug. Nit:
>
> if (next == addr) { if (!ret) continue; break; }
>
>> @@ -2204,8 +2212,6 @@ static const struct mm_walk_ops pagemap_scan_ops = {
>> static int pagemap_scan_get_args(struct pm_scan_arg *arg,
>> unsigned long uarg)
>> {
>> - unsigned long start, end, vec;
>> -
>> if (copy_from_user(arg, (void __user *)uarg, sizeof(*arg)))
>> return -EFAULT;
>>
>> @@ -2219,22 +2225,24 @@ static int pagemap_scan_get_args(struct pm_scan_arg
>> *arg,
>> arg->category_anyof_mask | arg->return_mask) & ~PM_SCAN_CATEGORIES)
>> return -EINVAL;
>>
>> - start = untagged_addr((unsigned long)arg->start);
>> - end = untagged_addr((unsigned long)arg->end);
>> - vec = untagged_addr((unsigned long)arg->vec);
>> + arg->start = untagged_addr((unsigned long)arg->start);
>> + arg->end = untagged_addr((unsigned long)arg->end);
>> + arg->vec = untagged_addr((unsigned long)arg->vec);
>
> BTW, We should we keep the tag in args writeback().
Sorry what?
After this function, the start, end and vec would be used. We need to make
sure that the address are untagged before that.

>
>> /* Validate memory pointers */
>> - if (!IS_ALIGNED(start, PAGE_SIZE))
>> + if (!IS_ALIGNED(arg->start, PAGE_SIZE))
>> return -EINVAL;
>> - if (!access_ok((void __user *)start, end - start))
>> + if (!access_ok((void __user *)arg->start, arg->end - arg->start))
>> return -EFAULT;
>> - if (!vec && arg->vec_len)
>> + if (!arg->vec && arg->vec_len)
>> return -EFAULT;
>> - if (vec && !access_ok((void __user *)vec,
>> + if (arg->vec && !access_ok((void __user *)arg->vec,
>> arg->vec_len * sizeof(struct page_region)))
>> return -EFAULT;
>>
>> /* Fixup default values */
>> + arg->end = (arg->end & ~PAGE_MASK) ?
>> + ((arg->end & PAGE_MASK) + PAGE_SIZE) : (arg->end);
>
> arg->end = ALIGN(arg->end, PAGE_SIZE);
>
>> if (!arg->max_pages)
>> arg->max_pages = ULONG_MAX;
>>
>
> Best Regards
> Michał Mirosław

--
BR,
Muhammad Usama Anjum