Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting
From: Hugh Dickins
Date: Mon Dec 28 2020 - 23:36:43 EST
Got it at last, sorry it's taken so long.
On Tue, 29 Dec 2020, Kirill A. Shutemov wrote:
> On Tue, Dec 29, 2020 at 01:05:48AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Dec 28, 2020 at 10:47:36AM -0800, Linus Torvalds wrote:
> > > On Mon, Dec 28, 2020 at 4:53 AM Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote:
> > > >
> > > > So far I have only found one more pin leak and an always-true check. I
> > > > don't see how either can lead to a crash or corruption. I'll keep looking.
Those mods look good in themselves, but, as you expected,
made no difference to the corruption I was seeing.
> > >
> > > Well, I noticed that the nommu.c version of filemap_map_pages() needs
> > > fixing, but that's obviously not the case Hugh sees.
> > >
> > > No, I think the problem is the
> > >
> > > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > >
> > > at the end of filemap_map_pages().
> > >
> > > Why?
> > >
> > > Because we've been updating vmf->pte as we go along:
> > >
> > > vmf->pte += xas.xa_index - last_pgoff;
> > >
> > > and I think that by the time we get to that "pte_unmap_unlock()",
> > > vmf->pte potentially points to past the edge of the page directory.
> >
> > Well, if that's true we have a bigger problem: we set up a pte entry without
> > the relevant PTL.
> >
> > But I *think* we should be fine here: do_fault_around() limits start_pgoff
> > and end_pgoff to stay within the page table.
Yes, Linus's patch had made no difference,
the map_pages loop is safe in that respect.
> >
> > It made me look at the code around pte_unmap_unlock(), and I think that
> > the bug is that we have to reset vmf->address and NULLify vmf->pte once we
> > are done with faultaround:
> >
> > diff --git a/mm/memory.c b/mm/memory.c
>
> Ugh.. Wrong place. Need to sleep.
>
> I'll look into your idea tomorrow.
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 87671284de62..e4daab80ed81 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2987,6 +2987,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address,
> } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> rcu_read_unlock();
> + vmf->address = address;
> + vmf->pte = NULL;
> WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
>
> return ret;
> --
And that made no (noticeable) difference either. But at last
I realized that it's absolutely on the right track, but misses the
couple of early returns at the head of filemap_map_pages(), which
skip the vmf->address reset: add
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3025,14 +3025,12 @@ vm_fault_t filemap_map_pages(struct vm_f
 
 	rcu_read_lock();
 	head = first_map_page(vmf, &xas, end_pgoff);
-	if (!head) {
-		rcu_read_unlock();
-		return 0;
-	}
+	if (!head)
+		goto out;
 
 	if (filemap_map_pmd(vmf, head)) {
-		rcu_read_unlock();
-		return VM_FAULT_NOPAGE;
+		ret = VM_FAULT_NOPAGE;
+		goto out;
 	}
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
@@ -3066,9 +3064,9 @@ unlock:
 		put_page(head);
 	} while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+out:
 	rcu_read_unlock();
 	vmf->address = address;
-	vmf->pte = NULL;
 	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
 
 	return ret;
--
and then the corruption is fixed. It seems miraculous that the
machines even booted with that bad vmf->address going to __do_fault():
maybe that tells us what a good job map_pages does most of the time.
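
To make the failure mode concrete, here is a minimal userspace sketch of the
two control-flow shapes, not the kernel code itself. Everything named toy_*
or TOY_* is hypothetical, and the sketch assumes (as the patches above
suggest) that the caller has already moved vmf->address away from the
original fault address before calling in, and passes the original value as
the separate "address" argument.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL
#define TOY_NOPAGE    1	/* hypothetical stand-in for VM_FAULT_NOPAGE */

/* Hypothetical stand-in for struct vm_fault: only the field we need. */
struct toy_fault {
	unsigned long address;
};

/*
 * Early-return shape: only the path that runs the loop restores
 * vmf->address, so the two early exits leak the caller's modified value.
 */
int toy_map_pages_early_return(struct toy_fault *vmf, unsigned long address,
			       int npages, int pmd_mapped)
{
	int i;

	assert(vmf != NULL);
	if (npages == 0)
		return 0;		/* vmf->address not restored */
	if (pmd_mapped)
		return TOY_NOPAGE;	/* vmf->address not restored */

	for (i = 0; i < npages; i++)
		vmf->address += TOY_PAGE_SIZE;

	vmf->address = address;
	return 0;
}

/*
 * Single-exit shape: every path funnels through "out", so the restore
 * cannot be skipped by an early exit.
 */
int toy_map_pages_single_exit(struct toy_fault *vmf, unsigned long address,
			      int npages, int pmd_mapped)
{
	int ret = 0;
	int i;

	assert(vmf != NULL);
	if (npages == 0)
		goto out;
	if (pmd_mapped) {
		ret = TOY_NOPAGE;
		goto out;
	}

	for (i = 0; i < npages; i++)
		vmf->address += TOY_PAGE_SIZE;
out:
	vmf->address = address;
	return ret;
}
```

With vmf->address preset to some other value, the single-exit variant hands
back the original fault address on all three paths, while the early-return
variant leaves the preset value in place whenever it bails out before the
loop. That is the shape of the stale address described above, in miniature.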
You'll see I've tried removing the "vmf->pte = NULL;" there. I did
criticize earlier that vmf->pte was being left set, but was either
thinking back to some earlier era of mm/memory.c, or else confusing
with vmf->prealloc_pte, which is NULLed when consumed: I could not
find anywhere in mm/memory.c which now needs vmf->pte to be cleared,
and I seem to run fine without it (even on i386 HIGHPTE).
So, the mystery is solved; but I don't think any of these patches
should be applied. Without thinking through Linus's suggestions
re do_set_pte() in particular, I do think this map_pages interface
is too ugly, and has given us lots of trouble: please take your time
to go over it all again, and come up with a cleaner patch.
I've grown rather jaded, and am questioning the value of the rework:
I don't think I want to look at or test another for a week or so.
Hugh