Re: [PATCH 2/2] mm/frame-vec: use FOLL_LONGTERM

From: Jason Gunthorpe
Date: Fri Oct 02 2020 - 19:31:22 EST


On Fri, Oct 02, 2020 at 08:16:48PM +0200, Daniel Vetter wrote:
> On Fri, Oct 2, 2020 at 8:06 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > On Fri, Oct 02, 2020 at 07:53:03PM +0200, Daniel Vetter wrote:
> > > For $reasons I've stumbled over this code and I'm not sure the change
> > > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector:
> > > convert get_user_pages() --> pin_user_pages()") was entirely correct.
> > >
> > > This here is used for long term buffers (not just quick I/O) like
> > > RDMA, and John notes this in his patch. But I thought the rule for
> > > these is that they need to add FOLL_LONGTERM, which John's patch
> > > didn't do.
> > >
> > > There is already a dax specific check (added in b7f0554a56f2 ("mm:
> > > fail get_vaddr_frames() for filesystem-dax mappings")), so this seems
> > > like the prudent thing to do.
> > >
> > > Signed-off-by: Daniel Vetter <daniel.vetter@xxxxxxxxx>
> > > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > > Cc: John Hubbard <jhubbard@xxxxxxxxxx>
> > > Cc: Jérôme Glisse <jglisse@xxxxxxxxxx>
> > > Cc: Jan Kara <jack@xxxxxxx>
> > > Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> > > Cc: linux-mm@xxxxxxxxx
> > > Cc: linux-arm-kernel@xxxxxxxxxxxxxxxxxxx
> > > Cc: linux-samsung-soc@xxxxxxxxxxxxxxx
> > > Cc: linux-media@xxxxxxxxxxxxxxx
> > > Hi all,
> > >
> > > I stumbled over this and figured typing this patch can't hurt. Really
> > > just to maybe learn a few things about how gup/pup is supposed to be
> > > used (we have a bit of that in drivers/gpu), this here isn't really
> > > ralated to anything I'm doing.
> >
> > FOLL_FORCE is a pretty big clue it should be FOLL_LONGTERM, IMHO
>
> Since you're here ... I've noticed that ib sets FOLL_FORCE when the ib
> verb access mode indicates possible writes. I'm not really clear on
> why FOLL_WRITE isn't enough any why you need to be able to write
> through a vma that's write protected currently.

Ah, FOLL_FORCE | FOLL_WRITE means *read* confusingly enough, and the
only reason you'd want this version for read is if you are doing
longterm stuff. I wrote about this recently:

https://lore.kernel.org/linux-mm/20200928235739.GU9916@xxxxxxxx/

> > Since every driver does this wrong anything that uses this is creating
> > terrifying security issues.
> >
> > IMHO this whole API should be deleted :(
>
> Yeah that part I just tried to conveniently ignore. I guess this dates
> back to a time when ioremaps where at best fixed, and there wasn't
> anything like a gpu driver dynamically managing vram around, resulting
> in random entirely unrelated things possibly being mapped to that set
> of pfns.

No, it was always wrong. Prior to GPU like cases the lifetime of the
PTE was tied to the vma and when the vma becomes free the driver can
move the things in the PTEs to 'free'. Easy to trigger use-after-free
issues and devices like RDMA have security contexts attached to these
PTEs so it becomes a serious security bug to do something like this.

> The underlying follow_pfn is also used in other places within
> drivers/media, so this doesn't seem to be an accident, but actually
> intentional.

Looking closely, there are very few users, most *seem* pointless, but
maybe there is a crazy reason?

The sequence
get_vaddr_frames();
frame_vector_to_pages();
sg_alloc_table_from_pages();

Should be written
pin_user_pages_fast(FOLL_LONGTERM);
sg_alloc_table_from_pages()

There is some 'special' code in frame_vector_to_pages() that tries to
get a struct page for things from a VM_IO or VM_PFNMAP...

Oh snap, that is *completely* broken! If the first VMA is IO|PFNMAP
then get_vaddr_frames() iterates over all VMAs in the range, of any
kind and extracts the PTEs then blindly references them! This means it
can be used to use after free normal RAM struct pages!! Gah!

Wow. Okay. That has to go.

So, I *think* we can assume there is no sane cases where
frame_vector_to_pages() succeeds but pin_user_pages() wasn't called.

That means the users here:
- habanalabs: Hey Oded can you fix this up?

- gpu/exynos: Daniel can you get someone there to stop using it?

- media/videobuf via vb2_dma_sg_get_userptr()

Should all be switched to the standard pin_user_pages sequence
above.

That leaves the only interesting places as vb2_dc_get_userptr() and
vb2_vmalloc_get_userptr() which both completely fail to follow the
REQUIRED behavior in the function's comment about checking PTEs. It
just DMA maps them. Badly broken.

Guessing this hackery is for some embedded P2P DMA transfer?

After he three places above should use pin_user_pages_fast(), then
this whole broken API should be moved into videobuf2-memops.c and a
big fat "THIS DOESN'T WORK" stuck on it.

videobuf2 should probably use P2P DMA buf for this instead.

Jason