Re: [PATCH 2/2] mm/frame-vec: use FOLL_LONGTERM

From: Tomasz Figa
Date: Wed Oct 07 2020 - 09:06:38 EST


On Sat, Oct 3, 2020 at 1:31 AM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>
> On Fri, Oct 02, 2020 at 08:16:48PM +0200, Daniel Vetter wrote:
> > On Fri, Oct 2, 2020 at 8:06 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > > On Fri, Oct 02, 2020 at 07:53:03PM +0200, Daniel Vetter wrote:
> > > > For $reasons I've stumbled over this code and I'm not sure the change
> > > > to the new gup functions in 55a650c35fea ("mm/gup: frame_vector:
> > > > convert get_user_pages() --> pin_user_pages()") was entirely correct.
> > > >
> > > > This here is used for long term buffers (not just quick I/O) like
> > > > RDMA, and John notes this in his patch. But I thought the rule for
> > > > these is that they need to add FOLL_LONGTERM, which John's patch
> > > > didn't do.
> > > >
> > > > There is already a dax specific check (added in b7f0554a56f2 ("mm:
> > > > fail get_vaddr_frames() for filesystem-dax mappings")), so this seems
> > > > like the prudent thing to do.
> > > >
> > > > Signed-off-by: Daniel Vetter <daniel.vetter@xxxxxxxxx>
> > > > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > > > Cc: John Hubbard <jhubbard@xxxxxxxxxx>
> > > > Cc: Jérôme Glisse <jglisse@xxxxxxxxxx>
> > > > Cc: Jan Kara <jack@xxxxxxx>
> > > > Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> > > > Cc: linux-mm@xxxxxxxxx
> > > > Cc: linux-arm-kernel@xxxxxxxxxxxxxxxxxxx
> > > > Cc: linux-samsung-soc@xxxxxxxxxxxxxxx
> > > > Cc: linux-media@xxxxxxxxxxxxxxx
> > > > Hi all,
> > > >
> > > > I stumbled over this and figured typing this patch can't hurt. Really
> > > > just to maybe learn a few things about how gup/pup is supposed to be
> > > > used (we have a bit of that in drivers/gpu), this here isn't really
> > > > ralated to anything I'm doing.
> > >
> > > FOLL_FORCE is a pretty big clue it should be FOLL_LONGTERM, IMHO
> >
> > Since you're here ... I've noticed that ib sets FOLL_FORCE when the ib
> > verb access mode indicates possible writes. I'm not really clear on
> > why FOLL_WRITE isn't enough any why you need to be able to write
> > through a vma that's write protected currently.
>
> Ah, FOLL_FORCE | FOLL_WRITE means *read* confusingly enough, and the
> only reason you'd want this version for read is if you are doing
> longterm stuff. I wrote about this recently:
>
> https://lore.kernel.org/linux-mm/20200928235739.GU9916@xxxxxxxx/
>
> > > Since every driver does this wrong anything that uses this is creating
> > > terrifying security issues.
> > >
> > > IMHO this whole API should be deleted :(
> >
> > Yeah that part I just tried to conveniently ignore. I guess this dates
> > back to a time when ioremaps where at best fixed, and there wasn't
> > anything like a gpu driver dynamically managing vram around, resulting
> > in random entirely unrelated things possibly being mapped to that set
> > of pfns.
>
> No, it was always wrong. Prior to GPU like cases the lifetime of the
> PTE was tied to the vma and when the vma becomes free the driver can
> move the things in the PTEs to 'free'. Easy to trigger use-after-free
> issues and devices like RDMA have security contexts attached to these
> PTEs so it becomes a serious security bug to do something like this.
>
> > The underlying follow_pfn is also used in other places within
> > drivers/media, so this doesn't seem to be an accident, but actually
> > intentional.
>
> Looking closely, there are very few users, most *seem* pointless, but
> maybe there is a crazy reason?
>
> The sequence
> get_vaddr_frames();
> frame_vector_to_pages();
> sg_alloc_table_from_pages();
>
> Should be written
> pin_user_pages_fast(FOLL_LONGTERM);
> sg_alloc_table_from_pages()
>
> There is some 'special' code in frame_vector_to_pages() that tries to
> get a struct page for things from a VM_IO or VM_PFNMAP...
>
> Oh snap, that is *completely* broken! If the first VMA is IO|PFNMAP
> then get_vaddr_frames() iterates over all VMAs in the range, of any
> kind and extracts the PTEs then blindly references them! This means it
> can be used to use after free normal RAM struct pages!! Gah!
>
> Wow. Okay. That has to go.
>
> So, I *think* we can assume there is no sane cases where
> frame_vector_to_pages() succeeds but pin_user_pages() wasn't called.
>
> That means the users here:
> - habanalabs: Hey Oded can you fix this up?
>
> - gpu/exynos: Daniel can you get someone there to stop using it?
>
> - media/videobuf via vb2_dma_sg_get_userptr()
>
> Should all be switched to the standard pin_user_pages sequence
> above.
>
> That leaves the only interesting places as vb2_dc_get_userptr() and
> vb2_vmalloc_get_userptr() which both completely fail to follow the
> REQUIRED behavior in the function's comment about checking PTEs. It
> just DMA maps them. Badly broken.

Note that vb2_vmalloc is only used for in-kernel CPU usage, e.g. the
contents being copied by the driver between vb2 buffers and some
hardware FIFO or other dedicated buffers. The memory does not go to
any hardware DMA.

Could you elaborate on what "the REQUIRED behavior is"? I can see that
both follow the get_vaddr_frames() -> frame_vector_to_pages() flow, as
you mentioned. Perhaps the only change needed is switching to
pin_user_pages after all?

>
> Guessing this hackery is for some embedded P2P DMA transfer?
>

That's not the case. There are two usage scenarios:

1) Regular user pointer support, i.e. using malloc()ed memory as a
video input or output buffer. This is deemed to be deprecated, but
still used quite widely by various applications. Quite common for
video decoder input and video encoder output, because the other side
of the pipeline is CPU, e.g. network stack.

2) Sharing carveout buffers (without struct pages attached) between
two devices. I believe this predates DMA-buf and CMA (which enabled
carveouts with struct pages) and I'm not sure how commonly it is used
these days.

> After he three places above should use pin_user_pages_fast(), then
> this whole broken API should be moved into videobuf2-memops.c and a
> big fat "THIS DOESN'T WORK" stuck on it.
>
> videobuf2 should probably use P2P DMA buf for this instead.

See above.

Best regards,
Tomasz