Re: [RFC v2 1/3] dma-buf: give each buffer a full-fledged inode
From: Chenbo Feng
Date: Mon Mar 25 2019 - 15:35:16 EST
On Sun, Mar 24, 2019 at 1:45 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi Sandeep,
>
> On Sun, Mar 24, 2019 at 10:56:33AM -0700, Sandeep Patil wrote:
> > On Fri, Mar 22, 2019 at 11:02:55AM -0400, Joel Fernandes wrote:
> > > On Thu, Mar 21, 2019 at 07:51:33PM -0700, Chenbo Feng wrote:
> > > > From: Greg Hackmann <ghackmann@xxxxxxxxxx>
> > > >
> > > > By traversing /proc/*/fd and /proc/*/map_files, processes with CAP_ADMIN
> > > > can get a lot of fine-grained data about how shmem buffers are shared
> > > > among processes. stat(2) on each entry gives the caller a unique
> > > > ID (st_ino), the buffer's size (st_size), and even the number of pages
> > > > currently charged to the buffer (st_blocks / 512).
> > > >
> > > > In contrast, all dma-bufs share the same anonymous inode. So while we
> > > > can count how many dma-buf fds or mappings a process has, we can't get
> > > > the size of the backing buffers or tell if two entries point to the same
> > > > dma-buf. On systems with debugfs, we can get a per-buffer breakdown of
> > > > size and reference count, but can't tell which processes are actually
> > > > holding the references to each buffer.
> > > >
> > > > Replace the singleton inode with full-fledged inodes allocated by
> > > > alloc_anon_inode(). This involves creating and mounting a
> > > > mini-pseudo-filesystem for dma-buf, following the example in fs/aio.c.
> > > >
> > > > Signed-off-by: Greg Hackmann <ghackmann@xxxxxxxxxx>
> > >
> > > I believe Greg's address needs to be updated on this patch otherwise the
> > > emails would just bounce, no? I removed it from the CC list. You can still
> > > keep the SOB I guess but remove it from the CC list when sending.
> > >
> > > Also about the minifs, just playing devil's advocate for why this is needed.
> > >
> > > Since you are already adding the size information to /proc/pid/fdinfo/<fd> ,
> > > can just that not be used to get the size of the buffer? What is the benefit
> > > of getting this from stat? The other way to get the size would be through
> > > another IOCTL and that can be used to return other unique-ness related metadata
> > > as well. Neither of these need creation of a dedicated inode per dmabuf.
> >
> > Can you give an example of "unique-ness related data" here? The inode seems
> > like the best fit cause its already unique, no?
>
> I was thinking dma_buf file pointer, but I agree we need the per-inode now (see below).
>
> > > Also what is the benefit of having st_blocks from stat? AFAIK, that is the
> > > same as the buffer's size which does not change for the lifetime of the
> > > buffer. In your patch you're doing this when 'struct file' is created which
> > > AIUI is what reflects in the st_blocks:
> > > + inode_set_bytes(inode, dmabuf->size);
> >
> > Can some of the use cases / data be trimmed down? I think so. For example, I
> > never understood what we do with map_files here (or why). It is perfectly
> > fine to just get the data from /proc/<pid>/fd and /proc/<pid>/maps. I guess
> > the map_files bit is for consistency?
>
> It just occured to me that since /proc/<pid/maps provides an inode number as
> one of the fields, so indeed an inode per buf is a very good idea for the
> user to distinguish buffers just by reading /proc/<pid/<maps> alone..
>
> I also, similar to you, don't think map_files is useful for this usecase.
> map_files are just symlinks named as memory ranges and pointing to a file. In
> this case the symlink will point to default name "dmabuf" that doesn't exist.
> If one does stat(2) on a map_file link, then it just returns the inode number
> of the symlink, not what the map_file is pointing to, which seems kind of
> useless.
>
I might be wrong but I don't think we did anything special for the
map_files in this patch. I think the confusion might be from commit
message where Greg mentioned the map_files to describe the behavior of
shmem buffer when comparing it with dma-buf. The file system
implementation and the file allocation action in this patch are just
some minimal effort to associate each dma_buf object with an inode and
properly populate the size information in the file object. And we
didn't use proc/pid/map_files at all in the android implementation
indeed.
>
> Which makes me think both maps and map_files can be made more useful if we can
> also make DMA_BUF_SET_NAME in the patch change the underlying dentry's name
> from the default "dmabuf" to "dmabuf:<name>" ?
>
> That would be useful because:
> 1. It should make /proc/pid/maps also have the name than always showing
> "dmabuf".
> 2. It should make map_files also point to the name of the buffer than just
> "dmabuf". Note that memfd_create(2) already takes a name and the maps_file
> for this points to the name of the buffer created and showing it in both maps
> and map_files.
>
> I think this also removes the need for DMA_BUF_GET_NAME ioctl since the
> dentry's name already has the information. I can try to look into that...
> BTW any case we should not need GET_NAME ioctl since fdinfo already has the
> name after SET_NAME is called. So let us drop that API?
>
> > May be, to make it generic, we make the tracking part optional somehow to
> > avoid the apparent wastage on other systems.
>
> Yes, that's also fine. But I think if we can bake tracking into existing
> mechanism and keep it always On, then that's also good for all other dmabuf
> users as well and simplifies the kernel configuration for vendors.
>
> > > I am not against adding of inode per buffer, but I think we should have this
> > > debate and make the right design choice here for what we really need.
> >
> > sure.
>
> Right, so just to summarize:
> - The intention here is /proc/<pid>/maps will give uniqueness (via the inode
> number) between different memory ranges. That I think is the main benefit
> of the patch.
> - stat gives the size of buffer as does fdinfo
> - fdinfo is useful to get the reference count of number of sharers of the
> buffer.
> - map_files is not that useful for this usecase but can be made useful if
> we can name the underlying file's dentry to something other than "dmabuf".
> - GET_NAME is not needed since fdinfo already has the SET_NAMEd name.
>
> Do you agree?
>
Thanks for summarize it, I will look into the GET_NAME/SET_NAME ioctl
to make it more useful as you suggested above. Also, I will try to add
some test to verify the behavior.
>
> Just to lay it out, there is a cost to unique inode. Each struct inode is 560
> bytes on mainline with x86_64_defconfig. With 1000 buffers, we're looking at
> ~ 0.5MB of allocation. However I think I am convinced we need to do it
> considering the advantages, and the size is trivial considering advantages.
> Arguably large number dmabuf allocations are more likely to succeed with
> devices with larger memory resources anyway :)
>
> It is good to have this discussion.
>
> thanks,
>
> - Joel
>
> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@xxxxxxxxxxxx
>