Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

From: Hannes Frederic Sowa
Date: Mon Oct 19 2015 - 14:46:33 EST


Hi,

On Mon, Oct 19, 2015, at 20:15, Alexei Starovoitov wrote:
> On 10/19/15 10:37 AM, Daniel Borkmann wrote:
> > An eBPF program or map loading/destruction is *not* by any means to be
> > considered fast-path. We currently hold a global mutex during loading.
> > So, how can that be considered fast-path? Similarly, socket creation/
> > destruction is also not fast-path, etc. Do you expect that applications
> > would create/destroy these devices within milliseconds? I'd argue that
> > something would be seriously wrong with that application, then. Such
> > persistent maps are to be considered rather mid-long living objects in
> > the system. The fast-path surely is the data-path of them.
>
> consider seccomp that loads several programs for every tab, then
> container use case where we're loading two for each as well.
> Who knows what use cases will come up in the future.
> It's obviously not a fast path that is being hit million times a second,
> but why add overhead when it's unnecessary?

But the browser using seccomp does not need persistent maps at all. It
would merely create temporary maps without the overhead which are
automatically released when the program finished. In no way should they
be persistent but automatically garbage collected as soon as the
supervisor process and its children die.

> >> completely unnecessary here. The kernel will consume more memory for
> >> no real reason other than cdev are used to keep prog/maps around.
> >
> > I don't consider this a big issue, and well worth the trade-off. You'll
> > have an infrastructure that integrates *nicely* into the *existing* kernel
> > model *and* tooling with the proposed patch. This is a HUGE plus. The
> > UAPI of this is simple and minimal. And to me, these are in-fact special
> > files, not regular ones.
>
> Seriously? Syscall to create/destory cdevs is a nice api?

It is pretty normal to do. They don't use syscall but ioctl on a special
pseudo device node (lvm2/device-mapper, tun, etc.). Same thing, very
overloaded syscall. Also I don't see a reason to not make the creation
possible via sysfs, which would be even nicer but not orthogonal to the
current creation of maps, which is nowadays set in stone by uapi.

> Not by any means. We can argue in circles, but it doesn't fit.
> Using cdev to hold maps is a hack.
> Telling kernel that fake device is created only to hold a map?
> This fake device doesn't have any of the device properties.
> Look at fops->open. Surely it's a smart trick, but it's not behaving
> like device.
>
> uapi for fs adds only two commands, whereas cdev needs three.
>
> >> imo fs is cleaner and we can tailor it to be similar to cdev style.
> >
> > Really, IMHO I think this is over-designed, and much much more hacky. We
> > design a whole new file system that works *exactly* like cdevs, takes
> > likely more than twice the code and complexity to realize but just to
> > save a few bytes ...? I don't understand that.
>
> Let's argue with facts. Your fs patch 758 lines vs cdev 483.
> In fs the cost is single alloc_inode(), whereas in cdev the struct cdev
> and mutex will add to all maps and progs (even when they're not going
> to be pinned), plus kobj allocation, plus gigantic struct device, plus
> sysfs inodes, etc. Way more than few bytes of difference.

sysfs already has infrastructure to ensure supportability and
debugability. Dependency graphs can be created to see which programs
depend on which bpf_progs. kobjects are singeltons in sysfs
representation. If you have multiple ebpf filesystems, even maybe
referencing the same hashtable with gigabytes of data multiple times,
there needs to be some way to help administrators to check resource
usage, statistics, tweak and tune the rhashtable. All this needs to be
handled as well in the future. It doesn't really fit the filesystem
model, but representing a kobject seems to be a good fit to me.

Policy can be done by user space with help of udev. Selinux policies can
easily being extended to allow specific domains access. Namespace
"passthrough" is defined for devices already.

> where do you see 'over-design' in fs? It's a straightforward code and
> most of it is boilerplate like all other fs.
> Kill rmdir/mkdir ops, term bit and options and the fs diff will shrink
> by another 200+ lines.

I don't think that only a filesystem will do it in the foreseeable
future. You want to have tools like lsof reporting which map and which
program has references to each other. Those file nodes will need more
metadata attached in future anyway, so currently just comparing lines of
codes seems not to be a good way for arguing.

With filesystems suddenly a lot of other questions arise: selinux
support, acls etc.

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/