Re: [PATCH bpf-next 1/6] bpf: introduce BPF_PROG_TYPE_FILE_FILTER

From: Alexei Starovoitov
Date: Fri Oct 05 2018 - 18:05:26 EST


On Fri, Oct 05, 2018 at 05:46:59AM +0100, Al Viro wrote:
> On Wed, Oct 03, 2018 at 07:57:45PM -0700, Alexei Starovoitov wrote:
>
> > @@ -15,6 +15,7 @@
> > #include <linux/bpf.h>
> > #include <linux/bpf-cgroup.h>
> > #include <net/sock.h>
> > +#include <../fs/mount.h>
>
> No.

I've considered splitting cgroup_file_filter_ctx_access() into
separate .c inside fs/ directory, but it felt odd to move just that
function whereas the rest of the bpf bits are in kernel/bpf/
only to avoid doing this "../fs/" hack.
How about calling this new file fs/bpf_file_filter.c ?

> > + struct file *file = NULL;
> > + struct inode *inode;
> > + struct super_block *sb;
> > + struct mount *mnt;
>
> Fuck, no.
>
> > + case offsetof(struct bpf_file_info, mnt_id):
> > + /* dst = real_mount(file->f_path.mnt)->mnt_id */
> > + mnt = real_mount(LD_1(file->f_path.mnt));
> > + LD_n(mnt->mnt_id);
>
> NAK. Anything in struct mount is private to just a couple of
> files in fs/*.c. Don't do that. And keep in mind that internal
> details can and will be changed at zero notice, so be careful
> with adding such stuff.

yes. The internal details of file and inode structs can and will change.
Just like all the sk_buff internals that are exposed to bpf
via the same context rewriting mechanism.
See commit 9bac3d6d548e ("bpf: allow extended BPF programs access skb fields")
that exposed first 4 fields out of sk_buff to bpf progs via
'struct __sk_buff'.
Other fields were exposed later. Some of them are not even from sk_buff.
For networking programs 'struct __sk_buff' is the context.
For this new file_filter programs 'struct bpf_file_info' is the context.
That's the only thing bpf side can see.

> Another problem is your direct poking in ->i_ino. It's not
> something directly exposed to userland at the moment and it should
> not become such.

The patch is not making i_ino directly exposed.
Only 'struct bpf_file_info' is exposed to user space / bpf programs.

In 'struct bpf_file_info' the field is called '__u64 inode'.
That is the abi.
The kernel side stays with 'unsigned long i_ino;' field.
Its size is different on 32/64 architectures, but to bpf progs
it's always seen as '__u64 inode'.
If in the future the kernel decides to change 'i_ino' to be u64,
nothing will change on bpf side.

> Filesystem has every right to have ->getattr()
> set ->ino (== ->st_ino value) in whichever way it likes;

Essentially what cgroup_file_filter_ctx_access() is doing
is the same as generic_fillattr() + cp_statx() work, but via
on the fly rewriting of bpf instructions during
loading of bpf program.

See my other reply to Andy where I argued that certain
fields like uid/gid probably don't make sense to expose
to bpf via ctx rewriter mechanism and instead they can be
done via new bpf_get_file_statx() helper.
That's future work.

> the same
> goes for ->dev.

Notice how single kernel field file->f_inode->i_sb->s_dev
is exposed to bpf via two fields dev_major and dev_minor
inside 'struct bpf_file_info'.
For dev_minor the ctx rewriting is done as:
+ case offsetof(struct bpf_file_info, dev_minor):
+ /* dst = file->f_inode->i_sb->s_dev */
+ inode = LD_1(file->f_inode);
+ sb = LD_n(inode->i_sb);
+ LD_n(sb->s_dev);
+ *insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, MINORMASK);
+ break;

If the last AND instruction was not there, the whole 's_dev' field
would have been exposed to bpf side and that would be questionable
design choice, since s_dev has kernel internal encoding of major/minor.
Instead the AND instruction is masking minor bits.
So above ctx rewriting for 'dev_minor' field is equivalent
to line
stat->dev = inode->i_sb->s_dev;
from generic_fillattr()
plus another line
tmp.stx_dev_minor = MINOR(stat->dev);
from cp_statx()

This way the kernel internal field 's_dev' is split into
two dev_major/dev_minor bpf visible fields.

The end result is that no new kernel information is exposed.
The same information is seen via statx.