Re: [GIT PULL] SELinux patches for v5.8
From: Casey Schaufler
Date: Wed Jun 03 2020 - 18:00:14 EST
On 6/3/2020 2:02 PM, Alexei Starovoitov wrote:
> On Wed, Jun 03, 2020 at 10:37:47AM -0700, Linus Torvalds wrote:
>> On Wed, Jun 3, 2020 at 10:20 AM Casey Schaufler <casey@xxxxxxxxxxxxxxxx> wrote:
>>> We could have inode->i_security be the blob, rather than a pointer to it.
>>> That will have its own performance issues.
>> It wouldn't actually really fix anything, because the inode is so big
>> and sparsely accessed that it doesn't even really help the cache
>> density issue. Yeah, it gets rid of the pointer access, but that's
>> pretty much it. The fact that we randomize the order means that we
>> can't even really try to aim for any cache density.
>>
>> And it would actually not be possible with the current layered
>> security model anyway, since those blob sizes are dynamic at runtime.
>>
>> If we had _only_ SELinux, we could perhaps have hidden the
>> sid/sclass/task_sid directly in the inode (it would be only slightly
>> larger than the pointer is, anyway), but even that ship sailed long
>> long ago due to the whole "no security person can ever agree with
>> another one on fundamentals".
> Also there is bpf_lsm now that we're going to run it in production,
> so performance is as important as ever.
> Traditional lsm-s have per-lsm per-inode blob.
> For bpf that doesn't work, since progs come and go at run-time and
> independent from each other.
The inode blob is for attributes associated with the filesystem object.
There are cred and task blobs for program information.
If you need separate per-task data you should put it there.
> So we need per-program per-inode blob.
> To maintain good performance we've proposed:
> @@ -740,6 +741,10 @@ struct inode {
> struct fsverity_info *i_verity_info;
> #endif
>
> +#ifdef CONFIG_BPF_SYSCALL
> + struct bpf_local_storage __rcu *inode_bpf_storage;
> +#endif
>
> https://patchwork.ozlabs.org/project/netdev/patch/20200526163336.63653-3-kpsingh@xxxxxxxxxxxx/
>
> but got pushback, so we're going to use lsm style for now:
> +static inline struct bpf_lsm_storage *bpf_inode(const struct inode *inode)
> +{
> + if (unlikely(!inode->i_security))
> + return NULL;
> +
> + return inode->i_security + bpf_lsm_blob_sizes.lbs_inode;
> +}
>
> which means extra kmalloc for every inode, extra pointer deref, global var
> access, and additional math just to get to 'inode_bpf_storage' pointer.
>
> We have similar pointer in 'struct sock' already:
> #ifdef CONFIG_BPF_SYSCALL
> struct bpf_sk_storage __rcu *sk_bpf_storage;
> #endif
> that is used by variety of networking bpf programs.
> The commit 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage")
> has benchmarking data for it:
> hash table with 8-byte key -> 152ns per bpf run
> sk_bpf_storage -> 66ns per bpf run
> Hashtable suppose to be O(1) with L1$ hit, but it turned out
> to be noticeably slower than sk_bpf_storage.
> We expect to see similar performance gains for inode_bpf_storage
> vs hashtable approach that people use now.
> Eventually we'll add task_bpf_storage as well.
> Right now every other bpf tracing script is using pid as a key
> in a separate hash table to store per-task data. For high frequency
> events that adds up. task_bpf_storage will accelerate that.
Why aren't you using a task blob? We have support for what
you need.
> Another way to look at it is shared inode->i_security across
> different inodes won't work for us. We need something really
> cheap like single 'inode_bpf_storage' pointer that is zero
> most of the time and for few inodes bpf progs will keep their
> scratch data in there.
> For now lsm style bpf_inode() approach is ok-ish.
> But we will come back when we collect perf numbers to justify
> why direct pointer in the 'struct inode' is a win.
It would be really helpful if instead of "the regular mechanism
is too slow, so we abandoned it" we could do "the regular mechanism
was to slow, so we made it better". I appreciate that every bit of
performance matters. That's true for all of the security modules,
which is why I object to special case tricks to boost the performance
of one module at the expense of everything else.