Re: [GIT PULL] selinux/selinux-pr-20250323
From: Linus Torvalds
Date: Thu Mar 27 2025 - 14:16:31 EST
On Thu, 27 Mar 2025 at 09:55, Stephen Smalley
<stephen.smalley.work@xxxxxxxxx> wrote:
>
> If you have constructive suggestions (or patches!) to improve
> performance of LSM and/or SELinux, we'd be glad to take them. Or even
> helpful hints on how to best measure and see the same overheads you
> are seeing and where.
So measuring it is fairly easy. I have various loads I care about, but
the simplest one that I feel is actually a real load - rather than the
more artificial benchmarks I then use for verification when I make any
changes - is literally "do an allmodconfig kernel rebuild with no
changes".
That empty kernel rebuild approximates what I actually do for most
small pull requests when only a couple of files really get re-built. I
don't want to try to profile user space and the actual compiler overhead.
To see that load, first do this as root:
echo -1 > /proc/sys/kernel/perf_event_paranoid
so that you as a regular user can then do a kernel build and get good
kernel-level profiles (there are other ways you can do this: you can
obviously also do a full profile as root). Obviously you should *not*
do this on a machine with other users; the above basically says "let
anybody do profiling on the kernel".
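(The sysctl interface should be equivalent if you prefer it:
sysctl -w kernel.perf_event_paranoid=-1
and you can put it back to the distro default - usually 2 - when you're done.)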
Then I just do
make allmodconfig
make -j64
to prep the tree and cause everything to be built (well - normally I
obviously don't do that, because my kernel tree is always built
anyway, but I'm just trying to make it obvious how to reproduce it).
And then I just do
perf record -e cycles:pp make -j64 > ../makes
perf report --sort=symbol,dso
and press 'k' to just get the kernel side (again - there's little I
can do, or care, about the user space profiles).
The "--sort=symbol,dso" is because I don't care _which_ process it is
that does what, so I just want the output binned by kernel function.
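If you want the same thing without the interactive TUI, something like
perf report --sort=symbol,dso --stdio | grep '\[k\]'
should give you roughly the same kernel-only list in plain text (the
grep just keeps the symbols marked as kernel ones).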
Just on a very high level, this is what I get with the v6.14 tree
(cut-off at 0.25%, this is "out of total cost" for the load including
user space, so these top functions account for just over 9% of the
*total* cost of the benchmark):
1.26% [k] clear_page_rep
0.82% [k] avc_has_perm_noaudit
0.73% [k] link_path_walk
0.73% [k] terminate_walk
0.58% [k] __d_lookup_rcu
0.56% [k] step_into
0.52% [k] selinux_inode_permission
0.50% [k] memset_orig
0.49% [k] vfs_statx_path
0.47% [k] strncpy_from_user
0.47% [k] rep_movs_alternative
0.37% [k] vfs_statx
0.31% [k] __rcu_read_unlock
0.30% [k] btrfs_getattr
0.28% [k] inode_permission
0.26% [k] kmem_cache_free
0.26% [k] generic_permission
[...]
so the top thing is the page clearing (and you see other memcpy/memset
variations there too), but the #2 hit for the kernel profile is
selinux, which takes more time than the basic path walking.
And selinux_inode_permission() is rather high up there too, as you can
see. Together, those two functions are about 1.3% of the whole load.
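As a quick-and-dirty way to tally up the security-layer entries you can
grep the text output, something like
perf report --sort=symbol,dso --stdio | grep -Ei 'selinux|avc_|security_'
with the obvious caveat that it only catches hooks that weren't inlined
into their callers.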
Now, the above profile is just from *my* machine, and
microarchitecture will matter a *LOT*. So the details will depend
hugely on your hardware, but I've been doing kernel profiles for
decades, and the basics haven't really changed. Memory movement and
clearing is universally the biggest thing, and that's fine. It's
fundamental.
Also, when I do profiles I turn off the CPU mitigations, because again
depending on microarchitecture those can just swamp everything else,
and while they are a real overhead, from a performance standpoint I'm
hoping they are something that long-term is going to be mostly fixed
in hardware (apart from the basic Spectre-v1 branch speculation, which
is *not* turned off in my kernels, and which we've actually worked
fairly hard on making sure is handled efficiently).
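(If you want to do the same, booting with "mitigations=off" on the
kernel command line should turn off the runtime-selectable ones, and
you can check what you actually ended up with via
grep . /sys/devices/system/cpu/vulnerabilities/*
Needless to say, only do that on a dedicated test box.)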
Now, looking at instruction-level profiles is kind of iffy, and you
have to know your microarchitecture to really make sense of them. The
"cycles:pp" helps make profiles more relevant (and requires PEBS/IBS
or equivalent CPU support to work), but it won't replace "you have to
understand hardware".
You do want to look at instruction profiles at least a bit, partly
because inlining makes _not_ looking at them often kind of misleading.
The real cost may be in a function that was inlined.
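The instruction-level view comes from "perf annotate" on the same
perf.data - either hit 'a' on a symbol in the perf report TUI, or
something like
perf annotate --stdio avc_has_perm_noaudit
which should show per-instruction sample counts for that function, with
the usual skid and attribution caveats.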
Typically, once you look at instruction-level profiles, and understand
them, you'll see one of three issues:
- cache misses. This is typically the big obvious one.
And you'll see them both for I$ and D$. People will tell you that
I$ cache misses are in the noise, but people are wrong. It's simply
not true for the kernel or many other real benchmarks, and you'll
often see it as big hits at the beginnings of functions - or at the
return points of calls - where the instructions otherwise look very
benign.
- serialization. This shows up hugely on modern CPUs, so any memory
barriers etc (ie locked instructions on x86) will stand out.
- branch misprediction. This will typically show up in the profiles
not on the branch, but on the mispredicted _target_ of the branch, so
it can end up being a bit confusing. The CPU speculation mitigations
typically turn this issue up to 11 and add misprediction noise
absolutely everywhere, which is why turning those off is such a big
deal.
but in an OoO CPU all of the above will basically result in various
instruction profile "patterns", so you in general cannot really look
at individual instructions, and should use the above patterns to try
to figure out *why* the profile looks like it does.
It's not obvious, and the patterns will be different for different
microarchitectures. You can use fancier perf things to try to figure
out exactly what is going on, but you should always _start_ from the
"where are the costs" on a pure basic cycle basis. Only after that
does it make sense to say something like "Oh, this is expensive and
seems to be taking excessive cache misses, let's drill down into why".
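One way to do that drill-down is to just re-record the same load with a
miss event instead of cycles, e.g.
perf record -e cache-misses make -j64 > ../makes
perf report --sort=symbol,dso
and compare against the cycle profile: something that is hot in cycles
but invisible in the miss profile is likely one of the other two
patterns. Keep in mind that the generic "cache-misses" event maps to
different hardware counters on different microarchitectures, so treat
it as a hint rather than a measurement.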
Also, typically, code that has already been tweaked to death tends to
show fewer obvious peaks in the profile.
Because the obvious peaks have often been getting some attention. So
the profile ends up not showing a big red flag any more, because the
big issue has been fixed and now it's mostly a "it's called too much"
issue.
For the security layer, at least historically the big cache miss (on
this load) has been the inode->i_security access (not loading the
pointer itself, but the accesses following it), and the hash tables
for that AVC lookup.
And both have been improved upon, and I didn't try to analyze the above
profiles any closer when it comes to exactly what is going on, so take
that with the grain of salt it deserves. The exact details may have
changed, but as you can see, avc_has_perm_noaudit() really is very
much a top offender today.
And yes, the reason is that we have to call it a *lot* for any
filename lookups. Some of those security hooks get called for every
path component, others get called only for the final one.
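If you want to see just how often it gets called during this load, a
dynamic probe is the simple way (assuming kprobes are enabled, and the
function isn't inlined away - it isn't here, or it wouldn't show up as
its own symbol in the profile):
perf probe --add avc_has_perm_noaudit
perf stat -e probe:avc_has_perm_noaudit -a -- make -j64 > ../makes
perf probe --del avc_has_perm_noaudit
The exact count will obviously depend on your policy and on how much of
the tree actually gets looked at.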
The best fix would be to be able to cache the "this doesn't have any
extra security rules outside of the regular POSIX ones" and avoid
calling the hook entirely. That's what we've done for the ACL path,
and that has turned ACL costs into almost a non-issue.
Linus