Re: [GIT PULL] selinux/selinux-pr-20250323
From: Linus Torvalds
Date: Thu Mar 27 2025 - 15:46:19 EST
On Thu, 27 Mar 2025 at 12:16, Stephen Smalley
<stephen.smalley.work@xxxxxxxxx> wrote:
>
> Where could/would we cache that information so that it was accessible
> directly by the VFS layer?
So the VFS layer already does this for various other things. For this
case, the natural thing to do would be to add another IOP_xyzzy flag
in inode->i_opflags.
That's how we already say things like "this inode has no
filesystem-specific i_op->permission function" (IOP_FASTPERM), so that
we don't even have to follow the "inode->i_op->permission" pointer
chain to see a NULL pointer.
Yes, the VFS layer is *heavily* optimized like that. It literally does
that IOP_FASTPERM to avoid chasing two pointers - not even the call,
just the "don't even bother to follow pointers to see if it's NULL".
See do_inode_permission().
And we have 16 bits in that inode->i_opflags, and currently only use 7
of those bits. Adding one bit for a IOP_NO_SECURITY_LOOKUP kind of
logic (feel free to rename that - just throwing a random name out as a
suggestion) would be a complete no-brainer.
NOTE! The rule for the i_opflags accesses is that *reading* them is
done with no locking at all, but changing them takes the inode
spinlock (and we should technically probably use WRITE_ONCE() and
READ_ONCE(), but we don't).
And notice that the "no locking at all for reading" means that if you
*change* the bit - even though that involves locking - there may be
concurrent lookups in process that won't see the change, and would go
on as if the lookup still does not need any security layer call. No
serialization to readers at all (although you could wait for an RCU
period after changing if you really need to, and only use the bit in
the RCU lookup).
That should be perfectly fine - I really don't think serialization is
even needed. If somebody is changing the policy rules, any file
lookups *concurrent* to that change might not see the new rules, but
that's the same as if it happened before the change.
I just wanted to point out that the serialization is unbalanced: the
spinlock for changing the flag is literally just to make sure that two
bits being changed at the same time don't stomp on each other (because
it's a 16-bit non-atomic field, and we didn't want to use a "unsigned
long" and atomic bitops because the cache layout of the inode is also
a big issue).
And you can take the IOP_FASTPERM thing as an example of how to do
this: it is left clear initially, and what happens is that during the
permission lookup, if it *isn't* set, we'll follow those
inode->i_io->permission pointers, and notice that we should set it:
if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
if (likely(inode->i_op->permission))
return inode->i_op->permission(idmap, inode, mask);
/* This gets set once for the inode lifetime */
spin_lock(&inode->i_lock);
inode->i_opflags |= IOP_FASTPERM;
spin_unlock(&inode->i_lock);
}
and I think the security layer could take a very similar approach: not
setting that IOP_NO_SECURITY_LOOKUP initially, but *when* a
security_inode_permission() call is made with just MAY_NOT_BLOCK |
MAY_LOOKUP, and the security layer notices that "this inode has no
reason to care", it could set the bit so that *next* time around the
VFS layer won't bother to call into security_inode_permission()
unnecessarily.
Does that clarify?
Linus