Re: [PATCH v2 2/2] vfs: avoid duplicating creds in faccessat if possible

From: Mateusz Guzik
Date: Fri Jan 20 2023 - 20:19:02 EST


On 1/21/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> On 1/20/23, Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
>> On Mon, Jan 16, 2023 at 4:21 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>>>
>>> access(2) remains commonly used, for example on exec:
>>> access("/etc/ld.so.preload", R_OK)
>>>
>>> or when running gcc: strace -c gcc empty.c
>>> % time seconds usecs/call calls errors syscall
>>> ------ ----------- ----------- --------- --------- ----------------
>>> 0.00 0.000000 0 42 26 access
>>>
>>> It falls down to do_faccessat without the AT_EACCESS flag, which in turn
>>> results in allocation of new creds in order to modify fsuid/fsgid and
>>> caps. This is a very expensive process single-threaded and most notably
>>> multi-threaded, with numerous structures getting refed and unrefed on
>>> imminent new cred destruction.
>>>
>>> Turns out for typical consumers the resulting creds would be identical
>>> and this can be checked upfront, avoiding the hard work.
>>>
>>> An access benchmark plugged into will-it-scale running on Cascade Lake
>>> shows:
>>> test proc before after
>>> access1 1 1310582 2908735 (+121%) # distinct files
>>> access1 24 4716491 63822173 (+1353%) # distinct files
>>> access2 24 2378041 5370335 (+125%) # same file
>>
>> Out of curiosity, do you have any measurements of the impact this
>> patch has on the AT_EACCESS case when the creds do need to be
>> modified?
>>
>
> I could not be arsed to bench that. I'm not saying there is literally 0
> impact, but it should not be high and the massive win in the case I
> patched imho justifies it.
>
> Last week I got a private reply from Linus suggesting the new checks
> only happen once at commit_cred() time, which would mean there would be
> at most one extra branch for the case you are concerned with. However,
> this quickly turn out to be rather hairy as there are games being
> played for example in copy_creds() which assigns them *without* calling
> commit_creds(). I was not comfortable pre-computing without sorting out
> the mess first and he conceded the new branchfest is not necessarily a
> big deal.
>
> That said, if you want some performance recovered for this case, here
> is an easy one:
>
> static const struct cred *access_override_creds(void)
> [..]
> old_cred = override_creds(override_cred);
>
> /* override_cred() gets its own ref */
> put_cred(override_cred);
>
> As in the new creds get refed only to get unrefed immediately after.
> Whacking the 2 atomics should make up for it no problem on x86-64.
>
> Also see below.
>
>>> The above benchmarks are not integrated into will-it-scale, but can be
>>> found in a pull request:
>>> https://github.com/antonblanchard/will-it-scale/pull/36/files
>>>
>>> Signed-off-by: Mateusz Guzik <mjguzik@xxxxxxxxx>
>>>
>>> v2:
>>> - fix current->cred usage warn reported by the kernel test robot
>>> Link: https://lore.kernel.org/all/202301150709.9EC6UKBT-lkp@xxxxxxxxx/
>>> ---
>>> fs/open.c | 32 +++++++++++++++++++++++++++++++-
>>> 1 file changed, 31 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/open.c b/fs/open.c
>>> index 82c1a28b3308..3c068a38044c 100644
>>> --- a/fs/open.c
>>> +++ b/fs/open.c
>>> @@ -367,7 +367,37 @@ COMPAT_SYSCALL_DEFINE6(fallocate, int, fd, int,
>>> mode,
>>> compat_arg_u64_dual(offset
>>> * access() needs to use the real uid/gid, not the effective uid/gid.
>>> * We do this by temporarily clearing all FS-related capabilities and
>>> * switching the fsuid/fsgid around to the real ones.
>>> + *
>>> + * Creating new credentials is expensive, so we try to skip doing it,
>>> + * which we can if the result would match what we already got.
>>> */
>>> +static bool access_need_override_creds(int flags)
>>> +{
>>> + const struct cred *cred;
>>> +
>>> + if (flags & AT_EACCESS)
>>> + return false;
>>> +
>>> + cred = current_cred();
>>> + if (!uid_eq(cred->fsuid, cred->uid) ||
>>> + !gid_eq(cred->fsgid, cred->gid))
>>> + return true;
>>> +
>>> + if (!issecure(SECURE_NO_SETUID_FIXUP)) {
>>> + kuid_t root_uid = make_kuid(cred->user_ns, 0);
>>> + if (!uid_eq(cred->uid, root_uid)) {
>>> + if (!cap_isclear(cred->cap_effective))
>>> + return true;
>>> + } else {
>>> + if (!cap_isidentical(cred->cap_effective,
>>> + cred->cap_permitted))
>>> + return true;
>>> + }
>>> + }
>>> +
>>> + return false;
>>> +}
>>
>> I worry a little that with nothing connecting
>> access_need_override_creds() to access_override_creds() there is a bug
>> waiting to happen if/when only one of the functions is updated.
>>
>
> These funcs are literally next to each other, I don't think that is easy
> to miss. I concede a comment in access_override_creds to take a look at
> access_need_override_creds would not hurt, but I don't know if a resend
> to add it is justified.
>
>> Given the limited credential changes in access_override_creds(), I
>> wonder if a better solution would be to see if we could create a
>> light(er)weight prepare_creds()/override_creds() that would avoid some
>> of the prepare_creds() hotspots (I'm assuming that is where most of
>> the time is being spent). It's possible this could help improve the
>> performance of other, similar operations that need to modify task
>> creds for a brief, and synchronous, period of time.
>>
>
> So the fundamental problem here is that several refs need to be grabbed
> to make fully-fledged credentials. Then, as you are done, you have to
> undo them. This clearly sucks single-threaded. As other threads copying
> the same creds do the same atomics on the same vars, this sucks even
> more multithreaded.
>
> As a cop out one may notice several of these are probably always the
> same for creds derived from given base, so perhaps there can be an obj
> which wraps them and then you only have to ref *that* obj to implicitly
> hold on to the entire thing.
>
> As in this (and more):
> get_uid(new->user);
> get_user_ns(new->user_ns);
> get_group_info(new->group_info);
>
> would get replaced with:
> new->fancy_obj = getref_fancy_obj(new->fancy_obj);
> /* populate pointers here */
>
> ... conceptually at least.
>
> But even then, while less shafted both multi and single-threaded, there
> is still a bottleneck.
>
> For a Real Solution(tm) for a general case I think has to start with an
> observartion creds either persist for a long time *OR* keep getting
> recreated. This would suggest holding on to them and looking them up
> instead just allocating, but all this opens another can of worms and
> I don't believe is worth the effort at this stage. But maybe someone
> has a better idea.
>
> That said, for the case of access(), I had the following in mind but
> once more considered it not justified at this stage.
>
> pseudocode-wise:
> struct cred *prepare_shallow_creds(void)
> new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
> old = task->cred;
> memcpy(new, old, sizeof(struct cred));
>
> here new creds have all the same pointers as old, but the target objs
> are only kept alive by the old creds still refing them. So by API
> contract you are required to keep them around.
>
> after you temporarily assign them you call revert_shallow_creds():
> if (tempcred->usage == 1)
> /* nobody refed them, do the non_rcu check */
> ...
> else
> /* somebody grabbed them, legitimize creds by
> * grabbing the missing refs
> */
> get_uid(new->user);
> get_user_ns(new->user_ns);
> get_group_info(new->group_info);
> /* and so on */
>
> So this would shave work from the case you are concerned with probably
> for all calls.
>
> I do think this is an ok idea overall, but I felt like overengineering for
> the
> problem at hand *at the time*.
>
> For some context, I'm looking at performance of certain VFS stuff and
> there is some serious fish to fry in the fstat department. The patch I
> posted is definitely worthwhile perf-wise and easy enough to reason
> about that I did no expect much opposition to it. If anything I expected
> opposition to the idea outlined above.
>

So just in case, if ultimately the consensus is that shallow copy or similar
is the way to go, I can take a stab some time next week.

>> Have you done any profiling inside of access_override_creds() to see
>> where the most time is spent? Looking at the code I have some gut
>> feelings on the hotspots, but it would be good to see some proper data
>> before jumping to any conclusions.
>>
>
> See above. It's the atomics all the way down.
>
> --
> Mateusz Guzik <mjguzik gmail.com>
>


--
Mateusz Guzik <mjguzik gmail.com>