Re: [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE
From: Quentin Perret
Date: Mon Jun 21 2021 - 06:52:46 EST
Hi Qais,
Apologies for the delayed reply, I was away last week.
On Monday 14 Jun 2021 at 16:03:27 (+0100), Qais Yousef wrote:
> On 06/11/21 14:43, Quentin Perret wrote:
> > On Friday 11 Jun 2021 at 15:17:37 (+0100), Qais Yousef wrote:
> > > On 06/11/21 13:49, Quentin Perret wrote:
> > > > Thinking about it a bit more, a more involved option would be to have
> > > > this patch as is, but to also introduce a new RLIMIT_UCLAMP on top of
> > > > it. The semantics could be:
> > > >
> > > > - if the clamp requested by the non-privileged task is lower than its
> > > > existing clamp, then allow;
> > > > - otherwise, if the requested clamp is less than UCLAMP_RLIMIT, then
> > > > allow;
> > > > - otherwise, deny,
> > > >
> > > > And the same principle would apply to both uclamp.min and uclamp.max,
> > > > and UCLAMP_RLIMIT would default to 0.
> > > >
> > > > Thoughts?
> > >
> > > That could work. But then I'd prefer your patch to go as-is. I don't think
> > > uclamp can do with this extra complexity in using it.
> >
> > Sorry I'm not sure what you mean here?
>
> Hmm. I understood this as a new flag to sched_setattr() syscall first, but now
> I get it. You want to use getrlimit()/setrlimit()/prlimit() API to impose
> a restriction. My comment was in regard to this being a sys call extension,
> which it isn't. So please ignore it.
>
> >
> > > We basically want to specify we want to be paranoid about uclamp CAP or not. In
> > > my view that is simple and can't see why it would be a big deal to have
> > > a procfs entry to define the level of paranoia the system wants to impose. If
> > > it is a big deal though (would love to hear the arguments);
> >
> > Not saying it's a big deal, but I think there are a few arguments in
> > favor of using rlimit instead of a sysfs knob. It allows for a much
> > finer grain configuration -- constraints can be set per-task as well as
> > system wide if needed, and it is the standard way of limiting resources
> > that tasks can ask for.
>
> Is it system wide or per user?
Right, so calling this 'system-wide' is probably an abuse, but IIRC
rlimits are per-process, and are inherited accross fork/exec. So the
usual trick to have a default value is to set the rlimits on the init
task accordingly. Android for instance already does that for a few
things, and I would guess that systemd and friends have equivalents
(though admittedly I should check that).
> >
> > > requiring apps that
> > > want to self regulate to have CAP_SYS_NICE is better approach.
> >
> > Rlimit wouldn't require that though, which is also nice as CAP_SYS_NICE
> > grants you a lot more power than just clamps ...
>
> Now I better understand your suggestion. It seems a viable option I agree.
> I need to digest it more still though. The devil is in the details :)
>
> Shouldn't the default be RLIM_INIFINITY? ie: no limit?
I guess so yes.
> We will need to add two limit, RLIMIT_UCLAMP_MIN/MAX, right?
Not sure, but I was originally envisioning to have only one that applies
to both min and max. In which would we need separate ones?
> We have the following hierarchy now:
>
> 1. System Wide (/proc/sys/kerenl/sched_util_clamp_min/max)
> 2. Cgroup
> 3. Per-Task
>
> In that order of priority where 1 limits/overrides 2 and 3. And
> 2 limits/overrides 3.
>
> Where do you see the RLIMIT fit in this hierarchy? It should be between 2 and
> 3, right? Cgroup settings should still win even if the user/processes were
> limited?
Yes, the rlimit stuff would just apply the syscall interface.
> If the framework decided a user can't request any boost at all (can't increase
> its uclamp_min above 0). IIUC then setting the hard limit of RLIMIT_UCLAMP_MIN
> to 0 would achieve that, right?
Exactly.
> Since the framework and the task itself would go through the same
> sched_setattr() call, how would the framework circumvent this limit? IIUC it
> has to raise the RLIMIT_UCLAMP_MIN first then perform sched_setattr() to
> request the boost value, right? Would this overhead be acceptable? It looks
> considerable to me.
The framework needs to have CAP_SYS_NICE to change another process'
clamps, and generally rlimit checks don't apply to CAP_SYS_NICE-capable
processes -- see __sched_setscheduler(). So I think we should be fine.
IOW, rlimits are just constraining what unprivileged tasks are allowed
to request for themselves IIUC.
> Also, Will prlimit() allow you to go outside what was set for the user via
> setrlimit()? Reading the man pages it seems to override, so that should be
> fine.
IIRC rlimit are per-process properties, not per-user, so I think we
should be fine here as well?
> For 1 (System Wide) limits, sched_setattr() requests are accepted, but the
> effective uclamp is *capped by* the system wide limit.
>
> Were you thinking RLIMIT_UCLAMP* will behave similarly?
Nope, I was actually thinking of having the syscall return -EPERM in
this case, as we already do for nice values or RT priorities.
> If they do, we have
> consistent behavior with how the current system wide limits work; but this will
> break your use case because tasks can change the requested uclamp value for
> a task, albeit the effective value will be limited.
>
> RLIMIT_UCLAMP_MIN=512
> p->uclamp[UCLAMP_min] = 800 // this request is allowed but
> // Effective UCLAMP_MIN = 512
>
> If not, then
>
> RLIMIT_UCLAMP_MIN=no limit
> p->uclamp[UCLAMP_min] = 800 // task changed its uclamp_min to 800
> RLIMIT_UCLAMP_MIN=512 // limit was lowered for task/user
>
> what will happen to p->uclamp[UCLAMP_MIN] in this case? Will it be lowered to
> match the new limit? And this will be inconsistent with the current system wide
> limits we already have.
As per the above, if the syscall returns -EPERM we can leave the
integration with system-wide defaults and such untouched I think.
> Sorry too many questions. I was mainly thinking loudly. I need to spend more
> time to dig into the details of how RLIMITs are imposed to understand how this
> could be a good fit. I already see some friction points that needs more
> thinking.
No need to apologize, this would be a new userspace-visible interface,
so you're right that we need to think it through.
Thanks for the feedback,
Quentin