Re: [PATCH RESEND 2 1/1] sched/syscalls: Allow setting niceness using sched_param struct
From: Michael Pratt
Date: Tue Nov 12 2024 - 19:30:59 EST
Hi Steven, thanks for the reply,
On Tuesday, November 12th, 2024 at 10:34, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> >
> > Cc: stable@xxxxxxxxxxxxxxx # Apply to kernel/sched/core.c
>
>
> Why is stable Cc'd?
>
I believe this should be backported, if accepted,
so that the behavior between kernel versions is matching.
> > Signed-off-by: Michael C. Pratt mcpratt@xxxxx
> > ---
> > kernel/sched/syscalls.c | 21 +++++++++++++++++++--
> > 1 file changed, 19 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
> > index 24f9f90b6574..43eb283e6281 100644
> > --- a/kernel/sched/syscalls.c
> > +++ b/kernel/sched/syscalls.c
> > @@ -785,6 +785,19 @@ static int _sched_setscheduler(struct task_struct *p, int policy,
> > attr.sched_policy = policy;
> > }
> >
> > + if (attr.sched_priority > MAX_PRIO-1)
> > + return -EINVAL;
> > +
> > + /*
> > + * If priority is set for SCHED_NORMAL or SCHED_BATCH,
> > + * set the niceness instead, but only for user calls.
> > + */
> > + if (check && attr.sched_priority > MAX_RT_PRIO-1 &&
> > + ((policy != SETPARAM_POLICY && fair_policy(policy)) || fair_policy(p->policy))) {
> > + attr.sched_nice = PRIO_TO_NICE(attr.sched_priority);
> > + attr.sched_priority = 0;
> > + }
>
>
> This really looks like a hack.
If you have what you would consider to be a "non-hack" fix
in order for standard function posix_spawnattr_setschedparam()
to be usable instead of always fail for non-RT processes, let me know...
Looking at the larger picture, because Linux translates the sched_param struct
into the local sched_attr struct, and because of changes in history, there are several
caveats to be handled. This internal preparation function _sched_setscheduler()
is essentially only consisting of "hacks", except for the member copying between structs.
> Specifically that we are exposing how the
> kernel records priority to user space. That is the greater than
> MAX_RT_PRIO-1. 120 may be the priority the kernel sees on nice values, but
> it is not something that we should expose to user space system calls.
Is the default static priority value not already exposed, perhaps everywhere???
It's not a secret, but rather common knowledge that when observing a "niceness" of 0
or a "priority" of 20 as seen through common programs like "top", that
these values actually represent 120 as the real priority value, and that
the niceness value is a simple addition to the default for the final effective value.
Also, we have in newer kernel versions, maybe or maybe not dependent on configuration,
the procfs object called "sched" which already shows the actual value.
I can do:
$ cat /proc/$$/sched
and see the 120 without needing interpretation
due to it being represented in a different way.
> That said, you are worried about the race of spawning a new task and
> setting its nice value because the new task may have exited. What about
> using pidfd? Create a task returning the pidfd and use that to set its nice
> value.
I read a little about pidfd, but I'm not seeing the exact connection here,
perhaps it will reduce the race condition but it cannot eliminate it as far as I see.
For example, I am not finding a function that uses it to adjust niceness.
It's not that the "exit before modify" race condition is the only concern,
it's just one of the less obvious factors making up my rationale for this change.
I'm also concerned with efficiency. Why do we need to call another syscall
if the syscall we are already in can handle it?
Personally, I find it strange that in sched_setscheduler()
the policy can be changed but not the priority,
when there is a standardized function dedicated to just that.
The difference between RT and normal processes
is simply that for normal processes, we use "niceness" instead,
so this patch simply translates the priority to "niceness",
which cannot be expressed separately with the relevant POSIX functions.
--
MCP