Re: [PATCH] sched/pi: Reweight fair_policy() tasks when inheriting prio

From: Vincent Guittot
Date: Tue Apr 09 2024 - 08:35:50 EST


On Tue, 9 Apr 2024 at 08:19, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
>
> On 04/08/24 12:51, John Stultz wrote:
> > On Mon, Apr 8, 2024 at 12:17 AM Vincent Guittot
> > <vincent.guittot@xxxxxxxxxx> wrote:
> > >
> > > On Sun, 7 Apr 2024 at 14:27, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> > > >
> > > > On 04/05/24 18:16, Qais Yousef wrote:
> > > >
> > > > > >
> > > > > > All that to say that I think the weight is not applied on purpose.
> > > > > > This might work for your particular case but there are more changes to
> > > > > > be done if you want to apply prio inheritance between cfs tasks.
> > > > > >
> > > > > > As an example, what about the impact of cgroup on the actual weight
> > > > > > and the inherited priority of a task? If the owner and the waiter
> > > > > > don't belong to the same cgroup, their own prio is meaningless: a task
> > > > > > with nice -20 in a group with a weight equal to nice 19 vs a task with
> > > > > > nice 19 in a group with a weight equal to nice -20.
> > > > >
> > > > > That is on my mind actually. But I thought it's a separate problem. That has to
> > > > > do with how we calculate the effective priority of the pi_task. And probably
> > > > > the sorting order too, if we agree we need to revert the above. If that is done
> > > >
> > > > Thinking more about it the revert is not the right thing to do. We want fair
> > > > tasks to stay ordered in FIFO for better fairness and avoid potential
> > > > starvation issues. It's just that the logic for searching the top_waiter needs to be
> > > > different. If the top_waiter is fair, then we need to traverse the tree to find
> > > > the highest nice value. We probably can keep track of this while adding items
> > > > to the tree to avoid the search.
> > > >
> > > > For cgroups, is it reasonable (loosely speaking) to keep track of pi_cfs_rq and
> > > > detach_attach_task_cfs_rq() before the reweight? This seems the most
> > > > straightforward solution and will contain the complexity to keeping track of
> > > > cfs_rq. But it'll have a similar issue to proxy execution where a task that
> > > > doesn't belong to the cgroup will consume its share..
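
(Side note on the "keep track of this while adding items to the tree" idea
above: a very rough sketch, user-space only, with hypothetical names, no
locking, and assuming the cached "top" fair waiter is the one with the
lowest nice value, i.e. the largest weight. Removing the cached waiter
would still need one walk over the fair waiters to refill the cache, so
this only avoids the search on the enqueue path.)

struct rt_mutex_waiter;			/* opaque here, not the real struct */

struct fair_top_cache {
	struct rt_mutex_waiter *top;	/* fair waiter with the lowest nice */
	int top_nice;
};

/* hypothetical hook, called when a fair-class waiter is enqueued */
static inline void fair_top_add(struct fair_top_cache *c,
				struct rt_mutex_waiter *w, int nice)
{
	if (!c->top || nice < c->top_nice) {
		c->top = w;
		c->top_nice = nice;
	}
}
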
> > >
> > > That's a good point. Would proxy execution be the simplest way to fix all this?
>
> Is it? Over 4.5 years ago Unity reported a performance inversion problem to
> me, and that's when the proxy execution work was revived as the simplest way
> to fix all of this. But there is still no end in sight from what I see. I
> thought, and still think, that an interim solution in rt_mutex could already
> help a lot of use cases without being too complex. Not as elegant and
> comprehensive as proxy execution, but given the impact on userspace, and that
> out-of-tree kernel hacks keep growing while waiting for this to be ready, the
> cost of waiting is high IMHO.
>
> FWIW, I have already heard several reports that PTHREAD_PRIO_INHERIT does
> nothing. I think this reweight issue is the more serious problem and likely
> why I heard this feedback. I could be underestimating the complexity of the
> fix though. So I'll

Without cgroups, the solution could be straightforward, but android uses
cgroups extensively AFAICT, and update_cfs_group() makes it impossible to
track the top cfs waiter and its "prio".
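
(A rough back-of-the-envelope illustration of that point, plain user-space
arithmetic rather than kernel code, using the kernel's sched_prio_to_weight[]
values for nice -20 and nice 19, and assuming each task is alone in its
group and the two groups are siblings:)

#include <stdio.h>

int main(void)
{
	const double w_nice_m20 = 88761;	/* sched_prio_to_weight[0]  */
	const double w_nice_19  = 15;		/* sched_prio_to_weight[39] */

	/* nice -20 task, alone in a group whose weight corresponds to nice 19 */
	double share_a = w_nice_19 / (w_nice_19 + w_nice_m20);

	/* nice 19 task, alone in a group whose weight corresponds to nice -20 */
	double share_b = w_nice_m20 / (w_nice_19 + w_nice_m20);

	printf("nice -20 task gets %.4f%% of the CPU\n", 100 * share_a);
	printf("nice  19 task gets %.4f%% of the CPU\n", 100 * share_b);
	return 0;
}

The nice -20 task ends up with roughly 0.02% of the CPU and the nice 19
task with roughly 99.98%, so a task's own prio tells us nothing without
looking at the whole group hierarchy.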

> trust your judgement and ditch further effort if you think it's more effort
> than helping the proxy execution patchset - for the list at least. I'm still
> likely to pursue something out of tree to get it into as many android LTS
> kernels as possible. And I will be happy to share this work if there's a
> desire to pick something up for mainline to make the problem less severe at
> least.
>
> Note that binder already has performance, latency (out of tree) and priority
> inheritance, and without those performance is impacted in many corner cases;
> they are considered an indispensable part.
>
> >
> > So, at the moment, in part. It ought to resolve the issue for
> > in-kernel mutexes (blocked tasks stay on rq, if blocked tasks are
> > selected to run, we will instead run the runnable lock owner - thus it
> > works across scheduling classes), but it isn't tied into userland
> > futexes the way rt_mutexes are at this point.
> >
> > Review and feedback on the series would be greatly appreciated!
> > (Nudge! Nudge! :)
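
(For readers following the thread: a very loose sketch of the proxy-execution
idea John describes above, with hypothetical helper names rather than the
actual series' API; blocked tasks stay enqueued, and when the scheduler picks
one, the lock owner runs in its place on the picked task's scheduling
context:)

struct task_struct;			/* opaque here */

/* hypothetical helpers, not the real API of the series */
extern int task_is_blocked_on_mutex(struct task_struct *p);
extern struct task_struct *mutex_owner_of(struct task_struct *p);

/* 'selected' was chosen by pick_next and keeps donating its context */
static struct task_struct *proxy_resolve(struct task_struct *selected)
{
	struct task_struct *owner = selected;

	while (task_is_blocked_on_mutex(owner))
		owner = mutex_owner_of(owner);

	return owner;		/* this is what actually runs on the CPU */
}
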
>
> I am guilty of this, sorry.

me too