Re: [RFC 1/8] sched: Add nice value change notifier

From: Barry Song
Date: Thu Oct 07 2021 - 06:00:38 EST


On Thu, Oct 7, 2021 at 10:09 PM Tvrtko Ursulin
<tvrtko.ursulin@xxxxxxxxxxxxxxx> wrote:
>
>
> On 07/10/2021 09:50, Tvrtko Ursulin wrote:
> >
> > On 06/10/2021 21:21, Barry Song wrote:
> >> On Thu, Oct 7, 2021 at 2:44 AM Tvrtko Ursulin
> >> <tvrtko.ursulin@xxxxxxxxxxxxxxx> wrote:
> >>>
> >>>
> >>> Hi,
> >>>
> >>> On 06/10/2021 08:58, Barry Song wrote:
> >>>> On Wed, Oct 6, 2021 at 5:15 PM Wanghui (John)
> >>>> <john.wanghui@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> HI Tvrtko
> >>>>>
> >>>>> On 2021/10/4 22:36, Tvrtko Ursulin wrote:
> >>>>>> void set_user_nice(struct task_struct *p, long nice)
> >>>>>> {
> >>>>>> bool queued, running;
> >>>>>> - int old_prio;
> >>>>>> + int old_prio, ret;
> >>>>>> struct rq_flags rf;
> >>>>>> struct rq *rq;
> >>>>>>
> >>>>>> @@ -6915,6 +6947,9 @@ void set_user_nice(struct task_struct *p, long nice)
> >>>>>>
> >>>>>> out_unlock:
> >>>>>> task_rq_unlock(rq, p, &rf);
> >>>>>> +
> >>>>>> + ret = atomic_notifier_call_chain(&user_nice_notifier_list, nice, p);
> >>>>>> + WARN_ON_ONCE(ret != NOTIFY_DONE);
> >>>>>> }
> >>>>> How about adding a new "io_nice" to task_struct,and move the call
> >>>>> chain to
> >>>>> sched_setattr/getattr, there are two benefits:
> >>>>
> >>>> We already have an ionice for the block I/O scheduler. This new
> >>>> io_nice can hardly be generic to all I/O. It seems the patchset is
> >>>> trying to link a process's nice with the GPU's scheduler. To some
> >>>> extent that makes more sense than having a common ionice, because
> >>>> we have a lot of I/O devices in the system and we don't know which
> >>>> I/O the ionice of task_struct should be applied to.
> >>>>
> >>>> Maybe we could have an ionice dedicated to the GPU, just like the
> >>>> ionice for CFQ in the bio/request scheduler.
> >>>
> >>> Thought crossed my mind but I couldn't see the practicality of a 3rd
> >>> nice concept. I mean even to start with I struggle a bit with the
> >>> usefulness of existing ionice vs nice. Like coming up with practical
> >>> examples of usecases where it makes sense to decouple the two
> >>> priorities.
> >>>
> >>> From a different angle I did think inheriting CPU nice makes sense for
> >>> GPU workloads. This is because today, and more so in the future,
> >>> computations on a same data set do flow from one to the other.
> >>>
> >>> Like maybe a simple example of batch image processing where CPU decodes,
> >>> GPU does a transform and then CPU encodes. Or a different mix, doesn't
> >>> really matter, since the main point it is one computing pipeline from
> >>> users point of view.
> >>>
> >>
> >> I am on it. But I am also seeing two problems here:
> >> 1. nice is not global in Linux. For example, if you have two cgroups
> >> and cgroup A has more quota than cgroup B, tasks in B won't win even
> >> if they have a lower nice. cgroups do a proportional-weight,
> >> time-based division of CPU.
> >>
> >> 2. Historically, we had dynamic nice, which was adjusted based on the
> >> average sleep/running time. Right now we don't have dynamic nice, but
> >> virtual time still makes tasks which sleep more preempt other tasks
> >> with the same or even lower nice:
> >> virtual time += physical time / weight by nice
> >> So a static nice number doesn't always make sense for deciding
> >> preemption.
> >>
> >> So it seems your patch only works in some simple situations, for
> >> example with no cgroups and tasks that have similar sleep/running
> >> times.
> >
> > Yes, I broadly agree with your assessment. Although there are plans for
> > adding cgroup support to i915 scheduling, I doubt as fine grained
> > control and exact semantics as there are on the CPU side will happen.
> >
> > Mostly because the drive seems to be for more micro-controller managed
> > scheduling which adds further challenges in connecting the two sides
> > together.
> >
> > But when you say it is a problem, I would characterize it more a
> > weakness in terms of being only a subset of possible control. It is
> > still richer (better?) than what currently exists and as demonstrated
> > with benchmarks in my cover letter it can deliver improvements in user
> > experience. If in the mid term future we can extend it with cgroup
> > support then the concept should still apply and get closer to how you
> > described nice works in the CPU world.
> >
> > Main question in my mind is whether the idea of adding the
> > sched_attr/priority notifier to the kernel can be justified. Because as
> > mentioned before, everything apart from adjusting currently running GPU
> > jobs could be done purely in userspace. Stack changes would be quite
> > extensive and all, but that is not usually a good enough reason to put
> > something in the kernel. That's why it is an RFC, an invitation to discuss.
> >
> > Even ionice inherits from nice (see task_nice_ioprio()) so I think
> > argument can be made for drivers as well.
>
> Now that I wrote this, I had a little bit of a light bulb moment. If I
> abandon the idea of adjusting the priority of already submitted work
> items, then I can do much of what I want purely from within the confines
> of i915.
>
> I simply add code to inherit from current task nice on every new work
> item submission. This should probably bring the majority of the benefit
> I measured.

I think the idea of linking a process's priority to the GPU's scheduler
makes sense; I have no doubt about that.
My question is more about the best way to implement it.

Android has a bg_non_interactive cgroup with a much lower weight for
background processes. Interactive tasks, on the other hand, are placed
in another cgroup with a much higher weight. So Android depends on
cgroups to improve the user experience.

The Chrome browser in your cover letter uses nice to de-prioritise
background tabs. This works well as long as the whole of Chrome is in
the same cgroup, since changing nice then increases or decreases the
share of resources a task gets within that cgroup. But once we have two
cgroups, taking a nice value that is only meaningful inside its own
cgroup and applying it to the GPU's global scheduler somewhat defeats
the purpose.

For example, if we have two cgroups, A and B:
/sys/fs/cgroup/cpu$ sudo sh -c 'echo 4096 > A/cpu.shares'
/sys/fs/cgroup/cpu$ sudo sh -c 'echo 512 > B/cpu.shares'

A task in B with a lower nice will get more GPU than a task in A, even
though group A has 8x the weight of B. So the result seems wrong,
especially since real users like Android do depend on cgroups.
I don't know how to overcome this "weakness"; it does not seem easy.
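
To put rough numbers on the mismatch, here is a small userspace sketch
(not kernel code). It assumes one always-runnable task per group, A and
B as the only competing groups, and a purely hypothetical GPU policy
that compares nothing but nice values. Both helper names are made up
for illustration.

```c
/*
 * CPU share of one group, in permille, when groups are weighted by
 * cpu.shares and every group always has something runnable.
 * Simplified model: integer division, result is truncated.
 */
unsigned int cpu_share_permille(unsigned int shares, unsigned int total_shares)
{
	return shares * 1000u / total_shares;
}

/*
 * Hypothetical nice-only policy: returns 1 if a task with nice_a would
 * be preferred over a task with nice_b. The cgroup the task lives in
 * plays no part in this decision, which is exactly the problem.
 */
int nice_only_prefers(int nice_a, int nice_b)
{
	return nice_a < nice_b;
}
```

With the values above, cpu_share_permille(4096, 4096 + 512) is 888 and
cpu_share_permille(512, 4096 + 512) is 111, so on the CPU side A gets
roughly 89% and B roughly 11%. If the task in B runs at nice -5 and the
task in A at nice 0, nice_only_prefers(-5, 0) is 1, so under this model
the GPU would favour B, the opposite of what the cgroup weights say.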

>
> Regards,
>
> Tvrtko

Thanks
barry