Re: [PATCH 1/1] sched/fair: allow disabling newidle_balance with sched_relax_domain_level
From: Vincent Guittot
Date: Thu Mar 28 2024 - 13:39:00 EST
On Thu, 28 Mar 2024 at 18:10, Vitalii Bursov <vitaly@xxxxxxxxxx> wrote:
>
>
>
> On 28.03.24 18:48, Vincent Guittot wrote:
> > On Thu, 28 Mar 2024 at 17:27, Vitalii Bursov <vitaly@xxxxxxxxxx> wrote:
> >>
> >>
> >> On 28.03.24 16:43, Vincent Guittot wrote:
> >>> On Thu, 28 Mar 2024 at 01:31, Vitalii Bursov <vitaly@xxxxxxxxxx> wrote:
> >>>>
> >>>> Change relax_domain_level checks so that it would be possible
> >>>> to exclude all domains from newidle balancing.
> >>>>
> >>>> This matches the behavior described in the documentation:
> >>>>   -1 : no request. use system default or follow request of others.
> >>>>    0 : no search.
> >>>>    1 : search siblings (hyperthreads in a core).
> >>>>
> >>>> "2" enables levels 0 and 1, level_max excludes the last (level_max)
> >>>> level, and level_max+1 includes all levels.
> >>>
> >>> I was about to say that max+1 is useless because it's the same as -1
> >>> but it's not exactly the same because it can supersede the system wide
> >>> default_relax_domain_level. I wonder if one should be able to enable
> >>> more levels than what the system has set by default.
> >>
> >> I don't know if such systems exist, but cpusets.rst suggests that
> >> increasing it beyond the default value is possible:
> >>> If your situation is:
> >>>
> >>> - The migration costs between each cpu can be assumed considerably
> >>> small(for you) due to your special application's behavior or
> >>> special hardware support for CPU cache etc.
> >>> - The searching cost doesn't have impact(for you) or you can make
> >>> the searching cost enough small by managing cpuset to compact etc.
> >>> - The latency is required even it sacrifices cache hit rate etc.
> >>> then increasing 'sched_relax_domain_level' would benefit you.
> >
> > Fair enough. The doc should be updated as we can now clear the flags
> > but not set them
> >
>
> SD_BALANCE_NEWIDLE is always set by default in sd_init() and cleared
> in set_domain_attribute() depending on default_relax_domain_level
> ("relax_domain_level" kernel parameter) and cgroup configuration
> if it's present.
Yes, I meant that before commit
9ae7ab20b483 ("sched/topology: Don't set SD_BALANCE_WAKE on cpuset
domain relax"),
the SD_BALANCE_NEWIDLE and SD_BALANCE_WAKE flags could also be set
even when sd_init() had not set them.
>
> So, it should work both ways - clearing the flags when the relax level
> is decreased, and not clearing them when it is increased,
> shouldn't it?
>
> Also, after a closer look at set_domain_attribute(), it looks like
> default_relax_domain_level is -1 on all systems, so if cgroup does
> not set relax level, it won't clear any flags, which probably means
> that level_max+1 is redundant today.
Except if the boot parameter has set it to another level, which was my
point. Does it make sense to be able to set relax_domain_level to
level_max in one cgroup if we have "relax_domain_level=1" in the boot
params, for example? But this is out of the scope of this patch: it
already works for level_max-1, so why not for level_max?
So keep your change in update_relax_domain_level().
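
To make the cases discussed above concrete, here is a minimal user-space
sketch of how I read the flag handling. It is not the kernel code: the
model_* names and the MODEL_SD_* flag values are made up for illustration,
and the ">=" comparison is an assumption based on the behaviour described
in the changelog ("2" enables levels 0 and 1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bits only; the real values live in the kernel headers. */
#define MODEL_SD_BALANCE_NEWIDLE (1u << 0)
#define MODEL_SD_BALANCE_WAKE    (1u << 1)

/* Hypothetical stand-in for struct sched_domain. */
struct model_domain {
	int level;          /* 0 is the lowest level, e.g. SMT siblings */
	unsigned int flags;
};

/* Models sd_init(): SD_BALANCE_NEWIDLE starts out set at every level. */
static void model_sd_init(struct model_domain *sd, int level)
{
	assert(sd != NULL && level >= 0);
	sd->level = level;
	sd->flags = MODEL_SD_BALANCE_NEWIDLE;
}

/*
 * Models set_domain_attribute(). A negative cpuset_level means "no
 * request", so the boot-time default is used instead; if that is also
 * negative, the flags are left untouched.
 */
static void model_set_domain_attribute(struct model_domain *sd,
					int cpuset_level, int boot_default)
{
	int request = cpuset_level;

	if (request < 0) {
		if (boot_default < 0)
			return;
		request = boot_default;
	}

	/* Assumed comparison: domains at or above the request are relaxed. */
	if (sd->level >= request)
		sd->flags &= ~(MODEL_SD_BALANCE_WAKE | MODEL_SD_BALANCE_NEWIDLE);
}

/* Convenience wrapper: is newidle balancing still enabled at this level? */
static bool model_newidle_enabled(int level, int cpuset_level,
				  int boot_default)
{
	struct model_domain sd;

	model_sd_init(&sd, level);
	model_set_domain_attribute(&sd, cpuset_level, boot_default);
	return (sd.flags & MODEL_SD_BALANCE_NEWIDLE) != 0;
}
```

In this model, a cpuset request of 0 clears the flag on every level, a
request of level_max+1 clears nothing, and a non-negative cpuset request
takes precedence over a "relax_domain_level=1" boot default, which is the
case I was asking about.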
Thanks