Re: [GIT PULL] isolation: 1Hz residual tick offloading v3

From: Luiz Capitulino
Date: Tue Jan 16 2018 - 11:52:48 EST


On Tue, 16 Jan 2018 16:41:00 +0100
Frederic Weisbecker <frederic@xxxxxxxxxx> wrote:

> On Fri, Jan 12, 2018 at 02:18:13PM -0500, Luiz Capitulino wrote:
> > On Thu, 4 Jan 2018 05:25:32 +0100
> > Frederic Weisbecker <frederic@xxxxxxxxxx> wrote:
> >
> > > Ingo,
> > >
> > > Please pull the sched/0hz branch that can be found at:
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > > sched/0hz
> > >
> > > HEAD: 9e932b2cc707209febd130978a5eb9f4a943a3f4
> > >
> > > --
> > > Now that scheduler_tick() has become resilient towards the absence of
> > > ticks, current->sched_class->task_tick() is the last piece that needs
> > > at least a 1Hz tick to keep scheduler stats alive.
> > >
> > > This patchset adds a flag to the isolcpus boot option to offload the
> > > residual 1Hz tick. This way the nohz_full CPUs no longer have any tick
> > > (assuming nothing else requires it), as their residual 1Hz tick is
> > > offloaded to the housekeepers.
> > >
> > > For quick testing, say on CPUs 1-7:
> > >
> > > "isolcpus=nohz_offload,domain,1-7"
> >
> > Sorry for being very late to this series, but I've a few comments to
> > make (one right now and others in individual patches).
> >
> > Why are we extending isolcpus= given that it's a deprecated interface?
> > Some people have already moved away from isolcpus= now, but with this
> > new feature they will be forced back to using it.
>
> I tried to remove isolcpus or at least change the way it works so that its
> effects are reversible (i.e. affine the init task instead of isolating domains),
> but that got nacked because of userspace's expectations about its behaviour.
>
> That's when I realized that kernel parameters are like userspace ABIs,
> they can't be removed easily whether we deprecate them or not.
>
> Also I needed to be able to control the various isolation features, and
> nohz_full is the wrong place to do that as nohz_full is really just an
> isolation feature like the others, nohz_full= should really just imply
> full dynticks and not watchdog, workqueue or tilegx NAPI isolation...

Yeah, I completely agree with that.

> So isolcpus= is now the place where we control the isolation features
> and nohz is one of them.

That's the part I'm not very sure about. We've been advising users to
move away from isolcpus= when possible, but this much-wanted nohz_offload
feature will force everyone back to using isolcpus= again.

I have the impression this series is trying to solve two problems:

1. How (and where) we control the various isolation features in the
kernel

2. Where we add the control for the tick offload feature

I think item 1 is too complex to solve right now. IMHO, this series
should focus on item 2. And regarding item 2, I think we have two
choices to make:

1. Make tick offload a first-class citizen by making it the default for
nohz_full=. If there are regressions, we handle them

2. Add a new option to nohz_full=, like nohz_full=tick_offload

As an avid user of nohz_full I'm dying to see option 1 happening,
but I'm not totally sure what the consequences can be.

Another idea is to add CONFIG_NOHZ_TICK_OFFLOAD as an experimental
feature.

> The complaint about isolcpus is that its result is immutable. I'm thinking about
> making it modifiable through cpuset, but I only see two possible solutions:
>
> - Make the root cpuset modifiable
> - Create a directory called "isolcpus" visible on the first cpuset mount
> and move all processes there.

So, if we move the control of the tick offload to nohz_full= itself,
we can completely ditch any isolcpus= change in this series.

I think this should give you a great relief :)

> > What about just adding the new functionality to nohz_full=? That is,
> > no new options, just make the tick go away since this has always been
> > what nohz_full= was intended to do?
>
> We can, or have isolcpus=nohz to do it, as both do almost the same.
>
> But I'm worried about the overhead for people used to nohz_full= once
> they upgrade their kernels and see those workqueues running once per second.
>
> We can still affine those workqueues (in fact the whole unbound workqueue
> mask) outside the nohz_full range. Still current users may be surprised
> about that new overhead on housekeeping CPUs...
>