Re: [RFC PATCH v2] sched_pair_cpu: Introduce scheduler task pairing system call

From: Peter Zijlstra
Date: Fri Jun 26 2020 - 12:01:07 EST


On Thu, Jun 25, 2020 at 10:56:35AM -0400, Mathieu Desnoyers wrote:
> ----- On Jun 24, 2020, at 3:50 PM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:

I'll try and read the earlier bit later, I can't think today.

> > That's exactly what that signal would do. It would send SIGIO when the
> > state changes.
> >
> > So you want to access CPU-n's data, you open that file, register a
> > signal and read its state; if offline, you're good, do the rseq. If it
> > suddenly decides to come back online, you're guaranteed that SIGIO
> > before it reaches userspace.
> >
> > The nice thing is that it's all R/O so available to normal users, you
> > don't have to write to the file.
>
> So let's say you have two threads trying to access (offline) CPU-n's data
> with that algorithm concurrently. How are they serialized with each other ?

Also implement F_SETLK or something :-)
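
To make the "or something" concrete, here is a minimal userspace sketch of
the exclusion part only. Everything in it is an assumption for illustration:
no per-CPU state file exists today, so a temporary file stands in for it, and
the helper names (try_claim, release_claim, demo_exclusion) are made up. It
also swaps in flock() rather than F_SETLK, because classic F_SETLK record
locks are owned by the process and so would not exclude two threads of the
same process, and because flock() takes an exclusive lock on a read-only
descriptor. The SIGIO registration is left out.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Try to become the only accessor of the (offline) CPU's data.
 * Returns 0 on success, -1 with errno == EWOULDBLOCK if another
 * open file description already holds the lock.
 */
static int try_claim(int fd)
{
	return flock(fd, LOCK_EX | LOCK_NB);
}

static int release_claim(int fd)
{
	return flock(fd, LOCK_UN);
}

/*
 * Demo with a temporary file standing in for the hypothetical
 * per-CPU state file. Each would-be accessor does its own open(),
 * because flock() locks belong to the open file description.
 * Returns 0 if exclusion behaved as expected, -1 otherwise.
 */
static int demo_exclusion(void)
{
	char path[] = "/tmp/cpu_state_XXXXXX";
	int tmp, a = -1, b = -1, ret = -1;

	tmp = mkstemp(path);
	if (tmp < 0)
		return -1;
	close(tmp);

	a = open(path, O_RDONLY);	/* read-only is enough for flock() */
	b = open(path, O_RDONLY);
	if (a < 0 || b < 0)
		goto out;

	if (try_claim(a) != 0)		/* first accessor wins */
		goto out;
	if (try_claim(b) == 0 || errno != EWOULDBLOCK)
		goto out;		/* second accessor must back off */
	if (release_claim(a) != 0)
		goto out;
	if (try_claim(b) != 0)		/* now the second may proceed */
		goto out;
	ret = 0;
out:
	if (a >= 0)
		close(a);
	if (b >= 0)
		close(b);
	unlink(path);
	return ret;
}
```

Two threads sharing one descriptor would not exclude each other this way;
each needs its own open() of the file.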

> >> We do not want to override the affinity restricted by cgroups because
> >> we don't want to hurt performance characteristics of another partition
> >> of the system.
> >>
> >> The sched_pair_cpu approach has the benefit of allowing us to touch
> >> per-cpu data of a given CPU without having to run on that CPU, which
> >> ensures that we do not thrash the cpu cache of cpus on which a thread
> >> is not allowed to run. It takes care of issues caused by both cgroup
> >> cpusets and cpu hotplug.
> >
> > But now I worry that your thing allows escaping the cgroup constraints,
> > you can perturb random CPUs you're not allowed on. That's a really bad
> > 'feature'.
> >
> > Offline cpus are okay, because you don't actually need to do anything as
> > long as they're offline, but restricted CPUs we really should not be
> > touching, not even a little.
>
> With sched_pair_cpu, the paired task never needs to run on the target CPU.
> The kworker thread runs on the target CPU in the same way other existing
> worker threads run today, e.g. the ones handling RCU callbacks. AFAIK the
> priority of those threads can be configured by a system administrator.

Ah, but the critical difference is that all those are only ever run if
the initial work was initialized on _that_ CPU to begin with. Consider
an isolated CPU that's spinning in userspace, it would _never_ get any
kthreads running.

Except now you can, and you even want this system call to be unpriv.

It utterly and completely wrecks NOHZ_FULL.

> Are there additional steps we should take to minimize the impact of this
> worker thread ? In the same way "no rcu callbacks" CPU can be configured
> at boot time, we could have "no sched pair cpu" configured at boot, which
> would prevent sched_pair_cpu system calls from targeting that CPU entirely,
> and not spawn any kworker on that cpu.

No, no, no! "at boot time" is an utter trainwreck. I've been trying to
get NOHZ_FULL runtime configurable. This means that your cpuset can
change at runtime and the CPU you thought you had now is a NOHZ_FULL CPU.

We must not allow pairs on it.

I'm thinking that the best option might be to treat CPUs outside of your
cpuset the same as offline CPUs. That more-or-less requires that tasks
outside of your cpuset partition don't have access to your shared
memory, but that isn't an entirely insane assumption.
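
As a rough illustration of that rule, the check could look something like
the sketch below. This is a userspace approximation under stated
assumptions, not what the kernel side would do: it uses the calling
thread's affinity mask from sched_getaffinity() as a stand-in for "your
cpuset" (the affinity mask can be narrower than the cpuset), and the names
classify_target, load_allowed and the enum are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* How a caller should treat the per-CPU data of a target CPU. */
enum target_class {
	TARGET_ALLOWED,		/* inside our mask: normal path */
	TARGET_AS_OFFLINE,	/* outside our mask: never run anything there */
};

/*
 * Classify a target CPU against the set of CPUs we are allowed on.
 * Anything outside the set (or out of range) is handled exactly like
 * an offline CPU: only exclusion is needed, no work is sent to it.
 */
static enum target_class classify_target(int cpu, const cpu_set_t *allowed)
{
	if (cpu >= 0 && cpu < CPU_SETSIZE && CPU_ISSET(cpu, allowed))
		return TARGET_ALLOWED;
	return TARGET_AS_OFFLINE;
}

/*
 * Assumption: the calling thread's affinity mask approximates the
 * cpuset. Returns 0 on success, -1 on error (see sched_getaffinity(2)).
 */
static int load_allowed(cpu_set_t *allowed)
{
	CPU_ZERO(allowed);
	return sched_getaffinity(0, sizeof(*allowed), allowed);
}
```

A caller would run load_allowed() once (and again whenever the mask may
have changed), then take the offline path for every CPU that
classify_target() reports as TARGET_AS_OFFLINE.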

If you want to share memory across cpuset partitions, you get to keep
the pieces.

And the nice thing about offline is that you don't actually need to run
anything. You only need some exclusion thing (and using a spin-loop on a
random other CPU for that is bloody insane).