Re: Requirements to control kernel isolation/nohz_full at runtime

From: Frederic Weisbecker
Date: Wed Sep 09 2020 - 18:34:09 EST

On Thu, Sep 03, 2020 at 03:23:59PM -0300, Marcelo Tosatti wrote:
> On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:
> > == Unbound affinity ==
> >
> > Restore kernel threads, workqueue, timers, etc... wide affinity. But take care of cpumasks that have been set through other
> > interfaces: sysfs, procfs, etc...
> We were looking at a userspace interface: what a proper one would look
> like (unified, similar to the isolcpus= interface) and how to implement it.
> The simplest idea for the interface seemed to be exposing the integer list
> of CPUs and isolation flags to userspace (probably via sysfs).
> The scheme would allow flags to be separately enabled/disabled, with
> not all flags necessarily being togglable (one could for example
> disallow nohz_full= toggling until it is implemented, but allow the
> other isolation features to be toggled).
> This would require per-flag housekeeping masks (instead of a single one).

Right, I think cpusets provide exactly that.
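
As a rough illustration of the per-flag housekeeping masks idea, here is a
userspace model (the flag names are modeled loosely on the kernel's
housekeeping code, but the per-flag array and the helper names are
assumptions, not the existing implementation):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-feature isolation flags, one housekeeping mask each. */
enum hk_flag {
	HK_FLAG_TIMER,
	HK_FLAG_RCU,
	HK_FLAG_WQ,
	HK_FLAG_TICK,
	HK_FLAG_MAX,
};

/* One housekeeping mask per flag instead of a single global mask;
 * bit n set means CPU n still does housekeeping for that feature. */
static unsigned long hk_mask[HK_FLAG_MAX] = { ~0UL, ~0UL, ~0UL, ~0UL };

static bool housekeeping_cpu(int cpu, enum hk_flag flag)
{
	return hk_mask[flag] & (1UL << cpu);
}

/* Userspace writes a new CPU mask for one flag; the others are untouched,
 * which is what lets features be toggled independently. */
static void housekeeping_update(enum hk_flag flag, unsigned long mask)
{
	hk_mask[flag] = mask;
}
```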

> Back to the userspace interface, you mentioned earlier that cpusets
> was a possibility for it. However:
> "Cpusets provide a Linux kernel mechanism to constrain which CPUs and
> Memory Nodes are used by a process or set of processes.
> The Linux kernel already has a pair of mechanisms to specify on which
> CPUs a task may be scheduled (sched_setaffinity) and on which Memory
> Nodes it may obtain memory (mbind, set_mempolicy).
> Cpusets extends these two mechanisms as follows:"
> The isolation flags do not necessarily have anything to do with
> tasks, but with CPUs: a given feature is disabled or enabled on a
> given CPU.
> No?

When cpusets are set as exclusive, they become strict CPU properties.
I think we'll need to enforce the exclusive property before the isolation
flags can be set on a cpuset.

Then you're free to move the tasks you like into any isolated cpusets.

> Regarding locking of the masks, since housekeeping_cpumask() can be
> called from hot paths (eg: get_nohz_timer_target()), RCU seems a natural
> fit, so userspace would:
> 1) use the interface to change the cpumask for a given feature:
> -> set_rcu_pointer
> -> wait for grace period

Yep, could be a solution.
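
To make the scheme concrete, here is a userspace model of the pointer-swap
approach using C11 atomics (in the kernel this would be rcu_assign_pointer()
plus synchronize_rcu(); the helper names here are invented for the sketch and
the grace-period wait is only stubbed):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* The housekeeping mask is reached through a pointer that hot-path
 * readers dereference and that writers replace atomically. */
struct hk_mask {
	unsigned long bits;
};

static _Atomic(struct hk_mask *) cur_mask;

/* Hot-path reader, e.g. what get_nohz_timer_target() would do under
 * rcu_read_lock(). */
static unsigned long hk_read(void)
{
	struct hk_mask *m = atomic_load(&cur_mask);

	return m ? m->bits : ~0UL;
}

/* Stand-in for synchronize_rcu(): wait until no reader can still hold
 * a reference to the old mask. */
static void wait_grace_period(void) { }

/* Writer: publish the new mask, wait out the grace period, then free
 * the old copy. */
static void hk_update(unsigned long bits)
{
	struct hk_mask *new = malloc(sizeof(*new));
	struct hk_mask *old;

	new->bits = bits;
	old = atomic_exchange(&cur_mask, new);	/* set_rcu_pointer step */
	wait_grace_period();			/* grace period step */
	free(old);
}
```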

> 2) proceed to trigger actions that rely on housekeeping_cpumask(),
> to validate that the cpumask set at 1) is being used.

Exactly. I guess we can simply call directly to subsystems (timers,
workqueue, kthreads, ...) from the isolation code upon cpumask update.
This way we avoid ordering surprises that would come with a notifier.
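
A minimal sketch of that direct-call approach (the per-subsystem hook names
are hypothetical; in the kernel they would be real entry points in the
timers, workqueue and kthread code):

```c
#include <assert.h>

/* Record the order in which subsystems see the update, to show the
 * deterministic ordering a notifier chain would not guarantee. */
static int update_log[4];
static int update_idx;

static void timers_mask_changed(void)    { update_log[update_idx++] = 1; }
static void workqueue_mask_changed(void) { update_log[update_idx++] = 2; }
static void kthreads_mask_changed(void)  { update_log[update_idx++] = 3; }

/* Isolation code calls each subsystem directly, in a known order,
 * instead of firing a notifier whose callback order is harder to
 * reason about. */
static void housekeeping_mask_changed(void)
{
	timers_mask_changed();
	workqueue_mask_changed();
	kthreads_mask_changed();
}
```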

> Regarding nohz_full=, a way to get an immediate implementation
> (without handling the issues you mention above) would be to boot
> with one set of CPUs marked "nohz_full togglable" and the others not.
> For the togglable ones, you'd introduce a per-CPU tick
> dependency that is enabled/disabled at runtime. Probably better
> to avoid this one if possible...

Right but you would still have all the overhead that comes with nohz full
(kernel entry/exit tracking, RCU userspace extended grace period, RCU callbacks
offloaded, vtime accounting, ...). It will become really interesting once we
can switch all that overhead off.
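
For what it's worth, the per-CPU tick dependency idea above could be modeled
like this (in the kernel this would map onto something like
tick_dep_set_cpu()/tick_dep_clear_cpu() with a dedicated dependency bit,
which does not exist today; this is purely an illustration of the toggle):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* While a CPU's dependency bit is set, it keeps its tick even though it
 * was booted "nohz_full togglable"; clearing the bit lets the tick stop
 * again. */
static bool tick_dep[NR_CPUS];

static void tick_dep_set(int cpu)   { tick_dep[cpu] = true; }
static void tick_dep_clear(int cpu) { tick_dep[cpu] = false; }

/* The tick can only be stopped on a CPU with no pending dependency. */
static bool can_stop_tick(int cpu)
{
	return !tick_dep[cpu];
}
```

Note that, as said above, even with the tick stopped all the other nohz_full
overhead (entry/exit tracking, RCU offloading, vtime accounting) would remain
until it too can be switched off at runtime.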