Re: [RFC PATCH 0/1] sched/fair: Feature to suppress Fair Server for NOHZ_FULL isolation
From: Aaron Tomlin
Date: Wed Jan 07 2026 - 11:26:52 EST
On Wed, Jan 07, 2026 at 11:26:59AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 07, 2026 at 10:48:12AM +0100, Juri Lelli wrote:
> > Hello!
> >
> > On 06/01/26 09:49, Aaron Tomlin wrote:
> > > On Tue, Jan 06, 2026 at 02:37:49PM +0530, Shrikanth Hegde wrote:
> > > > If all your SCHED_FIFO is pinned and their scheduling decisions
> > > > are managed in userspace, using isolcpus would offer you better
> > > > isolation compared to nohz_full.
> > >
> > > Hi Shrikanth,
> > >
> > > You are entirely correct; isolcpus=domain (or isolcpus= without flags as
> > > per housekeeping_isolcpus_setup()) indeed offers superior isolation by
> > > removing the CPU from the scheduler load-balancing domains.
> > >
> > > I must apologise for the omission in my previous correspondence. I
> > > neglected to mention that our specific configuration utilises isolcpus= in
> > > conjunction with nohz_full=.
> > >
> > > > > However, the extant "Fair Server" (Deadline Server) architecture
> > > > > compromises this isolation guarantee. At present, should a background
> > > > > SCHED_OTHER task be enqueued, the scheduler initiates the Fair Server
> > > > > (dl_server_start). As the Fair Server functions as a SCHED_DEADLINE entity,
> > > > > its activation increments rq->dl.dl_nr_running.
> > > > >
> > > >
> > > > There is runtime allocated to fair server. If you make them 0 on CPUs of
> > > > interest, wouldn't that work?
> > > >
> > > > /sys/kernel/debug/sched/fair_server/<cpu>/runtime
> > >
> > > Yes, you are quite right; setting the fair server runtime to 0 (via
> > > /sys/kernel/debug/sched/fair_server/[cpu]/runtime) does indeed achieve the
> > > desired effect. In my testing, the SCHED_FIFO task on the fully
> > > adaptive-tick CPU remains uninterrupted by the restored clock-tick when
> > > this configuration is applied. Thank you.
> > >
> > > However, I believe it would be beneficial if the kernel could apply this
> > > behaviour automatically. While the manual runtime adjustment works,
> > > having the kernel detect the condition - an RT task running with
> > > bandwidth enforcement disabled - would provide a more seamless and
> > > robust solution for partitioned systems, without requiring external
> > > intervention. I may consider an improved version of the patch that
> > > includes a "Fair server disabled" warning, much like the one in
> > > sched_fair_server_write().
> >
> > I am not sure either we need/want the automatic mechanism, as we already
> > have the fair_server interface. I kind of think that if any (kthread
> > included) CFS task is enqueued on an "isolated" CPU the problem might
> > reside in sub-optimal isolation (usually a config issue or a kernel
> > issue that might need solving - e.g. a for_each_cpu loop that needs
> > changing). Starving such tasks might anyway end in a system crash of
> > sorts.
>
> We must not starve fair tasks -- this can severely affect the system
> health.
>
> Specifically per-cpu kthreads getting starved can cause complete system
> lockup when other CPUs go wait for completion and such.
>
> We must not disable the fair server, ever. Doing so means you get to
> keep the pieces.
>
> The only sane way is to ensure these tasks do not get queued in the
> first place.
Hi Shrikanth, Valentin, Juri, Daniel, Peter,
I fully appreciate your concerns regarding system health and the critical
nature of per-CPU kthreads. I agree that under standard operation,
disabling the Fair Server presents a significant risk of system lockup.
Your suggestion to ensure such tasks are prevented from being queued in the
first instance is an interesting proposition and certainly merits further
consideration - I will look into it.
However, I would respectfully submit that the kernel already affords users
the capability to manually zero the Fair Server runtime on a per-CPU basis
via /sys/kernel/debug/sched/fair_server/<cpu>/runtime. This establishes a
precedent wherein the user is permitted to assume full responsibility for
the scheduler's behaviour on specific cores.
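For reference, driving that per-CPU knob is a small loop. A minimal sketch,
assuming debugfs is mounted at /sys/kernel/debug; the helper name and the
CPU list are illustrative, not part of the kernel interface. It prints the
required writes rather than executing them, so it is safe to run anywhere;
pipe the output to `sh` as root to actually apply them:

```shell
#!/bin/sh
# Illustrative helper (hypothetical name): emit the debugfs writes that
# zero the Fair Server runtime on the given CPUs.
fair_server_zero_cmds() {
    for cpu in "$@"; do
        printf 'echo 0 > /sys/kernel/debug/sched/fair_server/cpu%s/runtime\n' "$cpu"
    done
}

# Example: CPUs 2 and 3 are the isolated (isolcpus=/nohz_full=) set.
fair_server_zero_cmds 2 3
```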
If my understanding is correct, should a user manually set that runtime to
zero today, per-CPU kthreads running as SCHED_NORMAL already lose the
bandwidth protection the Fair Server would otherwise provide while a
real-time task is executing. The risk you describe is therefore, I think,
already present for anyone who uses the debug interface.
The rationale behind introducing RT_SUPPRESS_FAIR_SERVER is to formalise
this behaviour for a specific, well-informed class of user (e.g., HFT or
HPC operators) who explicitly prioritise absolute determinism over general
system stability for a bounded period of time; we still retain the ability
to terminate or interrupt the real-time task via a signal (e.g., SIGINT).
As this scheduling feature is disabled by default, the user must actively
opt in, thereby signalling their willingness to "sacrifice" safety
guarantees and accept the potential consequences - or "keep the pieces,"
as it were.
I believe this approach provides a necessary tool for extreme
latency-sensitive partitions without compromising the safety of the
general-purpose kernel.
Kind regards,
--
Aaron Tomlin