On Tue, 2025-04-15 at 11:30 -0400, Waiman Long wrote:
> On 4/15/25 6:25 AM, Gabriele Monaco wrote:
> > The timer migration mechanism allows active CPUs to pull timers
> > from idle ones to improve the overall idle time. This is however
> > undesired when CPU intensive workloads run on isolated cores, as
> > the algorithm would move the timers from housekeeping to isolated
> > cores, negatively affecting the isolation.
> >
> > This effect was noticed on a 128 cores machine running oslat on
> > the isolated cores (1-31,33-63,65-95,97-127). The tool monopolises
> > CPUs, and the CPU with the lowest count in a timer migration
> > hierarchy (here 1 and 65) appears as always active and
> > continuously pulls global timers from the housekeeping CPUs. This
> > ends up moving driver work (e.g. delayed work) to isolated CPUs
> > and causes latency spikes:
> >
> > before the change:
> >
> >  # oslat -c 1-31,33-63,65-95,97-127 -D 62s
> >  ...
> >  Maximum: 1203 10 3 4 ... 5 (us)
> >
> > after the change:
> >
> >  # oslat -c 1-31,33-63,65-95,97-127 -D 62s
> >  ...
> >  Maximum: 10 4 3 4 3 ... 5 (us)
> >
> > Exclude isolated cores from the timer migration algorithm by
> > extending the concept of unavailable cores, currently used for
> > offline ones, to isolated ones:
> > * A core is unavailable if isolated or offline;
> > * A core is available if isolated and offline;
offline".
Right?
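
To restate the intended rule in code (an illustrative sketch only,
not from the patch; tmigr_cpu_participates() is a made-up name,
cpu_online() is from <linux/cpumask.h> and cpu_is_isolated() from
<linux/sched/isolation.h>):

	/*
	 * A CPU takes part in timer migration iff it is online and
	 * not isolated; cpu_is_isolated() covers "isolcpus",
	 * "nohz_full" and cpuset isolated partitions.
	 */
	static bool tmigr_cpu_participates(unsigned int cpu)
	{
		return cpu_online(cpu) && !cpu_is_isolated(cpu);
	}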

> > A core is considered unavailable as idle if:
> > * is in the isolcpus list
> > * is in the nohz_full list
> > * is in an isolated cpuset
>
> What do you mean by "unavailable as idle"? An idle CPU is different
> from an unavailable CPU, I think.

> > Due to how the timer migration algorithm works, any CPU that is
> > part of the hierarchy can have its global timers pulled by remote
> > CPUs and has to pull remote timers itself; skipping only the
> > pulling of remote timers would break the logic.
> > For this reason, we prevent isolated CPUs from pulling remote
> > global timers, but also the other way around: any global timer
> > started on an isolated CPU will run there. This does not break the
> > concept of isolation (global timers don't come from outside the
> > CPU) and, if considered inappropriate, can usually be mitigated
> > with other isolation techniques (e.g. IRQ pinning).
> >
> > diff --git a/kernel/time/timer_migration.c
> > b/kernel/time/timer_migration.c
> > index 1fae38fbac8c2..6fe6ca798e98d 100644
> > --- a/kernel/time/timer_migration.c
> > +++ b/kernel/time/timer_migration.c
> > @@ -10,6 +10,7 @@
> >  #include <linux/spinlock.h>
> >  #include <linux/timerqueue.h>
> >  #include <trace/events/ipi.h>
> > +#include <linux/sched/isolation.h>
> >
> >  #include "timer_migration.h"
> >  #include "tick-internal.h"
> >
> > @@ -1445,7 +1446,7 @@ static long tmigr_trigger_active(void *unused)
> >
> >  static int tmigr_cpu_unavailable(unsigned int cpu)
> >  {
> > -        struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> > +        struct tmigr_cpu *tmc = per_cpu_ptr(&tmigr_cpu, cpu);
> >          int migrator;
> >          u64 firstexp;
> >
> > @@ -1472,15 +1473,18 @@ static int tmigr_cpu_unavailable(unsigned int cpu)
> >
> >  static int tmigr_cpu_available(unsigned int cpu)
> >  {
> > -        struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> > +        struct tmigr_cpu *tmc = per_cpu_ptr(&tmigr_cpu, cpu);
> >
> >          /* Check whether CPU data was successfully initialized */
> >          if (WARN_ON_ONCE(!tmc->tmgroup))
> >                  return -EINVAL;
> >
> > +        /* Isolated CPUs don't participate in timer migration */
> > +        if (cpu_is_isolated(cpu))
> > +                return 0;
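
A side note on the two hunks above: this_cpu_ptr() resolves to the
per-CPU data of the CPU executing the code, which is only correct
while the callback runs on the affected CPU itself, whereas
per_cpu_ptr(&tmigr_cpu, cpu) resolves to the data of the cpu argument
no matter where the callback runs, which is what makes calling it for
a remote CPU possible. A minimal illustration, not from the patch:

	struct tmigr_cpu *tmc;

	tmc = this_cpu_ptr(&tmigr_cpu);     /* data of the calling CPU */
	tmc = per_cpu_ptr(&tmigr_cpu, cpu); /* data of 'cpu' itself    */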

> There are two main sets of isolated CPUs used by cpu_is_isolated() -
> boot-time isolated CPUs via the "isolcpus" and "nohz_full" boot
> command line options, and runtime isolated CPUs via cpuset isolated
> partitions. The check for runtime isolated CPUs is redundant here as
> those CPUs won't be passed to tmigr_cpu_available(). So this call is
> effectively removing the boot time isolated CPUs from the available
> cpumask, especially during the boot up process. Maybe you can add
> some comment about this behavioral change.
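
For reference, cpu_is_isolated() indeed combines those two sets; it
boils down to something like this (paraphrased from
<linux/sched/isolation.h>, not part of this patch):

	static inline bool cpu_is_isolated(int cpu)
	{
		/* boot time: "isolcpus" and "nohz_full" */
		return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
		       !housekeeping_test_cpu(cpu, HK_TYPE_TICK) ||
		       /* runtime: cpuset isolated partitions */
		       cpuset_cpu_is_isolated(cpu);
	}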

Good point, so the check is indeed needed. Since tmigr_cpu_available
is shared between isolated and offline CPUs, I added this check also
to make sure that bringing an isolated CPU back online won't make it
available for tmigr.

Do you mean I should make clear that the check in tmigr_cpu_available
is especially meaningful at boot time (i.e. when CPUs are first
brought online)?

Yeah, I probably should, good point. I had that kind of comment in v1
while allocating the mask and removed it while changing a few things.
I'm going to make that comment more verbose to clarify when exactly
it's needed.
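
Something along these lines (exact wording still to be decided, just
a sketch of the comment I have in mind):

	/*
	 * Isolated CPUs never take part in timer migration. The check
	 * matters especially at boot and when an isolated CPU comes
	 * back online: CPUs isolated on the kernel command line
	 * ("isolcpus", "nohz_full") do reach this callback and must
	 * not be marked available, while CPUs in isolated cpuset
	 * partitions are normally filtered out before getting here.
	 */
	if (cpu_is_isolated(cpu))
		return 0;

Thanks,
Gabriele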