Re: [PATCH 0/4] Introduce QPW for per-cpu operations

From: Marcelo Tosatti

Date: Tue Feb 24 2026 - 13:29:08 EST


On Mon, Feb 23, 2026 at 10:56:15PM +0100, Frederic Weisbecker wrote:
> Le Thu, Feb 19, 2026 at 08:30:31PM +0100, Michal Hocko a écrit :
> > On Thu 19-02-26 12:27:23, Marcelo Tosatti wrote:
> > > Michal,
> > >
> > > Again, i don't see how moving operations to happen at return to
> > > kernel would help (assuming you are talking about
> > > "context_tracking,x86: Defer some IPIs until a user->kernel transition").
> >
> > Nope, I am not talking about IPIs, although those are an example of pcp
> > state as well. I am sorry I do not have a link handy, I am pretty sure
> > Frederic will have that. Another example, though, was vmstat flushes
> > that need to be pcp. There are many other examples.
>
> Here it is:
>
> https://lore.kernel.org/all/20250410152327.24504-1-frederic@xxxxxxxxxx/
>
> Thanks.

Frederic,

I think this is a valid solution; however, on systems with many CPUs in
nohz_full that perform system calls, can't there be a significant increase
in lru_lock contention? Consider 100+ CPUs each performing many system calls
that add 1 or 2 folios to per-CPU LRU lists.


Note: if you are confident that the above is not a problem,
this approach looks good to me.

commit eb709b0d062efd653a61183af8e27b2711c3cf5c
Author: Shaohua Li <shaohua.li@xxxxxxxxx>
Date: Tue May 24 17:12:55 2011 -0700

mm: batch activate_page() to reduce lock contention

The zone->lru_lock is heavily contended in workloads where activate_page()
is frequently used. We could do batch activate_page() to reduce the lock
contention. The batched pages will be added to the zone list when the pool
is full or page reclaim is trying to drain them.

For example, in a 4 socket 64 CPU system, create a sparse file and 64
processes; the processes share a mapping to the file. Each process reads
the whole file and then exits. The process exit will do unmap_vmas() and
cause a lot of activate_page() calls. In such a workload, we saw about 58%
total time reduction with the patch below. Other workloads with a lot of
activate_page() benefit a lot too.

...
The most significant are:
case-lru-file-readtwice -11.69%
case-mmap-pread-rand -15.26%
case-mmap-pread-seq -69.72%

Some Gemini answers (question was "list of nohz_full use cases"):

2. Scientific Simulation & Research

Research institutions (like CERN, NASA, or national labs) use nohz_full
for "tightly coupled" parallel workloads.

Workloads: Molecular dynamics, fluid dynamics (CFD), and weather forecasting (e.g., WRF models).

The "Barrier" Problem: In massive clusters using MPI (Message Passing
Interface), all CPUs often have to reach a synchronization barrier
before the next step of a simulation. If one CPU is delayed by a few
milliseconds due to a timer tick, all other thousands of CPUs sit idle
waiting for it. nohz_full prevents this "tail latency" from stalling the
entire supercomputer.

...

4. Competitive Benchmarking & Kernel Development
Performance engineers use this mode to get "clean" numbers when testing
new hardware or compilers.

Workloads: Core-to-core latency tests, cache-bandwidth benchmarks, and
standard suites like SPEC CPU.

Goal: Eliminating the "noise" of the operating system so that the
results reflect pure hardware performance.

...

Summary Table: Who uses nohz_full?
User Group        Primary Workload        Why they use it
Quant Firms       High-Frequency Trading  To prevent micro-stutter during trade execution.
Research Labs     MPI-based Simulations   To avoid the "slowest node" stalling the whole cluster.
Telcos/ISPs       5G/Packet Processing    To ensure wire-speed processing without interrupts.
Hardware Vendors  Chip Validation         To benchmark CPU performance without OS interference.


Here is how scientific simulations handle system calls:

1. The "Compute-Loop" (Low Syscall)

The core of a simulation (like a GROMACS molecular dynamics step) is
just raw math: fetching data from RAM, doing floating-point arithmetic
(AVX/SSE), and writing it back.

During the loop: The CPU stays in "Userspace" for millions of cycles
without ever asking the kernel for help.

Why it works: Since there are no system calls, nohz_full can
successfully turn off the timer tick, allowing the CPU to focus 100% on
the math.

2. The "Communication-Phase" (High Syscall)

System calls usually happen only at the end of a computation block, when
the simulation needs to talk to other nodes.

The Tools: MPI (Message Passing Interface) uses system calls like write,
sendmsg, or specialized RDMA calls to move data across the network.

The Pattern: These simulations follow a "Burst" pattern—long periods
of zero system calls (computation) followed by a quick burst of system
calls (synchronization).

3. When are they "Syscall Intensive"?

There are specific parts of a simulation that are syscall-intensive, but
researchers try to minimize them:

I/O Operations: Writing "checkpoints" or large trajectory files to disk
(using write()). This is why high-end HPC systems use Asynchronous I/O
or dedicated I/O nodes—to keep the compute cores from getting bogged
down in system calls.

Memory Allocation: Constantly calling malloc/free involves the brk or
mmap system calls. Optimized simulation tools pre-allocate all the
memory they need at startup to avoid this.