Re: [PATCH 0/4] Introduce QPW for per-cpu operations

From: Marcelo Tosatti

Date: Mon Mar 02 2026 - 10:59:41 EST

On Wed, Feb 25, 2026 at 10:49:54PM +0100, Frederic Weisbecker wrote:

<snip>

> > There are specific parts of a simulation that are intensive, but
> > researchers try to minimize them:
> >
> > I/O Operations: Writing "checkpoints" or large trajectory files to disk
> > (using write()). This is why high-end HPC systems use Asynchronous I/O
> > or dedicated I/O nodes—to keep the compute cores from getting bogged
> > down in system calls.
> >
> > Memory Allocation: Constantly calling malloc/free involves the brk or
> > mmap system calls. Optimized simulation tools pre-allocate all the
> > memory they need at startup to avoid this.
>
> Ok. I asked a similar question and got this (you made me use an LLM for the
> first time btw, I held out for 4 years... I'm sure I can wait 4 more years until
> the next usage :o)

You should use it more often, it can save a significant amount of time
:-)

> ### 2. The "Slow Path" (System Calls / Syscalls)
>
> Passing through the kernel (a syscall) is necessary in certain situations, but it is "expensive" because it forces a **context switch**, which flushes CPU caches.
>
> * **Initialization:** During startup (`MPI_Init`), many syscalls are used to create sockets, map shared memory (`mmap`), and configure network interfaces.
> * **Standard TCP/IP:** If you are not using a high-performance network (RDMA) but simple Ethernet instead, MPI must call `send()` and `recv()`, which are syscalls. The Linux kernel then takes over to manage the TCP/IP stack.
> * **Sleep Mode (Blocking):** If an MPI process waits for a message for too long, it may decide to "go to sleep" to yield the CPU to another task via syscalls like `futex()` or `poll()`.
>
> **In summary:** MPI synchronization aims to be **100% User-Space** (via memory polling) to avoid syscall latency. It is precisely because MPI tries to bypass the kernel that we use `nohz_full`: we are asking the kernel not to even "knock on the CPU's door" with its clock interruptions.

Of course, there is a cost to system calls. However, the position that
"low latency applications must necessarily remain in userspace,
therefore let's optimize only for that case" is limiting IMHO.

We should avoid interruptions whenever possible on isolated CPUs
(in userspace _and_ kernelspace).
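
To make the point concrete, here is a rough sketch of the spin-then-block
wait pattern described in the quoted text. It is an illustration only, not
code from any MPI implementation: wait_for_flag() and spin_limit are made-up
names, and sched_yield() stands in for the futex()/poll() calls a real
runtime would more likely use. The polling loop stays in userspace, but once
the spin budget runs out the task enters the kernel, even on an isolated CPU.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: wait until *flag becomes true.
 *
 * Polls the flag in userspace for up to spin_limit iterations. If it is
 * still clear after that, falls back to sched_yield(), which is a system
 * call on Linux, until the flag is set.
 *
 * Returns the number of sched_yield() calls made, i.e. how many times the
 * wait entered the kernel. Zero means the wait never left userspace.
 */
static unsigned long wait_for_flag(atomic_bool *flag, unsigned long spin_limit)
{
    unsigned long kernel_entries = 0;

    assert(flag != NULL);

    /* Fast path: pure userspace polling, no kernel entry. */
    for (unsigned long i = 0; i < spin_limit; i++) {
        if (atomic_load_explicit(flag, memory_order_acquire))
            return kernel_entries;
    }

    /* Slow path: every iteration that finds the flag clear is a syscall. */
    while (!atomic_load_explicit(flag, memory_order_acquire)) {
        sched_yield();
        kernel_entries++;
    }

    return kernel_entries;
}
```

A nonzero return value means the "userspace only" wait still ended up in
kernel code, which is the case the isolation work should also cover.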