Re: [PATCH 0/4] Introduce QPW for per-cpu operations

From: Frederic Weisbecker

Date: Tue Mar 03 2026 - 06:09:42 EST


On Thu, Feb 26, 2026 at 08:41:09AM -0300, Marcelo Tosatti wrote:
> On Wed, Feb 25, 2026 at 10:49:54PM +0100, Frederic Weisbecker wrote:
>
> <snip>
>
> > > There are specific parts of a simulation that are intensive, but
> > > researchers try to minimize them:
> > >
> > > I/O Operations: Writing "checkpoints" or large trajectory files to disk
> > > (using write()). This is why high-end HPC systems use Asynchronous I/O
> > > or dedicated I/O nodes—to keep the compute cores from getting bogged
> > > down in system calls.
> > >
> > > Memory Allocation: Constantly calling malloc/free involves the brk or
> > > mmap system calls. Optimized simulation tools pre-allocate all the
> > > memory they need at startup to avoid this.
> >
> > Ok. I asked a similar question and got this (you made me use an LLM for the
> > first time btw, I held out for 4 years... I'm sure I can wait 4 more years until
> > the next usage :o)
>
> You should use it more often, it can save a significant amount of time
> :-)

I fear the earth doesn't have the resources to serve daily LLM use to us
all. Meanwhile, it was a pleasant surprise to see one in action and have it
answer questions I had been asking myself for a long while. And I might use
it again on the rare occasions when a simple search engine request doesn't
do the job.

> > ### 2. The "Slow Path" (System Calls / Syscalls)
> >
> > Passing through the kernel (a syscall) is necessary in certain situations, but it is "expensive" because it forces a **switch into the kernel**, which pollutes CPU caches.
> >
> > * **Initialization:** During startup (`MPI_Init`), many syscalls are used to create sockets, map shared memory (`mmap`), and configure network interfaces.
> > * **Standard TCP/IP:** If you are not using a high-performance network (RDMA) but simple Ethernet instead, MPI must call `send()` and `recv()`, which are syscalls. The Linux kernel then takes over to manage the TCP/IP stack.
> > * **Sleep Mode (Blocking):** If an MPI process waits for a message for too long, it may decide to "go to sleep" to yield the CPU to another task via syscalls like `futex()` or `poll()`.
> >
> > **In summary:** MPI synchronization aims to be **100% User-Space** (via memory polling) to avoid syscall latency. It is precisely because MPI tries to bypass the kernel that we use `nohz_full`: we are asking the kernel not to even "knock on the CPU's door" with its clock interruptions.
>
> Of course, there is a cost to system calls. However, the reasoning
> "low latency applications must necessarily remain in userspace,
> therefore let's optimize only for that case" is limiting IMHO.
>
> We should avoid interruptions whenever possible on isolated CPUs
> (in userspace _and_ kernelspace).

Very low latency requirements really should bend toward full userspace.
But you're right that isolation (even full isolation with nohz_full) should
probably not be limited to that case. HPC shows such a use case, where the
workload is not perfectly isolated and yet nohz_full brings improvements.

Thanks.

--
Frederic Weisbecker
SUSE Labs