user-space concurrent pipe buffer scheduler interactions
From: Michael Clark
Date: Tue Apr 02 2024 - 16:56:02 EST
Folks,
I am working on a low-latency, cross-platform concurrent pipe buffer
using C11 threads and atomics. The code is portable, using a
<stdatomic.h> polyfill on Windows that wraps the intrinsics Microsoft
provides.
There is a detailed write-up with implementation details, source code,
tests, and benchmark results at the URL below:
- https://github.com/michaeljclark/cpipe/
I have been eagerly following Jens's work on io_uring, which is why I
am including him: I am currently using busy memory polling for
synchronization, so he may be interested in these scheduler findings.
The reason I am writing here is that I think I now have a pretty decent
test case for comparing the Windows and Linux schedulers side by side.
Let's just say it has been an eye-opening process, and I think folks
here might be interested in what I am seeing versus what we would
predict from Amdahl's Law and low-level cache ping-pong on atomics.
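As a back-of-the-envelope check: if s is the fraction of each transfer
serialized by the atomic section, Amdahl's Law bounds the speedup on n
threads to:

```
S(n) = 1 / (s + (1 - s) / n)
```

Cache ping-pong on the shared head/tail cache lines effectively makes s
grow with n, so past a small thread count we should expect throughput
to flatten or even fall. That is the shape to look for in the numbers
below.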
Let me cut to the chase: when I add threads on Windows, performance
increases, but when I add threads on Linux, performance decreases. I
don't know exactly why. Is Windows doing some topologically affine
scheduling, or using performance counters to inform its scheduling
decisions? I have checked the codegen, and the hot path is basically
two LOCK CMPXCHG instructions.
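For context, here is a minimal sketch of the kind of hot path I mean: a
producer claiming space by advancing a head offset with a C11
compare-exchange loop. The names and layout are illustrative, not the
actual cpipe API; on x86 the compare-exchange compiles to LOCK CMPXCHG,
which is exactly where the cross-core cache-line ping-pong happens.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* illustrative sketch, not the actual cpipe code */
typedef struct {
    _Atomic uint64_t head;  /* next byte offset to write */
    _Atomic uint64_t tail;  /* next byte offset to read  */
    size_t capacity;        /* power of two              */
} ring;

/* claim len bytes starting at *pos, or return 0 if the queue is
 * full (in which case the caller busy-polls and retries) */
static int ring_claim(ring *r, size_t len, uint64_t *pos)
{
    uint64_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    for (;;) {
        uint64_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail + len > r->capacity)
            return 0;  /* full: spin on tail (busy memory polling) */
        /* compiles to LOCK CMPXCHG on x86; on failure `head` is
         * reloaded with the current value and we retry */
        if (atomic_compare_exchange_weak_explicit(&r->head, &head,
                head + len, memory_order_acq_rel, memory_order_relaxed))
            break;
    }
    *pos = head;  /* data goes at head & (r->capacity - 1) */
    return 1;
}
```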
I ran bare-metal tests on Kaby Lake and Skylake processors on both OSes:
- `Windows 11 Version 23H2 Build 22631.3296`
- `Linux 6.5.0-25-generic #25~22.04.1-Ubuntu`
In any case, here are the numbers; I will let them speak for themselves:
# Minimum Latency (nanoseconds)
| | cpipe win11 | cpipe linux | linux pipes |
|:---------------------|------------:|------------:|------------:|
| Kaby Lake (i7-8550U) | ~219ns | ~362ns | ~7692ns |
| Skylake (i9-7980XE) | ~404ns | ~425ns | ~9183ns |
# Message Rate (messages per second)
| | cpipe win11 | cpipe linux | linux pipes |
|:---------------------|------------:|------------:|------------:|
| Kaby Lake (i7-8550U) | 4.55M | 2.71M | 129.62K |
| Skylake (i9-7980XE) | 2.47M | 2.35M | 108.89K |
# Bandwidth, 32KB buffer (1 thread)
| | cpipe win11 | cpipe linux | linux pipes |
|:---------------------|------------:|------------:|------------:|
| Kaby Lake (i7-8550U) | 2.91GB/sec | 1.36GB/sec | 1.72GB/sec |
| Skylake (i9-7980XE) | 2.98GB/sec | 1.44GB/sec | 1.67GB/sec |
# Bandwidth, 32KB buffer (4 threads)
| | cpipe win11 | cpipe linux |
|:---------------------|------------:|------------:|
| Kaby Lake (i7-8550U) | 5.56GB/sec | 0.79GB/sec |
| Skylake (i9-7980XE) | 7.11GB/sec | 0.89GB/sec |
I think we have a very useful test case here for the Linux scheduler. I
have been working on a generalization of a memory-polled user-space
queue; this is about the 5th iteration, in which I have been very
careful to treat modulo arithmetic and offset overflow as the normal
case, as the sketch below shows.
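To illustrate what I mean by overflow as the normal case: head and tail
are free-running unsigned counters, and because unsigned arithmetic in
C is modulo 2^N, the fill level head - tail stays correct across the
wrap. A tiny standalone demonstration (again, not the cpipe code
itself):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t capacity = 1u << 15;        /* 32KB ring            */
    uint32_t tail = UINT32_MAX - 100;    /* just before the wrap */
    uint32_t head = tail + 200;          /* overflows past zero  */
    uint32_t used = head - tail;         /* modulo 2^32          */
    assert(used == 200 && used <= capacity);
    printf("used=%u index=%u\n", used, head & (capacity - 1));
    return 0;
}
```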
I know it is a little unfair to compare latency with Linux pipes, and
we also waste a lot of time spinning when the queue is full. This is
where we would really like to use something like SENDUIPI, or UMONITOR
and UMWAIT, but I don't have access to silicon that supports them yet.
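To make the spinning concrete, here is roughly what the queue-full wait
looks like today, and where WAITPKG would slot in. This is a sketch
with illustrative names, not the actual cpipe code:

```c
#include <stdatomic.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>   /* _mm_pause */
#endif

/* busy memory polling: spin until the consumer publishes a new
 * tail offset, freeing space. PAUSE keeps the spin polite to a
 * sibling hyperthread. On WAITPKG silicon the loop body could be
 * UMONITOR on the tail's cache line plus UMWAIT, parking the core
 * until the store actually lands, with no syscall. */
static void wait_for_space(_Atomic uint64_t *tail, uint64_t seen)
{
    while (atomic_load_explicit(tail, memory_order_acquire) == seen) {
#if defined(__x86_64__) || defined(_M_X64)
        _mm_pause();
#endif
    }
}
```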
Regards,
Michael Clark