user-space concurrent pipe buffer scheduler interactions

From: Michael Clark
Date: Tue Apr 02 2024 - 16:56:02 EST


Folks,

I am working on a low-latency, cross-platform concurrent pipe buffer using C11 threads and atomics. The code is portable; on Windows it uses a <stdatomic.h> polyfill that wraps the intrinsics Microsoft provides. There is a detailed write-up with implementation details, source code, tests and benchmark results at the URL below:

- https://github.com/michaeljclark/cpipe/

I have been eagerly following Jens's work on io_uring, which is why I am including him: he may be interested in these scheduler findings, since I am currently using busy memory polling for synchronization.

The reason I am writing here is that I think I now have a pretty decent test case for comparing the Windows and Linux schedulers side by side. Let's just say it has been an eye-opening process, and I think folks here might be interested in what I am seeing and in what we would predict should happen based on Amdahl's Law and low-level cache ping-pong on atomics.

Let me cut to the chase. What I am observing is that when I add threads on Windows, performance increases, but when I add threads on Linux, performance decreases. I don't know exactly why. I am wondering whether Windows is doing some topologically affine scheduling, or whether it is using performance counters to guide its scheduling decisions. I have checked the codegen and it is basically two LOCK CMPXCHG instructions.
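
For readers less familiar with the pattern, the kind of compare-and-swap that compiles to LOCK CMPXCHG on x86-64 can be sketched roughly as below. This is an illustrative sketch only, not the cpipe source: `ring_state`, `ring_reserve` and `RING_CAPACITY` are hypothetical names, and the real code may order its loads and updates differently. Each writer reserves a byte range by advancing a shared offset, and every failed attempt means another core owned the cache line in between, which is where the ping-pong comes from.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_CAPACITY 1024u  /* hypothetical size; must be a power of two */

/* Hypothetical ring state: free-running offsets that are never reduced
 * modulo the capacity; only their difference is meaningful. */
typedef struct {
    _Atomic uint32_t head;   /* total bytes reserved by writers */
    _Atomic uint32_t tail;   /* total bytes released by readers */
} ring_state;

/* Try to reserve len bytes for writing. On success, store the starting
 * offset in *start and return true; return false if there is no room. */
static bool ring_reserve(ring_state *r, uint32_t len, uint32_t *start)
{
    assert(len <= RING_CAPACITY);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    for (;;) {
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        uint32_t used = head - tail;      /* unsigned, so wraparound is fine */
        if (RING_CAPACITY - used < len)
            return false;                 /* queue full for this request */
        /* On x86-64 this is a LOCK CMPXCHG. If it fails, head has been
         * refreshed with the current value and we simply retry. */
        if (atomic_compare_exchange_weak_explicit(&r->head, &head, head + len,
                memory_order_acq_rel, memory_order_relaxed)) {
            *start = head;
            return true;
        }
    }
}
```

A reader-side release would be a second, symmetric update on `tail`, which is one plausible way to end up with two locked instructions per message.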

I ran bare metal tests on Kaby Lake and Skylake processors on both OSes:

- `Windows 11 Version 23H2 Build 22631.3296`
- `Linux 6.5.0-25-generic #25~22.04.1-Ubuntu`

In any case, here are the numbers. I will let them speak for themselves:

# Minimum Latency (nanoseconds)

| | cpipe win11 | cpipe linux | linux pipes |
|:---------------------|------------:|------------:|------------:|
| Kaby Lake (i7-8550U) | ~219ns | ~362ns | ~7692ns |
| Skylake (i9-7980XE) | ~404ns | ~425ns | ~9183ns |

# Message Rate (messages per second)

| | cpipe win11 | cpipe linux | linux pipes |
|:---------------------|------------:|------------:|------------:|
| Kaby Lake (i7-8550U) | 4.55M | 2.71M | 129.62K |
| Skylake (i9-7980XE) | 2.47M | 2.35M | 108.89K |

# Bandwidth 32KB buffer (1-thread)

| | cpipe win11 | cpipe linux | linux pipes |
|:---------------------|------------:|------------:|------------:|
| Kaby Lake (i7-8550U) | 2.91GB/sec | 1.36GB/sec | 1.72GB/sec |
| Skylake (i9-7980XE) | 2.98GB/sec | 1.44GB/sec | 1.67GB/sec |

# Bandwidth 32KB buffer (4-threads)

| | cpipe win11 | cpipe linux |
|:---------------------|------------:|------------:|
| Kaby Lake (i7-8550U) | 5.56GB/sec | 0.79GB/sec |
| Skylake (i9-7980XE) | 7.11GB/sec | 0.89GB/sec |

I think we have a very useful test case here for the Linux scheduler. I have been working on a generalization of a memory-polled user-space queue, and this is about the fifth iteration, in which I have been very careful to treat modulo arithmetic and overflow as the normal case.
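
To make "overflow as the normal case" concrete, here is a minimal sketch of the arithmetic, assuming free-running 32-bit counters and a power-of-two capacity. The names `QUEUE_CAPACITY`, `queue_used`, `queue_free` and `queue_slot` are hypothetical and chosen for illustration; they are not taken from cpipe. Because unsigned subtraction is defined modulo 2^32, and 2^32 is a multiple of any power-of-two capacity, nothing special happens when a counter wraps.

```c
#include <assert.h>
#include <stdint.h>

#define QUEUE_CAPACITY 4096u   /* hypothetical size; must be a power of two */

/* Bytes currently queued, given free-running write and read counters.
 * The result is reduced modulo 2^32, so it stays correct after wr wraps. */
static inline uint32_t queue_used(uint32_t wr, uint32_t rd)
{
    return (uint32_t)(wr - rd);
}

/* Bytes still available to writers. */
static inline uint32_t queue_free(uint32_t wr, uint32_t rd)
{
    return QUEUE_CAPACITY - queue_used(wr, rd);
}

/* Position inside the buffer for a free-running counter value. */
static inline uint32_t queue_slot(uint32_t counter)
{
    return counter & (QUEUE_CAPACITY - 1u);
}
```

For example, with the read counter at `UINT32_MAX - 5` and the write counter already wrapped to `10`, `queue_used` still reports 16 bytes in flight.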

I know it is a little unfair to compare latency with Linux pipes, and we also waste a lot of time spinning on queue full. This is where we would really like to use something like SENDUIPI, UMONITOR and UMWAIT, but I don't have access to silicon that supports those yet.
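
Without those instructions, the usual software fallback is a bounded poll with backoff. The sketch below is a generic illustration of that technique, not what cpipe does today: `cpu_relax` and `poll_until_changed` are hypothetical names, and the PAUSE hint is only emitted for GCC or Clang on x86-64, an assumption made to keep the example portable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* CPU relax hint: PAUSE on x86-64 with GCC/Clang, a no-op elsewhere
 * (assumption made for this sketch). */
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#define cpu_relax() __builtin_ia32_pause()
#else
#define cpu_relax() ((void)0)
#endif

/* Poll *word until it differs from `seen` or the poll budget runs out.
 * Returns true if a change was observed, false if the budget expired. */
static bool poll_until_changed(_Atomic uint32_t *word, uint32_t seen,
                               uint32_t max_polls)
{
    uint32_t pauses = 1;
    for (uint32_t i = 0; i < max_polls; i++) {
        if (atomic_load_explicit(word, memory_order_acquire) != seen)
            return true;
        /* Back off between polls so the line is not hammered. */
        for (uint32_t p = 0; p < pauses; p++)
            cpu_relax();
        if (pauses < 64)
            pauses *= 2;    /* exponential backoff, capped at 64 pauses */
    }
    return false;
}
```

When the budget expires, the caller can choose to yield or block instead of continuing to burn a core; which of those is appropriate depends on the latency target.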

Regards,
Michael Clark