Re: user-space concurrent pipe buffer scheduler interactions

From: Michael Clark
Date: Wed Apr 03 2024 - 16:52:45 EST


On 4/4/24 05:56, Linus Torvalds wrote:
> On Tue, 2 Apr 2024 at 13:54, Michael Clark <michael@xxxxxxxxxxxxxxxx> wrote:
>>
>> I am working on a low latency cross-platform concurrent pipe buffer
>> using C11 threads and atomics.
>
> You will never get good performance doing spinlocks in user space
> unless you actually tell the scheduler about the spinlocks, and have
> some way to actually sleep on contention.
>
> Which I don't see you as having.

We can work on this.

So maybe it is possible to look at how many LOCK instructions were retired in the last scheduler quantum, ideally split into retired-success and retired-failed counts for interlocked compare-and-swap. Maybe it is just a performance counter and doesn't require perf tracing to be switched on?

Then you can probably make a queue of processes in lock contention, but the hard part is deducing who had contention with whom. I will need to think about this for a while. We know the latency when things are not contended because these critical sections are usually small. It's about 200-400ns, and you can get these numbers in a loop at boot up.
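
To make the boot-up calibration idea concrete, here is a rough sketch of timing an uncontended acquire/release pair in a loop. The lock word, the helper names and the use of timespec_get are all my assumptions for illustration, and the critical section here is empty, so it only gives a lower bound rather than the 200-400ns figure for real critical sections.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical lock word for calibration only: 0 = free, 1 = held. */
static atomic_int calib_lock;

static uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);  /* C11 wall clock; a rough estimate only */
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average cost in nanoseconds of one uncontended acquire/release pair. */
static uint64_t calibrate_uncontended_ns(unsigned iters)
{
    uint64_t start = now_ns();
    for (unsigned i = 0; i < iters; i++) {
        int expected = 0;
        /* Single thread, so the CAS only retries on a spurious weak failure. */
        while (!atomic_compare_exchange_weak_explicit(&calib_lock, &expected, 1,
                   memory_order_acquire, memory_order_relaxed))
            expected = 0;
        /* An empty critical section would go here. */
        atomic_store_explicit(&calib_lock, 0, memory_order_release);
    }
    return (now_ns() - start) / iters;
}
```

Calling calibrate_uncontended_ns(1000000) once at start-up would give a baseline to compare observed acquire times against.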

But I don't know how we can see spinning on acquires. It makes me think that the early relaxed/acquire comparison before the LOCK op is bad. It gave me a very minor performance boost, but it would break the strategy I just mentioned because we wouldn't have a LOCK CMPXCHG in our spin loop. Without the early comparison, we would know for certain that "that" process had a failed LOCK CMPXCHG.

So I would need to delete this line and other lines like this:

https://github.com/michaeljclark/cpipe/blob/13c0ad1a865b9cc0174fc8f61d76f37bdbf11d4d/include/buffer.h#L317
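
For anyone following along, the difference between the two spin loops looks roughly like this. It is a simplified sketch on a hypothetical one-word lock, not the actual cpipe code at the link above.

```c
#include <stdatomic.h>

/* Hypothetical lock word, not the cpipe data structure: 0 = free, 1 = held. */
typedef struct { atomic_int state; } spin_t;

/* Variant A: relaxed pre-check. While the lock is held the loop only loads,
   so a waiter can spin without ever issuing a locked instruction. */
static void spin_lock_precheck(spin_t *l)
{
    for (;;) {
        if (atomic_load_explicit(&l->state, memory_order_relaxed) == 0) {
            int expected = 0;
            if (atomic_compare_exchange_weak_explicit(&l->state, &expected, 1,
                    memory_order_acquire, memory_order_relaxed))
                return;
        }
    }
}

/* Variant B: CAS only. Every failed iteration is a failed compare-and-swap
   (LOCK CMPXCHG on x86), which is what a counter-based scheme could count. */
static void spin_lock_cas_only(spin_t *l)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(&l->state, &expected, 1,
               memory_order_acquire, memory_order_relaxed))
        expected = 0;
}

static void spin_unlock(spin_t *l)
{
    atomic_store_explicit(&l->state, 0, memory_order_release);
}
```

Variant A usually generates less cache-line traffic under contention, which is presumably where the small boost came from; variant B is the one that leaves a countable trail of failed locked operations.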

I also want a user-space wrapper for futexes for a waitlist_t that rechecks conditions and uses cond_timeout on old broken POSIX systems so that we won't deadlock due to a missed wake-up. FreeBSD, macOS and Windows are starting to look like they might have something we can use.
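
As a sketch of the fallback path only, this is roughly what I mean by rechecking conditions with a timeout. The waitlist_t layout, the function names and the predicate callback are all hypothetical, and I am assuming pthread_cond_timedwait is the timed wait available on the older POSIX systems. A futex-backed version would keep the same interface.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical waitlist_t; the name is from this email, the layout is not. */
typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t  cnd;
} waitlist_t;

static void waitlist_init(waitlist_t *w)
{
    pthread_mutex_init(&w->mtx, NULL);
    pthread_cond_init(&w->cnd, NULL);
}

/* Block until pred(arg) is true, re-checking at least every slice_ms
   milliseconds so a missed wake-up costs latency rather than a deadlock.
   Returns true if the predicate held, false once max_slices have expired. */
static bool waitlist_wait(waitlist_t *w, bool (*pred)(void *), void *arg,
                          long slice_ms, int max_slices)
{
    bool ok;
    pthread_mutex_lock(&w->mtx);
    while (!(ok = pred(arg)) && max_slices-- > 0) {
        struct timespec ts;
        timespec_get(&ts, TIME_UTC);  /* realtime, the default condvar clock */
        ts.tv_nsec += (slice_ms % 1000) * 1000000L;
        ts.tv_sec  += slice_ms / 1000 + ts.tv_nsec / 1000000000L;
        ts.tv_nsec %= 1000000000L;
        /* Timeout and spurious wake-up are treated alike: loop and re-check. */
        pthread_cond_timedwait(&w->cnd, &w->mtx, &ts);
    }
    pthread_mutex_unlock(&w->mtx);
    return ok;
}

/* Wake all waiters; each one re-evaluates its own predicate. */
static void waitlist_wake(waitlist_t *w)
{
    pthread_mutex_lock(&w->mtx);
    pthread_cond_broadcast(&w->cnd);
    pthread_mutex_unlock(&w->mtx);
}

/* Example predicate: true once an atomic flag becomes non-zero. */
static bool flag_is_set(void *arg)
{
    return atomic_load((atomic_int *)arg) != 0;
}
```

A producer would update its state and then call waitlist_wake(); a consumer would call something like waitlist_wait(&w, flag_is_set, &flag, 10, 100).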

WaitOnAddress in Windows has a compare not-equals, and supports 8, 16, 32, and 64 bit words. But when it is used to construct equals, which is what I need, or less-than or greater-than, it could suffer from thundering herd if used in a decentralized way in user-space. Maybe we would need an address waiter list for an address, stashed in a CPU struct for the lead waiter, and then centrally recheck the condition and, when appropriate, reschedule those sleeping in the queue for events on that address?
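
To show what I mean by constructing equals from a not-equals primitive, here is a minimal sketch of the decentralized version. The function names are hypothetical, and I am assuming atomic_uint is a plain 32-bit word on both platforms. The Linux branch uses FUTEX_WAIT, which as far as I can tell has the same "sleep while the word still equals the value I saw" shape as WaitOnAddress.

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(_WIN32)
#include <windows.h>               /* link with Synchronization.lib */
/* Sleep while *addr == seen, or until timeout_ms elapses. */
static void wait_while_equal(atomic_uint *addr, uint32_t seen, uint32_t timeout_ms)
{
    WaitOnAddress((volatile VOID *)addr, &seen, sizeof seen, timeout_ms);
}
#elif defined(__linux__)
#include <linux/futex.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>
/* Sleep while *addr == seen, or until timeout_ms elapses (relative timeout). */
static void wait_while_equal(atomic_uint *addr, uint32_t seen, uint32_t timeout_ms)
{
    struct timespec ts = { timeout_ms / 1000, (long)(timeout_ms % 1000) * 1000000L };
    syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, seen, &ts, NULL, 0);
}
#else
#error "this sketch only covers Windows and Linux"
#endif

/* Wait until *addr == target. Returns false if max_rounds waits pass first. */
static bool wait_until_equal(atomic_uint *addr, uint32_t target,
                             uint32_t timeout_ms, int max_rounds)
{
    for (;;) {
        uint32_t seen = atomic_load_explicit(addr, memory_order_acquire);
        if (seen == target)
            return true;
        if (max_rounds-- <= 0)
            return false;
        /* Every waiter wakes on any change to the word and re-checks here,
           even if the new value is not the one it wants: this is where the
           thundering herd comes from. */
        wait_while_equal(addr, seen, timeout_ms);
    }
}
```

A central recheck would move the `seen == target` test out of each waiter and into whoever owns the waiter list, so that only waiters whose condition now holds get rescheduled.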

Sorry I don't know how the Linux scheduler and futexes work internally.
I just want to use this stuff in user-space. I want a POSIX waitlist_t.

I am working on a tiny emulator for the Windows Hypervisor which is how I justify the present "embedded" version which spins. This pipe buffer is for a tiny test kernel to get rid of a janky lock around printf.

- https://github.com/michaeljclark/emu

Thanks,
Michael.