Re: [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq

From: Mathieu Desnoyers
Date: Tue May 30 2023 - 10:27:20 EST


On 5/30/23 04:20, Florian Weimer wrote:
> * Mathieu Desnoyers:
>
>>> I don't see why we can't stick this directly into struct rseq because
>>> it's all public anyway.
>>
>> The motivation for moving this to a different cache line is to handle
>> the prior comment from Boqun, who is concerned that busy-waiting
>> repeatedly loading a field from struct rseq will cause false-sharing
>> and make other stores to that cache line slower, especially stores to
>> rseq_cs to begin rseq critical sections, thus slightly increasing the
>> overhead of rseq critical sections taken while mutexes are held.
>
> Hmm. For context, in glibc, we have to place struct rseq on a fixed
> offset from the start of a page (or even some larger alignment) for all
> threads. In the future (once we move the thread control block off the
> top of the userspace stack, where it resides since the LinuxThreads
> days), it is likely that the pointer difference between different
> threads will also be a multiple of a fairly large power of two
> (something like 2**20 might be common). Maybe this will make caching
> even more difficult?
>
>> If we want to embed this field into struct rseq with its own cache
>> line, then we need to add a lot of padding, which is inconvenient.
>>
>> That being said, perhaps this is premature optimization, what do you
>> think ?
>
> Maybe? I don't know how the access patterns will look like. But I
> suspect that once we hit this case, performance will not be great
> anyway, so the optimization is perhaps unnecessary?

What I dislike, though, is that contention on any lock that busy-waits
on the rseq sched_state would slow down all rseq critical sections of
that thread, a side effect we want to avoid.

I've done some additional benchmarks on my 8-core AMD laptop, and I
notice that things get especially bad whenever the store to
rseq_abi->rseq_cs is surrounded by other instructions that need to be
ordered with that store, e.g. a for loop doing 10 stores to a local
variable. If it is surrounded by instructions that don't need to be
ordered with respect to that store (e.g. a for loop of 10 iterations
issuing barrier() "memory" asm clobbers), the overhead is no longer
noticeable.
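
For illustration, here is a rough, self-contained sketch of the two loop
shapes being compared. This is not the actual benchmark code: struct
fake_rseq, scratch and the function names are stand-ins invented for the
example.

#include <stdint.h>

/* Simplified stand-in for the registered struct rseq. */
struct fake_rseq {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;	/* address of the active rseq_cs descriptor */
	uint32_t flags;
} __attribute__((aligned(32)));

/* In the contended case, another thread busy-loads a field in this line. */
static struct fake_rseq thread_rseq;
static int scratch[10];

/* Variant 1: the loop stores must remain ordered with the rseq_cs store. */
static void enter_cs_after_ordered_stores(uint64_t cs_addr)
{
	for (int i = 0; i < 10; i++)
		scratch[i] = i;
	__atomic_store_n(&thread_rseq.rseq_cs, cs_addr, __ATOMIC_RELAXED);
}

/* Variant 2: only compiler barriers precede the store, nothing to order. */
static void enter_cs_after_barriers(uint64_t cs_addr)
{
	for (int i = 0; i < 10; i++)
		__asm__ __volatile__("" ::: "memory");
	__atomic_store_n(&thread_rseq.rseq_cs, cs_addr, __ATOMIC_RELAXED);
}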


> The challenge is that once we put stuff at fixed offsets, we can't
> transparently fix it later. It would need more auxv entries with
> further offsets, or accessing this data through some indirection,
> perhaps via vDSO helpers.

Perhaps this is more flexibility/complexity than we really need. One possible approach would be to split struct rseq into sub-structures, e.g.:

rseq_len = overall size of all sub-structures.

auxv AT_RSEQ_ALIGN        = 256

auxv AT_RSEQ_FEATURE_SIZE = size of the first portion of struct rseq,
                            at most 256 bytes, meant to contain fields
                            stored/loaded from the thread doing the
                            registration.

auxv AT_RSEQ_SHARED_FEATURE_SIZE =
                            size of the 2nd portion of struct rseq,
                            starts at offset 256, at most 256 bytes,
                            meant to contain fields stored/loaded by
                            any thread.

Then we have this layout:

struct rseq {
	struct rseq_local {
		/* Fields accessed from local thread. */

	} local __attribute__((aligned(256)));
	struct rseq_shared {
		/* Shared fields. */

	} shared __attribute__((aligned(256)));
} __attribute__((aligned(256)));

And if someday AT_RSEQ_FEATURE_SIZE needs to grow over 256 bytes
(32 * u64), we can still extend with a new auxv entry after the "shared"
features.
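
As a sketch of how userspace could consume this, assuming the layout
above: AT_RSEQ_FEATURE_SIZE and AT_RSEQ_ALIGN already exist,
AT_RSEQ_SHARED_FEATURE_SIZE is the new entry proposed here (its value
below is only a placeholder), and rseq_shared_area() is just an
illustrative helper name.

#include <stddef.h>
#include <sys/auxv.h>

#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE		27	/* linux/auxvec.h */
#endif
#ifndef AT_RSEQ_ALIGN
#define AT_RSEQ_ALIGN			28	/* linux/auxvec.h */
#endif
#ifndef AT_RSEQ_SHARED_FEATURE_SIZE
#define AT_RSEQ_SHARED_FEATURE_SIZE	0	/* proposed entry, value TBD */
#endif

/*
 * Return a pointer to the shared sub-structure of a registered rseq
 * area, or NULL if the kernel does not expose one.
 */
static void *rseq_shared_area(void *rseq_area)
{
	unsigned long shared_size = getauxval(AT_RSEQ_SHARED_FEATURE_SIZE);

	if (!shared_size)
		return NULL;
	/* The shared portion starts at offset AT_RSEQ_ALIGN (256). */
	return (char *)rseq_area + getauxval(AT_RSEQ_ALIGN);
}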



>>> The TID field would be useful in its own right.
>>
>> Indeed, good point.
>>
>> While we are there, I wonder if we should use the thread_pointer() as
>> lock identifier, or if the address of struct rseq is fine ?
>
> Hard to tell until we'll see what the futex integration looks like, I
> think.

Good point. I can choose one way or another for the prototype, and then we'll see how things go with futex integration.
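
For the prototype, the two candidate identifiers would look roughly like
this (a sketch only; __builtin_thread_pointer() stands in for whatever
thread_pointer() helper ends up being used):

#include <stdint.h>
#include <linux/rseq.h>

/* Option 1: the architecture thread pointer (TLS base). */
static inline uintptr_t lock_owner_id_tp(void)
{
	return (uintptr_t)__builtin_thread_pointer();
}

/* Option 2: the address of this thread's registered struct rseq. */
static inline uintptr_t lock_owner_id_rseq(struct rseq *rs)
{
	return (uintptr_t)rs;
}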

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com