Re: [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq

From: Dmitry Vyukov
Date: Tue Sep 26 2023 - 20:40:10 EST


On Tue, 26 Sept 2023 at 13:52, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>
> >> I don't see why we can't stick this directly into struct rseq because
> >> it's all public anyway.
> >
> > The motivation for moving this to a different cache line is to handle
> > the prior comment from Boqun, who is concerned that busy-waiting that
> > repeatedly loads a field from struct rseq will cause false sharing and
> > make other stores to that cache line slower, especially stores to
> > rseq_cs to begin rseq critical sections, thus slightly increasing the
> > overhead of rseq critical sections taken while mutexes are held.
> >
> > If we want to embed this field into struct rseq with its own cache line,
> > then we need to add a lot of padding, which is inconvenient.
> >
> > That being said, perhaps this is premature optimization. What do you think?
>
> Hi Mathieu, Florian,
>
> This is exciting!
>
> I thought the motivation for moving rseq_sched_state out of struct rseq
> is the lifetime management problem. I assume that when a thread locks a
> mutex, it stores a pointer to its rseq_sched_state in the mutex state
> for other threads to poll. So a waiting thread would do something along
> the following lines:
>
> rseq_sched_state* state = __atomic_load_n(&mutex->sched_state, __ATOMIC_RELAXED);
> if (state && !(state->state & RSEQ_SCHED_STATE_FLAG_ON_CPU))
>     futex_wait();
>
> Now if the state is in struct rseq, which is stored in TLS,
> then the owning thread can unlock the mutex, exit, and unmap the TLS
> in between. Consequently, the load of state->state can cause a page fault.
>
> And we do want rseq to live in TLS to save one indirection.
>
> If rseq_sched_state is separated from struct rseq, then it can be
> allocated in type-stable memory that is never unmapped.
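>
> For concreteness, a minimal sketch of such type-stable allocation
> (the pool size and helper names are hypothetical; a freed slot may be
> reused for a new thread, but the memory itself is never unmapped):
>
> static struct rseq_sched_state states[MAX_THREADS]; /* never unmapped */
>
> static struct rseq_sched_state* alloc_sched_state(void)
> {
>     return &states[reserve_free_slot()]; /* slot reuse, no munmap */
> }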
>
> What am I missing here?
>
> However, if we can store this state in struct rseq, then an alternative
> interface would be for the kernel to do:
>
> rseq->cpu_id = -1;
>
> to denote that the thread is not running on any CPU.
> I think it kinda makes sense: rseq->cpu_id is the thread's current CPU,
> and -1 naturally means "not running at all". And we already store -1
> right after init, so it shouldn't be a surprising value.
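>
> Under that scheme, the waiter sketched above could poll struct rseq
> directly (a sketch; mutex->owner_rseq is a hypothetical field, and the
> TLS lifetime problem above would still need solving):
>
> struct rseq* rs = __atomic_load_n(&mutex->owner_rseq, __ATOMIC_RELAXED);
> if (rs && __atomic_load_n(&rs->cpu_id, __ATOMIC_RELAXED) == (uint32_t)-1)
>     futex_wait();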

As you may know, we experimented with "virtual CPUs" in tcmalloc. The
extension allows the kernel to assign dense virtual CPU numbers to
running threads instead of sparse real CPU numbers:

https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/linux_syscall_support.h#L30-L41

Recently I added another change that [ab]uses rseq in an interesting
way. We want to get notifications about thread rescheduling. A slightly
simplified version is as follows:
we don't use rseq.cpu_id_start for its original purpose, so instead we
store something else there with the high bit set. Real CPU numbers don't
have the high bit set (at least while you have fewer than 2B CPUs :)).
This allows us to distinguish the value we stored in rseq.cpu_id_start
from a real CPU id stored by the kernel.
Inside of an rseq critical section we check if rseq.cpu_id_start has the
high bit set, and if not, then we know that we were just rescheduled, so
we can do some additional work and update rseq.cpu_id_start to have the
high bit set again.
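
For illustration, the check could look roughly like this (a sketch with
hypothetical names; glibc exposes the rseq area differently depending on
version):

#define CACHED_BIT 0x80000000u /* real CPU ids never have this set */

/* Inside an rseq critical section. */
if ((__rseq_abi.cpu_id_start & CACHED_BIT) == 0) {
    /* The kernel overwrote cpu_id_start with a real CPU number,
     * i.e. we were rescheduled since we last tagged the field. */
    handle_reschedule();
    __rseq_abi.cpu_id_start = my_value | CACHED_BIT; /* re-arm the tag */
}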

In reality it's a bit more involved, since the field is actually 8
bytes and only partially overlaps rseq.cpu_id_start (it's an
8-byte pointer whose high 4 bytes overlap rseq.cpu_id_start):

https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/percpu.h#L101-L165
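
Roughly, the aliasing works like this on a little-endian machine (a
simplified sketch, not the exact tcmalloc layout):

union {
    struct {
        uint32_t ptr_low;      /* low half of the 8-byte pointer */
        uint32_t cpu_id_start; /* aliases the high half of the pointer */
    };
    uintptr_t tagged_ptr;      /* pointer whose top bit serves as the tag */
} overlap;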

I am wondering if we could extend the currently proposed interface in a
way that is more flexible and satisfies all of these use
cases (spinlocks, and the possibility of using virtual CPUs and
rescheduling notifications). In the end they all need a very similar
thing: the kernel writing some value to some user address when a thread
is descheduled.

The minimal support we need for tcmalloc is an 8-byte user address + the
kernel writing 0 to that address when a thread is descheduled.

The most flexible option to support multiple users
(malloc/spinlocks/something else) would be as follows:

User-space passes an array of structs with address + size (1/2/4/8
bytes) + value.
The kernel iterates over the array when the thread is descheduled and
writes the specified value at the provided address with the given size.
Something along the following lines (pseudo-code):

struct rseq {
    ...
    struct rseq_desched_notif_t* desched_notifs;
    int desched_notif_count;
};

struct rseq_desched_notif_t {
    void* addr;
    uint64_t value;
    int size;
};

static inline void rseq_preempt(struct task_struct *t)
{
    ...
    for (int i = 0; i < t->rseq->desched_notif_count; i++) {
        struct rseq_desched_notif_t *n = &t->rseq->desched_notifs[i];
        /* Pseudo-code: real kernel code would need to copy the
         * descriptor in from user memory first. */
        switch (n->size) {
        case 1: put_user1(n->addr, n->value); break;
        case 2: put_user2(n->addr, n->value); break;
        case 4: put_user4(n->addr, n->value); break;
        case 8: put_user8(n->addr, n->value); break;
        }
    }
}
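
On the user-space side, registration under this hypothetical ABI could
look like the sketch below: the spinlock example tags the lock word so
that waiters observe 0 once the owner is off CPU (field names follow the
pseudo-code above; __rseq_abi stands in for however the rseq area is
reached):

static struct rseq_desched_notif_t notifs[1];

void register_lock_owner_notif(uint32_t* lock_word)
{
    notifs[0].addr = lock_word;
    notifs[0].value = 0; /* kernel writes 0: "owner is descheduled" */
    notifs[0].size = 4;
    __rseq_abi.desched_notifs = notifs;
    /* Publish the count last so the kernel never sees a
     * half-initialized entry. */
    __atomic_store_n(&__rseq_abi.desched_notif_count, 1,
                     __ATOMIC_RELEASE);
}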