Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

From: Steven Rostedt
Date: Thu Oct 26 2023 - 15:20:33 EST


On Thu, 26 Oct 2023 14:36:36 -0400
Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:

> > static inline unsigned long
> > xchg(volatile unsigned *ptr, unsigned new)
> > {
> > unsigned ret = new;
> >
> > asm volatile("xchg %b0,%1"
>
> which has an implicit lock prefix (xchg with a memory operand is a
> special-case):
>
> Quoting Intel manual:
>
> "If a memory operand is referenced, the processor’s locking protocol is
> automatically implemented for the duration of the exchange operation,
> regardless of the presence or absence of the LOCK prefix or of the value
> of the IOPL. (See the LOCK prefix description in this chapter for more
> information on the locking protocol.)"
>

Bah. I learn something new every day :-p

I thought xchg required specifying lock too. This means that cmpxchg is
actually faster than xchg! It's been a long time since I had written
assembly.

What makes this interesting though, is that when I ran my tests originally,
when my change was disabled, the xchg() code never executed, and it still
showed a significant improvement. That is, even with adding xchg(), it
still improved much more. But that's probably because the xchg() happens
after releasing the lock and after that it goes into a loop waiting for
another thread to make the update, which requires a lock cmpxchg(). Hence,
the xchg() slowdown wasn't in a critical path of the test.

Anyway, I changed the code to use:

static inline unsigned clrbit(volatile unsigned *ptr)
{
unsigned ret;

asm volatile("andb %b1,%0"
: "+m" (*(volatile char *)ptr)
: "iq" (0x2)
: "memory");

ret = *ptr;
*ptr = 0;

return ret;
}

I just used andb as that's a locally atomic operation. I could also use bts
(as Mathieu suggested to me on IRC). It doesn't matter as this is just
example code and is not in production. How this is implemented is not an
important part of the algorithm. Just that it can atomically clear the bit
it sets without a race with the kernel setting the bit it sets. I could
modify the code to put these bits in separate bytes as well. That's just an
implementation detail we can work on later.

I updated my test to have:

static bool no_rseq;

static void init_extend_map(void)
{
int ret;

if (no_rseq)
return;

ret = rseq(&rseq_map, sizeof(rseq_map), 0, 0);
perror("rseq");
printf("ret = %d (%zd) %p\n", ret, sizeof(rseq_map), &rseq_map);
}

static inline unsigned clrbit(volatile unsigned *ptr)
{
unsigned ret;

asm volatile("andb %b1,%0"
: "+m" (*(volatile char *)ptr)
: "iq" (0x2)
: "memory");

ret = *ptr;
*ptr = 0;

return ret;
}

static void extend(void)
{
if (no_rseq)
return;

rseq_map.cr_flags = 1;
}

static void unextend(void)
{
unsigned prev;

if (no_rseq)
return;

prev = clrbit(&rseq_map.cr_flags);
if (prev & 2) {
tracefs_printf(NULL, "Yield!\n");
sched_yield();
}
}

where the tests with it disabled do not run the extend or unextend() code
(as it makes sense to only perform that code with this feature, as that
would be part of the overhead to implement it).

Here's the results:

With no_rseq = true, with 5 runs, which were basically all the same, but
the best run was:

Ran for 4260797 times
Total wait time: 33.905306

And with no_rseq = false;

Most runs were like:

Ran for 5378845 times
Total wait time: 32.253018

But the worse run was:

Ran for 4991411 times
Total wait time: 31.745662

But with that lower "wait time" it probably was preempted by a something
else during that run, as the lower the "wait time" the better the result.

-- Steve