Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access

From: Julien Desfossez
Date: Fri Mar 29 2019 - 09:36:02 EST


On Fri, Mar 22, 2019 at 8:09 PM Subhra Mazumdar <subhra.mazumdar@xxxxxxxxxx>
wrote:
> Is the core wide lock primarily responsible for the regression? I ran
> upto patch
> 12 which also has the core wide lock for tagged cgroups and also calls
> newidle_balance() from pick_next_task(). I don't see any regression. Of
> course
> the core sched version of pick_next_task() may be doing more but
> comparing with
> the __pick_next_task() it doesn't look too horrible.

On further testing and investigation, we also agree that spinlock contention
is not the major cause for the regression, but we feel that it should be one
of the major contributing factors to this performance loss.

To reduce the scope of the investigation of the performance regression, we
designed a couple of smaller test cases (compared to big VMs running complex
benchmarks) and it turns out the test case that is most impacted is a simple
disk write-intensive case (up to 99% performance drop). CPU-intensive and
scheduler-intensive tests (perf bench sched) behave pretty well.

On the same server we used before (2x18 cores, 72 hardware threads), with
all the non-essential services disabled, we setup a cpuset of 4 cores (8
hardware threads) and ran sysbench fileio on a dedicated drive (no RAID).
With sysbench running with 8 threads in this cpuset without core scheduling,
we get about 155.23 MiB/s in sequential write. If we enable the tag, we drop
to 0.25 MiB/s. Interestingly, even with 4 threads, we see the same kind of
performance drop.

Command used:

sysbench --test=fileio prepare
cgexec -g cpu,cpuset:test sysbench --threads=4 --test=fileio \
--file-test-mode=seqwr run

If we run this with the data in a ramdisk instead of a real drive, we donât
notice any drop. The amount of performance drops depends a bit depending on
the machine, but itâs always significant.

We spent a lot of time in the trace and noticed that a couple times during
every run, the sysbench worker threads are waiting for IO sometimes up to 4
seconds, all the threads wait for the same duration, and during that time we
donât see any block-related softirq coming in. As soon as the interrupt is
processed, sysbench gets woken up immediately. This long wait never happens
without the core scheduling. So we are trying to see if there is a place
where the interrupts are disabled for an extended period of time. The
irqsoff tracer doesnât seem to pick it up.

Any thoughts about that ?