Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access

From: Julien Desfossez
Date: Thu Mar 21 2019 - 17:20:59 EST


On Tue, Mar 19, 2019 at 10:31 PM Subhra Mazumdar <subhra.mazumdar@xxxxxxxxxx>
wrote:
> On 3/18/19 8:41 AM, Julien Desfossez wrote:
> > The case where we try to acquire the lock on 2 runqueues belonging to 2
> > different cores requires the rq_lockp wrapper as well; otherwise we
> > frequently deadlock in there.
> >
> > This fixes the crash reported in
> > 1552577311-8218-1-git-send-email-jdesfossez@xxxxxxxxxxxxxxxx
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 76fee56..71bb71f 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2078,7 +2078,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
> >  		raw_spin_lock(rq_lockp(rq1));
> >  		__acquire(rq2->lock);	/* Fake it out ;) */
> >  	} else {
> > -		if (rq1 < rq2) {
> > +		if (rq_lockp(rq1) < rq_lockp(rq2)) {
> >  			raw_spin_lock(rq_lockp(rq1));
> >  			raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
> >  		} else {
> With this fix and my previous NULL pointer fix, my stress tests are
> surviving. I re-ran my 2 DB instance setup on a 44-core, 2-socket system
> by putting each DB instance in a separate core scheduling group. The
> numbers look much worse now.
>
> users    baseline   %stdev   %idle    core_sched   %stdev   %idle
> 16       1          0.3      66       -73.4%       136.8    82
> 24       1          1.6      54       -95.8%       133.2    81
> 32       1          1.5      42       -97.5%       124.3    89
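
To spell out why the rq_lockp() comparison in the quoted diff matters: with
core scheduling, rq_lockp() can map two distinct rq structs to the same
per-core lock, and the address order of the locks does not necessarily
follow the address order of the rqs, so comparing the rq pointers no longer
gives a consistent global lock order across CPUs. As an illustration only
(this is not code from the patch), the rule boils down to ordering on the
lock pointers themselves:

static inline void double_lock_by_lockp(raw_spinlock_t *a, raw_spinlock_t *b)
{
	if (a == b) {
		/* both rqs share one core-wide lock, take it once */
		raw_spin_lock(a);
	} else if (a < b) {
		raw_spin_lock(a);
		raw_spin_lock_nested(b, SINGLE_DEPTH_NESTING);
	} else {
		raw_spin_lock(b);
		raw_spin_lock_nested(a, SINGLE_DEPTH_NESTING);
	}
}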

We are also seeing a performance degradation of about 83% on the throughput
of 2 MySQL VMs (12 vcpus and 32GB of RAM each) under a stress test. The
server has 2 NUMA nodes, each with 18 cores (so a total of 72 hardware
threads). Each MySQL VM is pinned to a different NUMA node. The clients for
the stress tests are running on a separate physical machine; each client
runs 48 query threads. Only the MySQL VMs use core scheduling (all vcpus and
emulator threads). Overall, the server is 90% idle when the 2 VMs use core
scheduling, and 75% idle when they don't.

The rate of preemption vs normal "switch out" is about 1% with and without
core scheduling enabled, but the overall rate of sched_switch is 5 times
higher without core scheduling, which suggests some heavy contention in the
scheduling path.

On further investigation, we could see that the contention is mostly in the
way rq locks are taken. With this patchset, we lock the whole core if
cpu.tag is set for at least one cgroup. Due to this, __schedule() is more or
less serialized for the whole core, and that accounts for the performance
loss we are seeing. We also saw that newidle_balance() spends a considerable
amount of time in load_balance() due to the rq spinlock contention. Do you
think it would help if the core-wide locking were only performed when
absolutely needed?
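
For reference, the serialization we are describing comes from the way the
rq::lock wrapper redirects every runqueue lock on a tagged core to a single
shared lock. This is roughly the shape of the helper as we read it in the
series (the field and helper names below are approximations, not copied
verbatim from the patch):

static inline raw_spinlock_t *rq_lockp(struct rq *rq)
{
	/*
	 * Once core scheduling is enabled for this core, every sibling
	 * runqueue is protected by the leader rq's lock, so every
	 * __schedule() on the core serializes on that one spinlock.
	 */
	if (sched_core_enabled(rq))
		return &rq->core->__lock;

	return &rq->__lock;
}

If that reading is correct, every pick on a tagged core funnels through one
raw spinlock, which is consistent with the contention we are observing.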

In terms of isolation, we measured the time a thread spends co-scheduled
with either a thread from the same group, the idle thread, or a thread from
another group (a toy sketch of the classification rule follows the numbers
below). This is what we see for 60 seconds of a specific busy VM pinned to a
whole NUMA node (all its threads):

no core scheduling:
- local neighbors (19.989 % of process runtime)
- idle neighbors (47.197 % of process runtime)
- foreign neighbors (22.811 % of process runtime)

core scheduling enabled:
- local neighbors (6.763 % of process runtime)
- idle neighbors (93.064 % of process runtime)
- foreign neighbors (0.236 % of process runtime)

As a separate test, we tried to pin all the vcpu threads to a set of cores
(6 cores for 12 vcpus):

no core scheduling:
- local neighbors (88.299 % of process runtime)
- idle neighbors (9.334 % of process runtime)
- foreign neighbors (0.197 % of process runtime)

core scheduling enabled:
- local neighbors (84.570 % of process runtime)
- idle neighbors (15.195 % of process runtime)
- foreign neighbors (0.257 % of process runtime)
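
To be explicit about the methodology behind those percentages: for every
interval where the thread is on-cpu, we look at what is running on the
hardware sibling and bucket the time as local (same group), idle, or foreign
(any other group). Below is a toy, self-contained C sketch of that rule; the
sample values and group ids are made up for illustration, the real data
comes from sched traces:

#include <stdio.h>

enum neighbor { LOCAL, IDLE, FOREIGN };

struct sample {
	int sibling_group;	/* group id of the task on the sibling, 0 == idle */
	unsigned long ns;	/* runtime covered by this sample */
};

static enum neighbor classify(int my_group, int sibling_group)
{
	if (sibling_group == 0)
		return IDLE;
	return sibling_group == my_group ? LOCAL : FOREIGN;
}

int main(void)
{
	/* hypothetical samples for a thread in group 1 */
	struct sample s[] = { { 1, 3000 }, { 0, 5000 }, { 2, 2000 } };
	unsigned long sum[3] = { 0 }, total = 0;
	unsigned int i;

	for (i = 0; i < sizeof(s) / sizeof(s[0]); i++) {
		sum[classify(1, s[i].sibling_group)] += s[i].ns;
		total += s[i].ns;
	}

	printf("local %.3f%% idle %.3f%% foreign %.3f%%\n",
	       100.0 * sum[LOCAL] / total,
	       100.0 * sum[IDLE] / total,
	       100.0 * sum[FOREIGN] / total);
	return 0;
}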

Thanks,

Julien