CPU_X is totally wasted in exclusive mode; the resource efficiency is
really poor.
Thus what we need is a way to ease contention in share mode and make
groups as exclusive as possible, to gain both performance and
resource efficiency.
The main idea of the group balancer is to fulfill this requirement by
balancing groups of tasks among groups of CPUs; consider this a
dynamic demi-exclusive mode.
Also look at the Oracle soft affinity patches.
Just like balancing tasks among CPUs, with GB a user can now put
CPUs X, Y, Z into three partitions and balance groups A, B, C into
these partitions, to make them as exclusive as possible.
The design is very similar to NUMA balancing: a task triggers work to
settle its group into a proper partition (the one with minimum
predicted load), then tries to migrate itself into it, so that groups
gradually settle into their most exclusive partitions.
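To make that settle step concrete, here is a minimal sketch
(illustrative only, not the patch code; gb_partition, gb_group and
gb_settle_group are made-up names): pick the partition whose predicted
load is lowest once the group's load is accounted for, remember it as
the group's preferred partition, and let the triggering task migrate
toward it.

  /*
   * Illustrative sketch only -- not the actual patch code.
   */
  struct gb_partition {
          unsigned long load;     /* aggregate load of this partition */
          int id;
  };

  struct gb_group {
          unsigned long load;     /* load contributed by this task group */
          int preferred;          /* partition the group settles on */
  };

  /*
   * Triggered from a task in the group (much like NUMA balancing work
   * is task-triggered): pick the partition with the minimum predicted
   * load once this group's load is added, and remember it as the
   * preferred partition; the triggering task then tries to migrate
   * itself there, so the group gradually settles into it.
   */
  static void gb_settle_group(struct gb_group *grp,
                              struct gb_partition *parts, int nr_parts)
  {
          unsigned long best_pred = ~0UL;
          int best = grp->preferred;
          int i;

          for (i = 0; i < nr_parts; i++) {
                  unsigned long pred = parts[i].load + grp->load;

                  /* our load already counts toward the current partition */
                  if (parts[i].id == grp->preferred)
                          pred -= grp->load;

                  if (pred < best_pred) {
                          best_pred = pred;
                          best = parts[i].id;
                  }
          }

          grp->preferred = best;
  }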
There are no words on the interaction between this and NUMA
balancing. NUMA balancing is already a bit tricky because it and the
regular load balancer can have conflicting goals; some of that is
mitigated by teaching the regular balancer about it.
I can't help but feel you're making the whole thing look like a
three-body problem. Also, regular balancing in the face of affinities
is already somewhat dicey. All that needs exploring.
How To Use:
To create partitions, for example run:
echo disable > /proc/gb_ctrl
echo "0-15;16-31;32-47;48-63;" > /proc/gb_ctrl
echo enable > /proc/gb_ctrl
That's just never going to happen; please look at the cpuset partition
stuff.
This will create 4 partitions containing CPUs 0-15, 16-31, 32-47 and
48-63 respectively.
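As a rough illustration of what the write handler behind such an
interface might do (not the actual patch code; gb_parse_partitions and
MAX_GB_PARTS are made-up names), one could split the string on ';' and
turn each range into a cpumask with cpulist_parse():

  #include <linux/cpumask.h>
  #include <linux/errno.h>
  #include <linux/string.h>

  #define MAX_GB_PARTS    8       /* hypothetical limit */

  static struct cpumask gb_parts[MAX_GB_PARTS];

  /* Parse "0-15;16-31;32-47;48-63;" into per-partition cpumasks. */
  static int gb_parse_partitions(char *buf)
  {
          char *range;
          int n = 0;

          while ((range = strsep(&buf, ";"))) {
                  if (!*range)
                          continue;       /* trailing ';' gives an empty field */
                  if (n >= MAX_GB_PARTS)
                          return -EINVAL;
                  if (cpulist_parse(range, &gb_parts[n]))
                          return -EINVAL;
                  n++;
          }

          return n;       /* number of partitions configured */
  }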
Then enable GB for your cgroup by writing a balancing period (in
milliseconds) to:
$CPU_CGROUP_PATH/cpu.gb_period_ms
And you can check:
$CPU_CGROUP_PATH/cpu.gb_stat
which gives output like:
PART-0 0-15 1008 1086 *
PART-1 16-31 0 2
PART-2 32-47 0 0
PART-3 48-63 0 1024
Each line shows the partition ID followed by its CPU range, the load
of the group on that partition, the load of the partition, and a star
marking the preferred partition.
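Purely to illustrate the fields (again not the patch code;
gb_print_part is a made-up helper), each such line could be emitted
along these lines:

  #include <linux/cpumask.h>
  #include <linux/seq_file.h>

  /*
   * One line per partition: id, CPU list, group load, partition load,
   * and a trailing '*' on the group's preferred partition.
   */
  static void gb_print_part(struct seq_file *m, int id,
                            const struct cpumask *cpus,
                            unsigned long grp_load,
                            unsigned long part_load, bool preferred)
  {
          seq_printf(m, "PART-%d %*pbl %lu %lu%s\n",
                     id, cpumask_pr_args(cpus),
                     grp_load, part_load, preferred ? " *" : "");
  }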
Testing Results:
In order to enlarge the differences, we tested on an ARM platform
with 128 CPUs and created 8 partitions according to the cluster info.
Since we picked benchmarks which benefit from exclusive mode, this is
more of a functional test than a performance test, to show that GB
helps win back the performance.
We created 8 cgroups, each running 'sysbench memory --threads=16 run';
the output in share mode is:
events/s (eps): 4181233.4646
events/s (eps): 3548328.2346
events/s (eps): 4578816.2412
events/s (eps): 4761797.3932
events/s (eps): 3486703.0455
events/s (eps): 3474920.9803
events/s (eps): 3604632.7799
events/s (eps): 3149506.7001
The output in GB mode is:
events/s (eps): 5472334.9313
events/s (eps): 4085399.1606
events/s (eps): 4398122.2170
events/s (eps): 6180233.6766
events/s (eps): 4299784.2742
events/s (eps): 4914813.6847
events/s (eps): 3675395.1191
events/s (eps): 6767666.6229
We created 4 cgroups, each running redis-server with 16 IO threads;
4 redis-benchmark instances per server show the average rps as:
                 share mode    gb mode     delta
PING_INLINE : 41154.84 42229.27 2.61%
PING_MBULK : 43042.07 44907.10 4.33%
SET : 34502.00 37374.58 8.33%
GET : 41713.47 45257.68 8.50%
INCR : 41533.26 44259.31 6.56%
LPUSH : 36541.23 39417.84 7.87%
RPUSH : 39059.26 42075.32 7.72%
LPOP : 36978.73 39903.15 7.91%
RPOP : 39553.32 42071.53 6.37%
SADD : 40614.30 44693.33 10.04%
HSET : 39101.93 42401.16 8.44%
SPOP : 42838.90 46560.46 8.69%
ZADD : 38346.80 41685.46 8.71%
ZPOPMIN : 41952.26 46138.14 9.98%
LRANGE_100 : 19364.66 20251.56 4.58%
LRANGE_300 : 9699.57 9935.86 2.44%
LRANGE_500 : 6291.76 6512.48 3.51%
LRANGE_600 : 5619.13 5658.31 0.70%
MSET : 24432.78 26517.63 8.53%
Signed-off-by: Cruz Zhao <cruzzhao@xxxxxxxxxxxxxxxxx>
Signed-off-by: Tianchen Ding <dtcccc@xxxxxxxxxxxxxxxxx>
Signed-off-by: Michael Wang <yun.wang@xxxxxxxxxxxxxxxxx>
Invalid SoB chain.
I'll not really have much time at the moment to look at the code.
Hopefully in a few weeks, but I first need to recover from a 2 week
break and then finish the umcg bits I was working on before that.