Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy

From: George Dunlap
Date: Thu Aug 27 2015 - 13:05:44 EST


On Thu, Aug 27, 2015 at 11:24 AM, George Dunlap
<george.dunlap@xxxxxxxxxx> wrote:
> On 08/18/2015 04:55 PM, Dario Faggioli wrote:
>> Hey everyone,
>>
>> So, as a follow-up to what we were discussing in this thread:
>>
>> [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>>
>> I started looking in more detail at scheduling domains in the Linux
>> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>> of interacting, while the thing I'm proposing here is completely
>> independent of them both.
>>
>> In fact, no matter whether vNUMA is supported and enabled, and no matter
>> whether CPUID is reporting accurate, random, meaningful or completely
>> misleading information, I think that we should do something about how
>> scheduling domains are built.
>>
>> The fact is, unless we use 1:1 pinning that is immutable across the
>> whole guest lifetime, scheduling domains should not be constructed, in
>> Linux, by looking at *any* topology information, because that just does
>> not make any sense when vCPUs move around.
>>
>> Let me state this again (hoping to make myself as clear as possible): no
>> matter how good a shape we put CPUID support in, and no matter how
>> beautifully and consistently it interacts with vNUMA, licensing
>> requirements and whatever else, it will always be possible for
>> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
>> on two different NUMA nodes at time t2. Hence, the Linux scheduler
>> should really not skew its load balancing logic toward either of those
>> two situations, as neither of them can be considered correct (since
>> nothing is!).
>>
>> For now, this only covers the PV case. The HVM case shouldn't be any
>> different, but I haven't looked at how to make the same thing happen
>> there as well.
>>
>> OVERALL DESCRIPTION
>> ===================
>> What this RFC patch does is, in the Xen PV case, configure scheduling
>> domains in such a way that there is only one of them, spanning all the
>> vCPUs of the guest.
>>
>> Note that the patch deals directly with scheduling domains, and there is
>> no need to alter the masks that will then be used for building and
>> reporting the topology (via CPUID, /proc/cpuinfo, sysfs, etc.). That is
>> the main difference between it and the patch proposed by Juergen here:
>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>>
>> This means that when, in the future, we fix CPUID handling and make it
>> comply with whatever logic or requirements we want, that won't have any
>> unexpected side effects on scheduling domains.
>>
>> Information about how the scheduling domains are constructed during
>> boot is available in 'dmesg', if the kernel is booted with the
>> 'sched_debug' parameter. It is also possible to look
>> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>>
>> With the patch applied, only one scheduling domain is created, called
>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>> tell that from the fact that every cpu* folder
>> in /proc/sys/kernel/sched_domain/ has only one subdirectory
>> ('domain0'), with all the tweaks and the tunables for our scheduling
>> domain.
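
The check described in the quoted paragraph (one 'domain0' subdirectory per cpu* folder) can be automated. The following is only a minimal sketch, not part of the patch: the function names are made up for illustration, the default path is the one quoted above, and reading a per-domain 'name' file is an assumption about what a kernel built with scheduler debugging exposes there.

```python
from pathlib import Path


def sched_domains(root="/proc/sys/kernel/sched_domain"):
    """Map each cpuN directory under root to its list of (domainM, name) pairs.

    `root` defaults to the path mentioned in the quoted text; the optional
    per-domain 'name' file is an assumption and is reported as None if absent.
    """
    result = {}
    for cpu_dir in sorted(Path(root).glob("cpu*")):
        domains = []
        for dom_dir in sorted(cpu_dir.glob("domain*")):
            name_file = dom_dir / "name"
            name = name_file.read_text().strip() if name_file.is_file() else None
            domains.append((dom_dir.name, name))
        result[cpu_dir.name] = domains
    return result


def is_flat(domains_by_cpu):
    """True if at least one CPU was found and every CPU has exactly one domain."""
    return bool(domains_by_cpu) and all(
        len(domains) == 1 for domains in domains_by_cpu.values()
    )


if __name__ == "__main__":
    found = sched_domains()
    for cpu, domains in found.items():
        print(cpu, domains)
    print("flat hierarchy:", is_flat(found))
```

On a guest with the patch applied one would expect `is_flat()` to return True; on a kernel that does not expose this directory the script simply finds nothing and reports False.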
>>
>> EVALUATION
>> ==========
>> I've tested this with UnixBench, and by looking at Xen build time, on
>> hosts with 16, 24 and 48 pCPUs. I've run the benchmarks in Dom0 only, for
>> now, but I plan to re-run them in DomUs soon (Juergen may be doing
>> something similar to this in DomU already, AFAUI).
>>
>> I've run the benchmarks with and without the patch applied ('patched'
>> and 'vanilla', respectively, in the tables below), and with different
>> numbers of build jobs (in the case of the Xen build) or of parallel
>> copies of the benchmarks (in the case of UnixBench).
>>
>> What I get from the numbers is that the patch almost always brings
>> benefits, in some cases even huge ones. There are a couple of cases
>> where we regress, but always only slightly so, especially when compared
>> to the magnitude of some of the improvements that we get.
>>
>> Bear also in mind that these results are gathered from Dom0, and without
>> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr. pCPUs). If
>> we move things to DomU and do overcommit at the Xen scheduler level, I
>> am expecting even better results.
>>
>> RESULTS
>> =======
>> To have a quick idea of how a benchmark went, look at the '%
>> improvement' row of each table.
>>
>> I'll put these results online, in a googledoc spreadsheet or something
>> like that, to make them easier to read, as soon as possible.
>>
>> *** Intel(R) Xeon(R) E5620 @ 2.40GHz
>> *** pCPUs 16 DOM0 vCPUS 16
>> *** RAM 12285 MB DOM0 Memory 9955 MB
>> *** NUMA nodes 2
>> ============================================================================================================
>> MAKE XEN (lower == better)
>> ============================================================================================================
>> # of build jobs           -j1               -j6               -j8             -j16**             -j24
>> vanilla/patched     vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
>> ------------------------------------------------------------------------------------------------------------
>>                      153.72   152.41    35.33    34.93     30.7    30.33    26.79    25.97    26.88    26.21
>>                      153.81   152.76    35.37    34.99    30.81    30.36    26.83    26.08       27    26.24
>>                      153.93   152.79    35.37    35.25    30.92    30.39    26.83    26.13    27.01    26.28
>>                      153.94   152.94    35.39    35.28    31.05    30.43     26.9    26.14    27.01    26.44
>>                      153.98   153.06    35.45    35.31    31.17     30.5    26.95    26.18    27.02    26.55
>>                      154.01   153.23     35.5    35.35     31.2    30.59    26.98     26.2    27.05    26.61
>>                      154.04   153.34    35.56    35.42    31.45    30.76    27.12    26.21    27.06    26.78
>>                      154.16    153.5    37.79    35.58    31.68    30.83    27.16    26.23    27.16    26.78
>>                      154.18   153.71    37.98    35.61    33.73     30.9    27.49    26.32    27.16     26.8
>>                       154.9   154.67    38.03    37.64    34.69    31.69    29.82    26.38     27.2    28.63
>> ------------------------------------------------------------------------------------------------------------
>> Avg.                154.067  153.241   36.177   35.536    31.74   30.678   27.287   26.184   27.055   26.732
>> ------------------------------------------------------------------------------------------------------------
>> Std. Dev.             0.325    0.631    1.215    0.771    1.352    0.410    0.914    0.116    0.095    0.704
>> ------------------------------------------------------------------------------------------------------------
>> % improvement             0.536             1.772             3.346             4.042             1.194
>> ============================================================================================================
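
For anyone wanting to re-derive the summary rows of the MAKE XEN tables: the quoted figures appear consistent with a plain arithmetic mean, a sample (n-1) standard deviation, and "% improvement" taken as the relative reduction of the mean build time. The sketch below is an illustration under those assumptions, using only the -j1 columns of the table above; the helper name is made up.

```python
from statistics import mean, stdev

# Build times copied from the -j1 columns of the first MAKE XEN table.
vanilla_j1 = [153.72, 153.81, 153.93, 153.94, 153.98,
              154.01, 154.04, 154.16, 154.18, 154.9]
patched_j1 = [152.41, 152.76, 152.79, 152.94, 153.06,
              153.23, 153.34, 153.5, 153.71, 154.67]


def improvement_pct(vanilla, patched):
    """Relative reduction of the mean build time in percent (positive == faster)."""
    v, p = mean(vanilla), mean(patched)
    return (v - p) / v * 100.0


# stdev() is the sample standard deviation (divides by n - 1).
print(f"vanilla: avg {mean(vanilla_j1):.3f}  std dev {stdev(vanilla_j1):.3f}")
print(f"patched: avg {mean(patched_j1):.3f}  std dev {stdev(patched_j1):.3f}")
print(f"% improvement: {improvement_pct(vanilla_j1, patched_j1):.3f}")
```

Since lower build time is better, a negative result from `improvement_pct()` corresponds to a regression, as in the -j1 and -j12 columns of the second host.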
>> ================================================================================================================================
>> UNIXBENCH
>> ================================================================================================================================
>> # parallel copies                         1 parallel        6 parallel        8 parallel       16 parallel**      24 parallel
>> vanilla/patched                         vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
>> --------------------------------------------------------------------------------------------------------------------------------
>> Dhrystone 2 using register variables     2302.2   2302.1  13157.8  12262.4  15691.5  15860.1  18927.7  19078.5  18654.3  18855.6
>> Double-Precision Whetstone                620.2    620.2   3481.2   3566.9   4669.2   4551.5   7610.1   7614.3  11558.9  11561.3
>> Execl Throughput                          184.3    186.7    884.6    905.3   1168.4   1213.6   2134.6   2210.2   2250.9     2265
>> File Copy 1024 bufsize 2000 maxblocks     780.8    783.3   1243.7   1255.5   1250.6   1215.7   1080.9   1094.2   1069.8   1062.5
>> File Copy 256 bufsize 500 maxblocks       479.8    482.8    781.8    803.6    806.4      781    682.9    707.7    698.2    694.6
>> File Copy 4096 bufsize 8000 maxblocks    1617.6   1593.5   2739.7   2943.4   2818.3   2957.8   2389.6   2412.6   2371.6   2423.8
>> Pipe Throughput                           363.9    361.6   2068.6   2065.6     2622   2633.5   4053.3   4085.9   4064.7   4076.7
>> Pipe-based Context Switching               70.6    207.2    369.1   1126.8    623.9   1431.3   1970.4   2082.9   1963.8     2077
>> Process Creation                          103.1      135      503    677.6    618.7    855.4     1138   1113.7   1195.6     1199
>> Shell Scripts (1 concurrent)              723.2    765.3   4406.4   4334.4   5045.4   5002.5   5861.9   5844.2   5958.8   5916.1
>> Shell Scripts (8 concurrent)             2243.7   2715.3   5694.7   5663.6   5694.7   5657.8   5637.1   5600.5   5582.9   5543.6
>> System Call Overhead                        330    330.1   1669.2   1672.4   2028.6   1996.6   2920.5   2947.1   2923.9   2952.5
>> System Benchmarks Index Score             496.8    567.5   1861.9     2106   2220.3   2441.3   2972.5   3007.9   3103.4   3125.3
>> --------------------------------------------------------------------------------------------------------------------------------
>> % increase (of the Index Score)              14.231            13.110             9.954             1.191             0.706
>> ================================================================================================================================
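
A note on reading the UNIXBENCH tables: UnixBench reports its overall index as the geometric mean of the per-test indices, and the quoted "System Benchmarks Index Score" rows are consistent with that up to rounding of the inputs. The "% increase" row then looks like the relative gain of the patched score over the vanilla one (higher is better here). The following is a rough sketch under those assumptions, using the "1 parallel" columns above; the helper name is made up.

```python
from statistics import geometric_mean

# Per-test indices from the "1 parallel" columns of the first UNIXBENCH table
# (Dhrystone ... System Call Overhead, in table order).
vanilla_1 = [2302.2, 620.2, 184.3, 780.8, 479.8, 1617.6,
             363.9, 70.6, 103.1, 723.2, 2243.7, 330.0]
patched_1 = [2302.1, 620.2, 186.7, 783.3, 482.8, 1593.5,
             361.6, 207.2, 135.0, 765.3, 2715.3, 330.1]


def increase_pct(vanilla_score, patched_score):
    """Relative gain of the patched score over vanilla in percent (higher == better)."""
    return (patched_score - vanilla_score) / vanilla_score * 100.0


# Recomputed from rounded per-test values, so expect small differences
# from the Index Score row reported in the table.
print(f"index score, vanilla: {geometric_mean(vanilla_1):.1f}")
print(f"index score, patched: {geometric_mean(patched_1):.1f}")

# Using the scores exactly as reported in the table:
print(f"% increase: {increase_pct(496.8, 567.5):.3f}")
```

Because the index is a geometric mean, a large gain in a single test (Pipe-based Context Switching roughly triples at 1 parallel copy) moves the overall score noticeably even when most other tests are flat.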
>>
>> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
>> *** pCPUs 24 DOM0 vCPUS 16
>> *** RAM 36851 MB DOM0 Memory 9955 MB
>> *** NUMA nodes 2
>> ============================================================================================================
>> MAKE XEN (lower == better)
>> ============================================================================================================
>> # of build jobs           -j1               -j8              -j12             -j24**             -j32
>> vanilla/patched     vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
>> ------------------------------------------------------------------------------------------------------------
>>                      119.49   119.47    23.37    23.29    20.12    19.85    17.99     17.9    17.82     17.8
>>                      119.59   119.64    23.52    23.31    20.16    19.99    18.19    18.05    18.23    17.89
>>                      119.59   119.65    23.53    23.35    20.19    20.08    18.26    18.09    18.35    17.91
>>                      119.72   119.75    23.63    23.41     20.2    20.14    18.54     18.1     18.4    17.95
>>                      119.95   119.86    23.68    23.42    20.24    20.19    18.57    18.15    18.44    18.03
>>                      119.97    119.9    23.72    23.51    20.38    20.31    18.61    18.21    18.49    18.03
>>                      119.97   119.91    25.03    23.53    20.38    20.42    18.75    18.28    18.51    18.08
>>                      120.01   119.98    25.05    23.93    20.39    21.69    19.99    18.49    18.52     18.6
>>                      120.24   119.99    25.12    24.19    21.67    21.76    20.08    19.74    19.73    19.62
>>                      120.66   121.22    25.16    25.36    21.94    21.85    20.26     20.3    19.92    19.81
>> ------------------------------------------------------------------------------------------------------------
>> Avg.                119.919  119.937   24.181    23.73   20.567   20.628   18.924   18.531   18.641   18.372
>> ------------------------------------------------------------------------------------------------------------
>> Std. Dev.             0.351    0.481    0.789    0.642    0.663    0.802    0.851    0.811    0.658    0.741
>> ------------------------------------------------------------------------------------------------------------
>> % improvement            -0.015             1.865            -0.297             2.077             1.443
>> ============================================================================================================
>> ================================================================================================================================
>> UNIXBENCH
>> ================================================================================================================================
>> # parallel copies                         1 parallel        8 parallel        12 parallel      24 parallel**      32 parallel
>> vanilla/patched                         vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
>> --------------------------------------------------------------------------------------------------------------------------------
>> Dhrystone 2 using register variables     2650.1   2664.6  18967.8  19060.4  27534.1  27046.8  30077.9  30110.6  30542.1  30358.7
>> Double-Precision Whetstone                713.7    713.5   5463.6   5455.1   7863.9   7923.8  12725.1  12727.8  17474.3  17463.3
>> Execl Throughput                          280.9    283.8   1724.4   1866.5   2029.5   2367.6     2370   2521.3     2453   2506.8
>> File Copy 1024 bufsize 2000 maxblocks     891.1    894.2     1423   1457.7   1385.6   1482.2   1226.1   1224.2   1235.9   1265.5
>> File Copy 256 bufsize 500 maxblocks       546.9    555.4      949    972.1    882.8    878.6    821.9    817.7    784.7    810.8
>> File Copy 4096 bufsize 8000 maxblocks    1743.4   1722.8   3406.5   3438.9   3314.3   3265.9   2801.9   2788.3   2695.2   2781.5
>> Pipe Throughput                           426.8    423.4   3207.9     3234   4635.1   4708.9     7326   7335.3   7327.2   7319.7
>> Pipe-based Context Switching              110.2    223.5    680.8   1602.2    998.6   2324.6   3122.1   3252.7   3128.6   3337.2
>> Process Creation                          130.7    224.4   1001.3   1043.6     1209   1248.2   1337.9   1380.4   1338.6   1280.1
>> Shell Scripts (1 concurrent)             1140.5   1257.5   5462.8   6146.4   6435.3   7206.1   7425.2   7636.2   7566.1   7636.6
>> Shell Scripts (8 concurrent)               3492   3586.7   7144.9     7307     7258   7320.2   7295.1   7296.7   7248.6   7252.2
>> System Call Overhead                      387.7    387.5   2398.4     2367   2793.8   2752.7   3735.7   3694.2   3752.1   3709.4
>> System Benchmarks Index Score             634.8    712.6   2725.8   3005.7   3232.4   3569.7   3981.3   4028.8   4085.2   4126.3
>> --------------------------------------------------------------------------------------------------------------------------------
>> % increase (of the Index Score)              12.256            10.269            10.435             1.193             1.006
>> ================================================================================================================================
>>
>> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
>> *** pCPUs 48 DOM0 vCPUS 16
>> *** RAM 393138 MB DOM0 Memory 9955 MB
>> *** NUMA nodes 2
>> ============================================================================================================
>> MAKE XEN (lower == better)
>> ============================================================================================================
>> # of build jobs           -j1              -j20              -j24             -j48**             -j62
>> vanilla/patched     vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched  vanilla  patched
>> ------------------------------------------------------------------------------------------------------------
>>                      267.78   233.25    36.53    35.53    35.98    34.99    33.46    32.13    33.57    32.54
>>                      268.42   233.92    36.82    35.56    36.12     35.2    34.24    32.24    33.64    32.56
>>                      268.85   234.39    36.92    35.75    36.15    35.35    34.48    32.86    33.67    32.74
>>                      268.98   235.11    36.96    36.01    36.25    35.46    34.73    32.89    33.97    32.83
>>                      269.03   236.48    37.04    36.16    36.45    35.63    34.77    32.97    34.12    33.01
>>                      269.54   237.05    40.33    36.59    36.57    36.15    34.97    33.09    34.18    33.52
>>                      269.99   238.24    40.45    36.78    36.58    36.22    34.99    33.69    34.28    33.63
>>                      270.11   238.48    41.13    39.98    40.22    36.24       38    33.92    34.35    33.87
>>                      270.96   239.07    41.66    40.81    40.59    36.35    38.99    34.19    34.49    37.24
>>                      271.84   240.89    42.07    41.24    40.63    40.06    39.07    36.04    34.69    37.59
>> ------------------------------------------------------------------------------------------------------------
>> Avg.                 269.55  236.688   38.991   37.441   37.554   36.165    35.77   33.402   34.096   33.953
>> ------------------------------------------------------------------------------------------------------------
>> Std. Dev.             1.213    2.503    2.312    2.288    2.031    1.452    2.079    1.142    0.379    1.882
>> ------------------------------------------------------------------------------------------------------------
>> % improvement            12.191             3.975             3.699             6.620             0.419
>> ============================================================================================================
>
> I'm a bit confused here as to why, if dom0 has 16 vcpus in all of your
> tests, you change the -j number (apparently) based on the number of
> pcpus available to Xen. Wouldn't it make more sense to stick with
> 1/6/8/16/24? That would allow us to have actually comparable numbers.
>
> But in any case, it seems to me that the numbers do show a uniform
> improvement and no regressions -- I think this approach looks really
> good, particularly as it is so small and well-contained.

That said, it's probably a good idea to make this optional somehow, so
that if people do decide to do a pinning / partitioning approach, the
guest scheduler actually can take advantage of topological
information.

-George