On Tue, Feb 06, 2018 at 04:30:03PM -0800, Subhra Mazumdar wrote:This logic may not be working as well as you may think it to be. I had
I meant the SMT balance patch. That does comparison with only one otherWe only do the full scan if we think to know there is indeed an idle
random core and takes the decision in O(1). Any potential scan of all cores
or cpus is O(n) and doesn't scale and will only get worse in future. That
applies to both select_idle_core() and select_idle_cpu().
core to be had, and if there are idle cores the machine isn't terribly
busy.
If there are no idle cores, we do not in fact scan everything. We limit
the amount of scanning based on the average idle time with a minimum of
4.
The rotor randomness is not the main point here. I think the benchmark
(Except when you switch off things like SIS_PROP, then you scan
unconditionally and reduce tail latency at the cost of throughput --
like said, some people want this).
O(1) sounds nice, but it has horrible worst case latencies.
And like I said, I don't see how a rotor is particularly more random
than the previous cpu the task you happen to be waking ran on.
Sorry about thatIs there any reason this randomized approach is not acceptable even ifLooking at just one other core has a fairly high chance of not finding
benchmarks show improvement? Are there other benchmarks I should try?
idle. I really cannot remember all the benchmarks, but Mike did
something a little less random but still O(1) a few years back and that
crashed and burned.
Also your suggestion to keep the SMT utilization but still do a traversal ofPlease learn to use your email client, that's near unreadable for
cores
in select_idle_core() while remembering the least loaded core will still
have
the problem of potentially traversing all cores. I can compare this with a
core
level only SMT balancing, is that useful to decide? I will also test on
SPARC
machines with higher degree of SMT.
whitespace damage, time is really too precious to try and untangle crap
like that.
I also had uperf and Oracle DB TPC-C numbers in the patch which showed
You had also mentioned to do it for only SMT >2, not sure I understand whyFor SMT2 you don't need the occupation counter with atomic crud, a
as even for SMT=2 (intel) benchmarks show improvement. This clearly shows
the scalability problem.
simple !atomic core-idle works just fine. And your 'patch' had soo many
moving parts it was too hard to say which aspect changed what.
hackbench really isn't _that_ interesting a benchmark and its fairly
easy to make it go faster, but doing so typically wrecks something else.
Ok, will find out and run them.
There's a whole array of netperf benchmarks which you have to run at
different utilization levels, there's the sysbench oltp stuff, which you
can run on MariaDB and PostgreSQL, again at different utilization levels
(both databases behave quite differently). There's ebizzy, tbench,
dbench and others.
There's also NUMA stuff, like NAS benchmarks and specJBB nonsense.
There's the facebook schbench, which is a bit finnicky to set up.
And I'm sure I'm forgetting half, look at the patches and regression
reports for more clues, that's what Google is for. You could have
figured all that out yourself, much of these are listed in the
changelogs and if you'd bothered to find lkml regression reports you
could find more.