Re: kernel/scheduler: The Linux scheduler doesn't scale to more than 8 cores?

From: Vincent Guittot
Date: Tue Nov 02 2021 - 11:25:31 EST


Hi Romain,

On Tue, 2 Nov 2021 at 14:23, Morotti, Romain (London)
<Romain.Morotti@xxxxxxx> wrote:
>
> Hello,
>
> Apologies if this is the wrong place; this is my first time submitting a bug report by email to the kernel.
>
> I was doing some tuning on compute grids and that got me looking into the scheduler.
>
> There are a few sysctl settings that adjust how long processes can be scheduled (sched_latency_ns, sched_min_granularity_ns, etc.).
> They are set to a few milliseconds by default and adjusted automatically with the number of cores.
>
> doc: (<value> ms * (1 + ilog(ncpus)), units: nanoseconds)
>
> Problem: the scaling doesn't work as intended; the CPU count it uses is capped at 8.
>
> The cap is in the function get_update_sysctl_factor():
> https://github.com/torvalds/linux/blob/8cb1ae19bfae92def42c985417cd6e894ddaa047/kernel/sched/fair.c#L174
>
> static unsigned int get_update_sysctl_factor(void)
> {
> 	unsigned int cpus = min_t(unsigned int, num_online_cpus(), 8);
> 	unsigned int factor;
>
> 	[...]
> 	case SCHED_TUNABLESCALING_LOG:
> 	default:
> 		factor = 1 + ilog2(cpus);
> 		break;
> 	}
>
> From the first line, the CPU count is capped at 8.
> Thus the scaling factor is never more than 4 (1 + log2(8)), no matter how many CPUs there are.
>
> There's also a linear scaling option, which is similarly capped at a factor of 8 because of the 8-CPU limit.

That's a good point; I had never noticed this limit on the scaling factor.

Peter mentioned that there were interactivity problems with large
scale factors at the time the cap was added. But the scheduler has
changed since then, and it would be interesting to run benchmarks on
more recent platforms with a larger factor.
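
To make the effect of the cap concrete, here is a minimal user-space
sketch of the factor computation with the cap turned into a parameter.
It is an illustration only, not the kernel code: sysctl_factor(),
ilog2_u(), scaled_ns() and the SCALING_* names are hypothetical
stand-ins, and treating cap == 0 as "no cap" is an assumption made
purely for the comparison.

```c
/* User-space stand-in for the kernel's ilog2(): floor(log2(n)), for n >= 1. */
static unsigned int ilog2_u(unsigned int n)
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/* Hypothetical names mirroring the NONE/LOG/LINEAR tunable-scaling modes. */
enum scaling { SCALING_NONE, SCALING_LOG, SCALING_LINEAR };

/*
 * Compute the tunable scaling factor for a given number of online CPUs.
 * cap is the maximum CPU count taken into account; the kernel hardcodes 8.
 * cap == 0 means "no cap" (an assumption of this sketch, not a kernel option).
 */
static unsigned int sysctl_factor(unsigned int online_cpus, unsigned int cap,
				  enum scaling mode)
{
	unsigned int cpus = online_cpus;

	if (cap && cpus > cap)
		cpus = cap;

	switch (mode) {
	case SCALING_NONE:
		return 1;
	case SCALING_LINEAR:
		return cpus;
	case SCALING_LOG:
	default:
		return 1 + ilog2_u(cpus);
	}
}

/* Apply a factor to a base tunable expressed in nanoseconds. */
static unsigned long long scaled_ns(unsigned long long base_ns,
				    unsigned int factor)
{
	return base_ns * factor;
}
```

With the cap at 8, sysctl_factor(128, 8, SCALING_LOG) gives 4, so a
base value of 6000000 ns (6 ms, assumed here as the sched_latency_ns
base) scales to 24 ms. Without the cap the same machine would get a
factor of 8, i.e. 48 ms, which is the kind of larger factor that would
need benchmarking.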

>
> Looking at the history, that code goes back to this commit from November 2011, shipped with kernel v3.3:
> https://github.com/torvalds/linux/commit/029632fbb7b7c9d85063cc9eb470de6c54873df3
> https://github.com/torvalds/linux/blob/029632fbb7b7c9d85063cc9eb470de6c54873df3/kernel/sched_fair.c#L122
>
> That was a fairly large patch that added code and moved code around, so it is not the original source.
>
> Looking further, I found these commits from around December 2009 that made the scheduler more configurable, shipped with kernel v2.6.33:
> https://github.com/torvalds/linux/commit/1983a922a1bc843806b9a36cf3a370b242783140
> https://github.com/torvalds/linux/blob/1983a922a1bc843806b9a36cf3a370b242783140/kernel/sched.c#L7035
>
> I think the ultimate origin is this commit from December 2009:
> https://github.com/torvalds/linux/commit/0bcdcf28c979869f44e05121b96ff2cfb05bd8e6
>
> --- unsigned int factor = 1 + ilog2(num_online_cpus());
> +++ unsigned int cpus = min(num_online_cpus(), 8U);
> +++ unsigned int factor = 1 + ilog2(cpus);
>
> Circa 2009, a CPU with 8+ cores was about the best money could buy (perhaps more on multi-socket servers if you had the budget).
> Servers have had tens of cores for a long time now. For reference, the latest x64 CPUs have up to 64 cores (128 threads): https://en.wikipedia.org/wiki/Epyc
>
> My guess is that the limit was hardcoded to reflect the hardware available at the time, and has been forgotten ever since.
> The doc was written much later and is simply incorrect; perhaps the limit went unnoticed? https://github.com/torvalds/linux/commit/2b4d5b2582deffb77b3b4b48a59cd36e9e1e14d9
>
> I think the cap may need adjusting to reflect current hardware, and the doc should be corrected?
>
> Regards.