Peter Williams wrote:
Helge Hafting wrote:
On Wed, Dec 21, 2005 at 05:32:52PM +1100, Peter Williams wrote:
Trond Myklebust wrote:
[...]
Sorry. That theory is just plain wrong. ALL of those cases _ARE_ interactive sleeps.
It's not a theory. It's the result of observing, with top, a -j 16 build with the sources on an NFS mounted file system, both with and without the patches, and comparing that with the same builds with the sources on a local file system. Without the patches, the tasks in the kernel build all get the same dynamic priority as the X server and other interactive programs when the sources are on an NFS mounted file system. With the patches, they generally have dynamic priorities 6 to 10 slots higher than the X server and other interactive programs.
A process waiting for NFS data loses CPU time, which is spent on running something else. Therefore, it gains some priority so it won't be forever behind when it wakes up. Same as for any other I/O waiting.
That's more or less independent of this issue, as the distribution of CPU time among tasks is largely determined by the time slice mechanism; the dynamic priority is primarily about latency. (This distinction is slightly distorted by the fact that, under some circumstances, "interactive" tasks don't get moved to the expired list at the end of their time slice, but this usually won't matter as genuine interactive tasks aren't generally CPU hogs.) In other words, the issue that you raise is largely addressed by the time tasks spend on the active queue before moving to the expired queue, rather than by the order in which they run while on the active queue.
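For context, this is roughly how the dynamic priority is derived. It's a simplified, self-contained sketch from memory of the 2.6 scheduler, not the actual kernel/sched.c code; the constants and the effective_prio() name mirror the kernel, but the struct and main() are purely illustrative:

#include <stdio.h>

#define MAX_RT_PRIO	100	/* SCHED_NORMAL priorities start here */
#define MAX_PRIO	140
#define MAX_BONUS	10

struct task {
	int static_prio;	/* 100..139, derived from the nice value */
	int bonus;		/* 0..MAX_BONUS, derived from sleep_avg */
};

/* The bonus only moves a task up or down the priority ladder; it does
 * not change the size of its time slice, which is why long term CPU
 * distribution is mostly a matter of when tasks expire. */
static int effective_prio(const struct task *p)
{
	int prio = p->static_prio - (p->bonus - MAX_BONUS / 2);

	if (prio < MAX_RT_PRIO)
		prio = MAX_RT_PRIO;
	if (prio > MAX_PRIO - 1)
		prio = MAX_PRIO - 1;
	return prio;
}

int main(void)
{
	/* A build task wrongly credited with interactive sleep ends up
	 * at the same dynamic priority as the X server. */
	struct task cc1 = { .static_prio = 120, .bonus = 8 };
	struct task xserver = { .static_prio = 120, .bonus = 8 };

	printf("cc1: %d, X: %d\n", effective_prio(&cc1), effective_prio(&xserver));
	return 0;
}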
This problem is all about those tasks getting an inappropriate boost to improve their latency because they are mistakenly believed to be interactive.
One of the unfortunate side effects is that it can affect scheduler fairness: if these tasks acquire sufficient bonus points, the TASK_INTERACTIVE() macro will return true for them and they will be requeued on the active queue instead of the expired queue at the end of their time slice (provided EXPIRED_STARVING() doesn't prevent this), to the detriment of scheduling fairness.
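To make the mechanism concrete, here's the shape of that end-of-time-slice decision. This is simplified and self-contained; the real test lives in scheduler_tick() and uses the TASK_INTERACTIVE() and EXPIRED_STARVING() macros, whereas the function and helper names below are just illustrative:

#include <stdio.h>

/* Stand-ins for the real enqueue operations, for illustration only. */
static void enqueue_on_active(const char *name)  { printf("%s -> active\n", name); }
static void enqueue_on_expired(const char *name) { printf("%s -> expired\n", name); }

static void end_of_timeslice(const char *name, int looks_interactive,
			     int expired_starving)
{
	if (!looks_interactive || expired_starving)
		enqueue_on_expired(name);	/* the normal, fair path */
	else
		enqueue_on_active(name);	/* the interactive shortcut */
}

int main(void)
{
	/* A CPU hog misclassified as interactive keeps taking the shortcut
	 * and so keeps competing with genuinely interactive tasks. */
	end_of_timeslice("cc1 (NFS sources)", 1, 0);
	end_of_timeslice("cc1 (local sources)", 0, 0);
	return 0;
}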
Ideally, the fairness mechanism and the interactive responsiveness mechanism would be completely independent, but this is not the case because requeueing interactive tasks on the expired array could add unacceptably to their latency. As I said above, this slight divergence from perfect independence shouldn't matter, as genuine interactive processes aren't very CPU intensive.
In summary, inappropriate identification of CPU intensive tasks as interactive has two bad effects, not just one: 1) responsiveness problems for genuine interactive tasks due to the extra competition at their dynamic priority, and 2) a degradation of scheduling fairness.
For an example of the effect of inappropriate identification of CPU hogs as interactive tasks see the thread "[SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)" in this list.
Having had a closer think about the way the scheduler works, I'm now of the opinion that completely ignoring sleeps labelled as TASK_NONINTERACTIVE may be a mistake and that it might be more appropriate to treat them the same as TASK_UNINTERRUPTIBLE, but I'll bow to Ingo on this as he has a better understanding of the issues involved.
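For clarity, here is the distinction I'm drawing, as a rough self-contained sketch. The real logic lives in try_to_wake_up()/recalc_task_prio(); the function name, the enum and the cap parameter below are illustrative and the details are from memory, so treat it as an approximation rather than the kernel's code:

#include <stdio.h>

/* How a sleep could be credited towards sleep_avg under the two options:
 * ignore TASK_NONINTERACTIVE sleeps entirely (current behaviour), or cap
 * the credit the way uninterruptible sleeps are capped, so the task can
 * reach the interactive threshold but not climb beyond it. */
enum sleep_kind {
	SLEEP_INTERRUPTIBLE,	/* full interactive credit */
	SLEEP_UNINTERRUPTIBLE,	/* credit capped at the interactive threshold */
	SLEEP_NONINTERACTIVE	/* currently: no credit at all */
};

static unsigned long credit_sleep(enum sleep_kind kind, unsigned long sleep_avg,
				  unsigned long slept, unsigned long cap)
{
	switch (kind) {
	case SLEEP_NONINTERACTIVE:
		return sleep_avg;
	case SLEEP_UNINTERRUPTIBLE:
		if (sleep_avg + slept > cap)
			return sleep_avg > cap ? sleep_avg : cap;
		/* fall through: below the cap, credit it in full */
	default:
		return sleep_avg + slept;
	}
}

int main(void)
{
	/* Same long sleep, three treatments (arbitrary units). */
	printf("interruptible:   %lu\n", credit_sleep(SLEEP_INTERRUPTIBLE, 500, 400, 700));
	printf("uninterruptible: %lu\n", credit_sleep(SLEEP_UNINTERRUPTIBLE, 500, 400, 700));
	printf("noninteractive:  %lu\n", credit_sleep(SLEEP_NONINTERACTIVE, 500, 400, 700));
	return 0;
}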
Perhaps expecting a 16-way parallel make to have "no impact" is a bit optimistic. How about nicing the make, explicitly telling Linux that it isn't important?
Yes, but that shouldn't be necessary. If I do the same build on a local file system everything works OK and the tasks in the build have dynamic priorities 8 to 10 slots higher than the X server and other interactive programs.
Further analysis indicates that this would not be a complete solution, as the tasks would still be identified as interactive and given a bonus. Although the change of nice value would be sufficient to stop these tasks competing with the genuine interactive tasks, TASK_INTERACTIVE() would probably still evaluate to true for them (as it's effectively based on the bonus acquired, i.e. the difference between prio and static_prio), and they would still get preferential treatment at the end of their time slice, with a consequent degradation of scheduling fairness.
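A small illustration of why renicing alone doesn't help here (illustrative only: the looks_interactive() helper below stands in for TASK_INTERACTIVE(), and the real macro also scales its threshold with the nice value, a refinement omitted here):

#include <stdio.h>

/* The interactivity test is, roughly, "has the dynamic priority climbed
 * far enough above the static priority?".  Renicing shifts both numbers
 * together, so a mislabelled CPU hog can keep passing the test. */
static int looks_interactive(int prio, int static_prio, int delta)
{
	return prio <= static_prio - delta;
}

int main(void)
{
	/* Same bogus sleep bonus of 6 priority slots, before and after
	 * being reniced to +10. */
	printf("nice 0:   interactive = %d\n", looks_interactive(114, 120, 2));
	printf("nice +10: interactive = %d\n", looks_interactive(124, 130, 2));
	return 0;
}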
Or how about giving important tasks extra priority?
Only root can do that. But some operating systems do just that, e.g. Solaris has an IA scheduling class (which all X based programs run in) that takes precedence over programs in the TS class (the equivalent of Linux's SCHED_NORMAL). I'm not sure how they handle the privilege issues involved in stopping inappropriate programs from misusing the IA class. IA is really just TS with a boost, which is effectively the reverse of what the new SCHED_BATCH achieves. Arguably, SCHED_BATCH is the superior way of doing this as it doesn't cause any privilege issues: the switch to SCHED_BATCH can be made by the owner of the task.
The main drawback to the SCHED_BATCH approach is that it (currently) requires the user to set it explicitly on the relevant tasks. Its long term success would be greatly enhanced if programmers could be convinced to have their programs switch themselves to SCHED_BATCH unless they are genuinely interactive processes.
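For what it's worth, the switch itself is a one-liner from user space; a minimal sketch, assuming the kernel and headers in use export SCHED_BATCH (it's still a new addition at the time of writing):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
#ifdef SCHED_BATCH
	struct sched_param param = { .sched_priority = 0 };

	/* pid 0 means "the calling process"; no special privilege is
	 * needed to move yourself from SCHED_NORMAL to SCHED_BATCH. */
	if (sched_setscheduler(0, SCHED_BATCH, &param) == -1)
		perror("sched_setscheduler");
#else
	fprintf(stderr, "SCHED_BATCH not known to these headers\n");
#endif

	/* ... carry on with the non-interactive work ... */
	return 0;
}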