Re: [PATCH RFC] sched: deferred set priority (dprio)
From: Sergey Oboguev
Date: Wed Aug 06 2014 - 21:26:44 EST
On Tue, Aug 5, 2014 at 10:41 PM, Mike Galbraith
<umgwanakikbuti@xxxxxxxxx> wrote:
>> > SCHED_NORMAL where priority escalation does not work as preemption proofing
>>
>> Remember, DPRIO is not for lock holders only.
>>
>> Using DPRIO within SCHED_NORMAL policy would make sense for an application that
>> has "soft" time-urgent section where it believes strong protection
>> from preemption
>> is not really necessary, and just a greater claim to CPU time share would do,
>> in cases where the application does not know beforehand if the section will be
>> short or long, and in majority of cases is short (sub-millisecond), but
>> occasionally can take longer.
>
> Every single time that SCHED_NORMAL task boosts its priority (nice)
> during a preemption, the math has already been done, vruntime has
> already been adjusted.
> Sure, when it gets the CPU back, its usage will
> be weighed differently, it will become more resistant to preemption, but
> in no way immune. There is nothing remotely deterministic about this,
> making it somewhat of an oxymoron when combined with critical section.
But you overlooked the point I was trying to convey in the paragraph you
are responding to.
Apart from SCHED_NORMAL being a marginal use case, if it is used at all, I do
not see it being used for a lock-holding or similar critical section where an
application wants to avoid preemption.
I can see DPRIO(SCHED_NORMAL) being used in the same cases where an application
would use nice for a temporary section, i.e. when it has a job that needs to be
processed relatively promptly over some time interval, but not super-urgently
and without hard guarantees: the application simply wants an improved claim on
CPU resources compared to normal threads over, say, the next half-second or so.
It is OK if the application gets preempted; all it cares about is the longer
timeframe ("the next half-second") rather than the shorter, immediate one
("the next millisecond").
The only reason anyone would want to use DPRIO instead of regular nice in this
case is that it might be unknown beforehand whether the job will be short or
will take longer, with the majority of work items being very short but some
occasionally taking longer. Using DPRIO here would cut the overhead for the
majority of section instances. To reiterate, this is a marginal and most likely
rare use case, but given the existence of a uniform interface I just do not see
why it should be blocked on purpose.
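For illustration, this is roughly the pattern such an application is stuck with
today: two syscalls bracket every section instance, even when the section turns
out to be sub-millisecond and no preemption threat ever materializes. This is
only a minimal sketch, not the DPRIO interface itself; process_work_item() is a
made-up placeholder, and raising priority above the default needs CAP_SYS_NICE
or a permissive RLIMIT_NICE.

#include <sys/time.h>
#include <sys/resource.h>
#include <errno.h>

static void process_work_item(void)
{
        /* application work: usually sub-millisecond, occasionally longer */
}

static void soft_urgent_section(void)
{
        int old_nice;

        errno = 0;
        old_nice = getpriority(PRIO_PROCESS, 0);        /* -1 can be a valid nice */
        if (old_nice == -1 && errno != 0)
                old_nice = 0;                           /* fall back to default */

        /* syscall #1: raise the claim to CPU share for the calling thread */
        setpriority(PRIO_PROCESS, 0, -10);

        process_work_item();

        /* syscall #2: drop back to the previous nice value */
        setpriority(PRIO_PROCESS, 0, old_nice);
}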
> If some kthread prioritizes _itself_ and mucks up application
> performance, file a bug report, that kthread is busted. Anything a user
> or application does with realtime priorities is on them.
kthreads do not need RT, they just use spinlocks ;-)
On a serious note though, I am certainly not saying that injudicious use of RT
(or even of nice) cannot disrupt the system, but is that reason enough to
summarily condemn judicious use as well?
>> I disagree. The exact problem is that it is not a developer who initiates the
>> preemption, but the kernel or another part of application code that is unaware
>> of other thread's condition and doing it blindly, lacking the information about
>> the state of the thread being preempted and the expected cost of its preemption
>> in this state. DPRIO is a way to communicate this information.
> What DPRIO clearly does NOT do is to describe critical sections to the
> kernel.
First of all, let's note that your argument is not with DPRIO as such. DPRIO,
after all, is not a separate scheduling mode, but just a method to reduce the
overhead of regular set-priority calls (i.e. sched_setattr & friends).
Your argument is with the use of elevated priority as such: you are saying that
using the RT priority range (or a high nice value) does not convey to the
kernel the information about the critical section.
I do not agree with this, not wholly anyway. It is obvious that set_priority
does convey some information about the section, so perhaps a more accurate
reformulation of your argument would be that this information is imperfect and
insufficient.
Let's try then to imagine what more perfect information could look like. It
should obviously be some cost function describing the cost that would be
incurred if the task were preempted, something that would say (in the simplest
form): "if you preempt me within the next T microseconds (unless I cancel or
modify this mode), the preemption will incur cost X upfront, further accruing
at rate Y".
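Purely as a strawman to make this concrete (nothing like this exists; every
name below is hypothetical):

#include <stdint.h>

/*
 * Hypothetical only. Semantics: "unless I cancel or modify this hint,
 * preempting me within the next valid_for_usec microseconds costs
 * cost_upfront, plus cost_rate for every further microsecond of delay."
 */
struct preempt_cost_hint {
        uint64_t valid_for_usec;        /* T: window the hint applies to     */
        uint64_t cost_upfront;          /* X: one-time cost of a preemption  */
        uint64_t cost_rate;             /* Y: extra cost accrued per usec    */
};

/* e.g. a hypothetical call, with NULL cancelling the hint:
 *      int set_preempt_cost_hint(const struct preempt_cost_hint *hint);
 */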
One issue I see with this approach is that in real life it might be very hard
for a developer to quantify the values of X, Y and T. A developer can easily
know that he wants to avoid preemption in a given section, but actually
quantifying the cost of preemption (X, Y) would take a lot of effort
(benchmarking), and furthermore the cost really cannot be assigned statically,
as it varies with the load pattern and site-specific configuration. Likewise,
when dealing with multiple competing contexts, a developer can typically tell
that task A is more important than task B, but quantifying the measure of their
relative importance might be quite difficult. Quantifying T is likely to be
just as hard.
(And even supposing the developer knew that the section completes within, say,
5 ms at three sigmas, is that reason enough to preempt the task at 6 ms for the
sake of a normal timesharing thread? I am uncertain.)
Thus it appears to me that even if such an interface existed today, developers
would be daunted by it and would prefer to use RT instead, as something more
manageable, controllable and predictable.
But then, suppose such an interface existed and tasks expressing their
critical-section information through it were -- within their authorized quotas
for T and X/Y -- given precedence over normal threads but remained preemptible
by RT or DL tasks. Would it not pretty much amount to the existence of a low-RT
range sitting just below the regular RT range, a range that tasks could enter
for a time? Just as they can enter the regular RT range now with set_priority,
also for a time.
Would it really be different from judicious use of existing RT, where tasks
controlling "chainsaws" run in the prio range 50-90, while database engine
threads use the prio range 1-10 in their critical sections?
(The only difference being that after interval T expires the task's priority is
knocked down -- which a judiciously written application does anyway, so the
difference amounts only to protection against bugs and runaways -- and the task
then becomes more subject to preemption, after which other threads are free to
use PI/PE to resolve the dependency if they know about it; if they do not, then
in the subset of use cases where spinning of old or incoming waiters cannot be
shut off, it is either back to using plain RT or sustaining uncontrollable
losses.)
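To make the "judicious use" split above concrete, a minimal sketch of what I
have in mind (illustrative priorities and made-up function names only; entering
the RT range needs CAP_SYS_NICE or an appropriate RLIMIT_RTPRIO, and error
checking is omitted):

#include <sched.h>
#include <string.h>

/* "Chainsaw"-controlling thread: sits permanently in the upper RT range. */
static void chainsaw_control_thread_setup(void)
{
        struct sched_param sp;

        memset(&sp, 0, sizeof(sp));
        sp.sched_priority = 60;                         /* somewhere in 50-90 */
        sched_setscheduler(0, SCHED_FIFO, &sp);
}

/* Database engine thread: enters the low RT range only for the section. */
static void db_critical_section(void)
{
        struct sched_param sp;

        memset(&sp, 0, sizeof(sp));
        sp.sched_priority = 5;                          /* somewhere in 1-10 */
        sched_setscheduler(0, SCHED_FIFO, &sp);

        /* ... hold the lock, do the latency-sensitive work ... */

        sp.sched_priority = 0;
        sched_setscheduler(0, SCHED_OTHER, &sp);        /* back to timesharing */
}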
I would be most glad to see a usable interface (other than RT) emerge for
providing the scheduler with information about a task's critical sections, but
for the reasons outlined above I am doubtful of the possibility.
Apart from this, and coming back to DPRIO: even if a solution more satisfactory
than judicious use of RT existed, how long might it take to be worked out? If
the history of EDF scheduling, from the ReTiS concept to the merge of
SCHED_DEADLINE into the 3.14 mainline, is any guide, it may take quite a while,
so a stop-gap solution would have value on timing considerations alone, until
something better emerges... that is, assuming it can, and ever does.
- Sergey