Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
From: Luis Machado
Date: Wed Mar 20 2024 - 04:13:08 EST
On 3/20/24 07:04, Tobias Huschle wrote:
> On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
>> On Tue, 19 Mar 2024 at 10:08, Tobias Huschle <huschle@xxxxxxxxxxxxx> wrote:
>>>
>>> On 2024-03-18 15:45, Luis Machado wrote:
>>>> On 3/14/24 13:45, Tobias Huschle wrote:
>>>>> On Fri, Mar 08, 2024 at 03:11:38PM +0000, Luis Machado wrote:
>>>>>> On 2/28/24 16:10, Tobias Huschle wrote:
>>>>>>>
>>>>>>> Questions:
>>>>>>> 1. The kworker getting its negative lag occurs in the following
>>>>>>> scenario
>>>>>>> - kworker and a cgroup are supposed to execute on the same CPU
>>>>>>> - one task within the cgroup is executing and wakes up the
>>>>>>> kworker
>>>>>>> - kworker with 0 lag, gets picked immediately and finishes its
>>>>>>> execution within ~5000ns
>>>>>>> - on dequeue, kworker gets assigned a negative lag
>>>>>>> Is this expected behavior? With this short execution time, I
>>>>>>> would
>>>>>>> expect the kworker to be fine.
>>>>>>
>>>>>> That strikes me as a bit odd as well. Have you been able to determine
>>>>>> how a negative lag
>>>>>> is assigned to the kworker after such a short runtime?
>>>>>>
>>>>>
>>>>> I did some more trace reading though and found something.
>>>>>
>>>>> What I observed if everything runs regularly:
>>>>> - vhost and kworker run alternating on the same CPU
>>>>> - if the kworker is done, it leaves the runqueue
>>>>> - vhost wakes up the kworker if it needs it
>>>>> --> this means:
>>>>> - vhost starts alone on an otherwise empty runqueue
>>>>> - it seems like it never gets dequeued
>>>>> (unless another unrelated task joins or migration hits)
>>>>> - if vhost wakes up the kworker, the kworker gets selected
>>>>> - vhost runtime > kworker runtime
>>>>> --> kworker gets positive lag and gets selected immediately next
>>>>> time
>>>>>
>>>>> What happens if it does go wrong:
>>>>> From what I gather, there seem to be occasions where the vhost either
>>>>> executes suprisingly quick, or the kworker surprinsingly slow. If
>>>>> these
>>>>> outliers reach critical values, it can happen, that
>>>>> vhost runtime < kworker runtime
>>>>> which now causes the kworker to get the negative lag.
>>>>>
>>>>> In this case it seems like, that the vhost is very fast in waking up
>>>>> the kworker. And coincidentally, the kworker takes, more time than
>>>>> usual
>>>>> to finish. We speak of 4-digit to low 5-digit nanoseconds.
>>>>>
>>>>> So, for these outliers, the scheduler extrapolates that the kworker
>>>>> out-consumes the vhost and should be slowed down, although in the
>>>>> majority
>>>>> of other cases this does not happen.
>>>>
>>>> Thanks for providing the above details Tobias. It does seem like EEVDF
>>>> is strict
>>>> about the eligibility checks and making tasks wait when their lags are
>>>> negative, even
>>>> if just a little bit as in the case of the kworker.
>>>>
>>>> There was a patch to disable the eligibility checks
>>>> (https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefesmat@xxxxxxxxxxxx/),
>>>> which would make EEVDF more like EVDF, though the deadline comparison
>>>> would
>>>> probably still favor the vhost task instead of the kworker with the
>>>> negative lag.
>>>>
>>>> I'm not sure if you tried it, but I thought I'd mention it.
>>>
>>> Haven't seen that one yet. Unfortunately, it does not help to ignore the
>>> eligibility.
>>>
>>> I'm inclined to rather propose propose a documentation change, which
>>> describes that tasks should not rely on woken up tasks being scheduled
>>> immediately.
>>
>> Where do you see such an assumption ? Even before eevdf, there were
>> nothing that ensures such behavior. When using CFS (legacy or eevdf)
>> tasks, you can't know if the newly wakeup task will run 1st or not
>>
>
> There was no guarantee of course. place_entity was reducing the vruntime of
> woken up tasks though, giving it a slight boost, right?. For the scenario
> that I observed, that boost was enough to make sure, that the woken up tasks
> gets scheduled consistently. This might still not be true for all scenarios,
> but in general EEVDF seems to be stricter with woken up tasks.
It seems that way, as EEVDF will do eligibility and deadline checks before scheduling a task, so
a task would have to satisfy both of those checks.
I think we have some special treatment for when a task initially joins the competition, in which
case we halve its slice. But I don't think there is any special treatment for woken tasks
anymore.
There was also a fix (63304558ba5dcaaff9e052ee43cfdcc7f9c29e85) to try to reduce the number of
wake up preemptions under some conditions, under the RUN_TO_PARITY feature.