Re: INFO: rcu detected stall in do_idle

From: Daniel Bristot de Oliveira
Date: Wed Oct 31 2018 - 13:58:25 EST


On 10/31/18 5:40 PM, Juri Lelli wrote:
> On 31/10/18 17:18, Daniel Bristot de Oliveira wrote:
>> On 10/30/18 12:08 PM, luca abeni wrote:
>>> Hi Peter,
>>>
>>> On Tue, 30 Oct 2018 11:45:54 +0100
>>> Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>> [...]
>>>>> 2. This is related to the perf_event_open syscall the reproducer does
>>>>> before becoming DEADLINE and entering the busy loop. Enabling
>>>>> perf swevents generates a lot of hrtimer load that happens in the
>>>>> reproducer task context. Now, DEADLINE uses rq_clock() for
>>>>> setting deadlines, but rq_clock_task() for doing runtime
>>>>> enforcement. In a situation like this it seems that the amount of
>>>>> irq pressure becomes pretty big (I'm seeing this on kvm, real hw
>>>>> should maybe do better, pain point remains I guess), so rq_clock()
>>>>> and rq_clock_task() might become more and more skewed w.r.t. each
>>>>> other. Since rq_clock() is only used when setting absolute
>>>>> deadlines for the first time (or when resetting them in certain
>>>>> cases), after a bit the replenishment code will start to see
>>>>> postponed deadlines always in the past w.r.t. rq_clock(). And this
>>>>> brings us back to the fact that the task is never stopped, since it
>>>>> can't keep up with rq_clock().
>>>>>
>>>>> - Not sure yet how we want to address this [1]. We could use
>>>>> rq_clock() everywhere, but tasks might be penalized by irq
>>>>> pressure (theoretically this would mandate that irqs are
>>>>> explicitly accounted for, I guess). I tried to use the skew
>>>>> between the two clocks to "fix" deadlines, but that puts us at
>>>>> risk of de-synchronizing the userspace and kernel views of deadlines.
>>>>
>>>> Hurm.. right. We knew of this issue back when we did it.
>>>> I suppose now it hurts and we need to figure something out.
>>>>
>>>> By virtue of being a real-time class, we do indeed need to have
>>>> deadlines on the wall-clock. But if we then don't account runtime on
>>>> that same clock, but on a potentially slower clock, we get the
>>>> problem that we can run longer than our period/deadline, which is
>>>> what we're running into here, I suppose.
>>>
>>> I might be hugely misunderstanding something here, but in my impression
>>> the issue is just that if the IRQ time is not accounted to the
>>> -deadline task, then the non-deadline tasks might be starved.
>>>
>>> I do not see this as a skew between two clocks, but as an accounting
>>> thing:
>>> - if we decide that the IRQ time is accounted to the -deadline
>>> task (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING disabled),
>>> then the non-deadline tasks are not starved (but of course the
>>> -deadline task executes for less than its reserved time in the
>>> period);
>>> - if we decide that the IRQ time is not accounted to the -deadline task
>>> (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING enabled), then
>>> the -deadline task executes for the expected amount of time (about
>>> 60% of the CPU time), but an IRQ load of 40% will starve non-deadline
>>> tasks (this is what happens in the bug that triggered this discussion).
>>>
>>> I think this might be seen as an admission control issue: when
>>> CONFIG_IRQ_TIME_ACCOUNTING is disabled, the IRQ time is accounted for
>>> in the admission control (because it ends up in the task's runtime),
>>> but when CONFIG_IRQ_TIME_ACCOUNTING is enabled the IRQ time is not
>>> accounted for in the admission test (the IRQ handler becomes some sort
>>> of entity with a higher priority than -deadline tasks, on which no
>>> accounting or enforcement is performed).
>>>
>>
>> I am sorry for taking so long to join in the discussion.
>>
>> I agree with Luca. I have seen this behavior twice before: first when we were
>> trying to make the rt throttling have a very short runtime for non-rt
>> threads, and then in the proof of concept of the semi-partitioned scheduler.
>>
>> At first, I started thinking of this as a skew between the two clocks and disabled
>> IRQ_TIME_ACCOUNTING. But by ignoring IRQ accounting, we are assuming that the
>> IRQ runtime will be accounted as the thread's runtime. In other words, we are
>> just sweeping the trash under the rug, where the rug is the worst case execution
>> time estimation/definition (which is an even more complex problem). In the
>> Brazilian part of my Ph.D. we are dealing with probabilistic worst case
>> execution time, and to be able to use probabilistic methods, we need to remove
>> the noise of the IRQs from the execution time [1]. So, IMHO, using
>> CONFIG_IRQ_TIME_ACCOUNTING is a good thing.
>>
>> The fact that we have barely any control over the execution of IRQs makes,
>> at first glance, the idea of considering an IRQ as a task seem
>> absurd. But it is not. An IRQ runs a piece of code that is, in the vast
>> majority of cases, not related to the current thread, so it runs another
>> "task". When more than one IRQ is pending, the processor
>> serves the IRQs in a predictable order [2], so the processor schedules the IRQs
>> like "tasks". Finally, there are precedence constraints among threads and IRQs.
>> For instance, the latency can be seen as the response time of the timer IRQ
>> handler, plus the delta between the return of the handler and the start of the
>> execution of cyclictest [3]. In the theory, the idea of precedence constraints
>> is also about "tasks".
>>
>> So IMHO, IRQs can be considered tasks (I am considering them that way in my
>> model), and the place to account for this would be in the admission test.
>>
>> The problem is that, to the best of my knowledge, there is no admission test
>> for such a task model/system:
>>
>> Two levels of schedulers: a high priority scheduler that schedules a non
>> preemptive task set (IRQs) under fixed priority (the processor's scheduler
>> does it, and on Intel it is fixed priority), and a lower priority task set
>> (threads) scheduled by the OS.
>>
>> But our current admission control is more of a safeguard than an exact
>> admission test - that is, for multiprocessor it is necessary, but
>> not sufficient. (Theoretically, it works for uniprocessor, but... there is a
>> paper by Rob Davis somewhere that shows that if we have "context switches"
>> (and so the scheduler, in our case) with different costs, then many things do
>> not hold true; for instance, Deadline Monotonic is not optimal... but I will
>> have to read more before going deeper into this point; anyway, for
>> multiprocessor it is only necessary.)
>>
>> With this in mind: we do *not* use/have an exact admission test for all cases.
>> By not having an exact admission test, we assume the user knows what he/she is
>> doing. In this case, if they have a high load of IRQs... they need to know that:
>>
>> 1) Their periods should be consistent with the "interference" they might receive.
>> 2) Their tasks can miss their deadlines because of IRQs (and there is no way
>> to avoid this without "throttling" IRQs...)
>>
>> So, is it worth putting duct tape on this case?
>>
>> My fear is that, by putting duct tape here, we would make things prone to more
>> complex errors/non-determinism... so...
>>
>> I think we have another point to add to the discussion at plumbers, Juri.
>
> Yeah, sure. My fear in a case like this, though, is that the task that
> ends up starving the others is "creating" the IRQ overhead on itself. Kind of
> a DoS, no?

I see your point.

But how about a non-rt thread that creates a lot of timers, and then a DL task
arrives and preempts it, receiving the interference from interrupts that were
caused by the previous thread?
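
Just to make the mechanism explicit: with CONFIG_IRQ_TIME_ACCOUNTING enabled,
the time spent serving IRQs is subtracted from the clock that the runtime
enforcement uses. A simplified sketch of update_rq_clock_task() in
kernel/sched/core.c (details trimmed):

	static void update_rq_clock_task(struct rq *rq, s64 delta)
	{
	#ifdef CONFIG_IRQ_TIME_ACCOUNTING
		/* Time spent in hardirq/softirq since the last update... */
		s64 irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;

		if (irq_delta > delta)
			irq_delta = delta;

		rq->prev_irq_time += irq_delta;
		/* ...is not charged to the task: clock_task lags rq->clock. */
		delta -= irq_delta;
	#endif
		rq->clock_task += delta;
	}

So the more IRQ pressure there is, the further rq_clock_task() (used to
account the runtime) falls behind rq_clock() (used to set the deadlines),
which is the skew Juri described above.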

Actually, enabling/disabling sched stats in a loop generates an IPI storm on all
(other) CPUs because of updates to jump labels (we will reduce/bound that with
the batching of jump label updates, but still, the problem will exist). And not
only that: iirc, we can also cause this with a madvise that causes a flush of pages.
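
Something like this (hypothetical sketch; it needs root to write the sysctl)
is enough to keep flipping the schedstats static key, and hence to keep
IPIing all CPUs:

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		/* Each write toggles the schedstats static key; each
		 * toggle is a jump label update that IPIs the (other)
		 * CPUs. */
		int fd = open("/proc/sys/kernel/sched_schedstats", O_WRONLY);

		if (fd < 0)
			return 1;

		for (;;) {
			pwrite(fd, "1", 1, 0);
			pwrite(fd, "0", 1, 0);
		}
	}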

> I'm seeing something along the lines of what Peter suggested as a
> last-resort measure that we probably still need to put in place.

I mean, I am not against the/a fix, I just think that... it is more complicated
than it seems.

For example: let's assume that we have a bad non-rt thread A on CPU 0 generating
IPIs because of static key updates, and a good dl thread B on CPU 1.

In this case, thread B could run for less than what was reserved for it, even
though it was not causing the interrupts. It is not fair to put a penalty on
thread B.

The same is valid for a dl thread running on the same CPU that is receiving a
lot of network packets destined for another application, and other legit cases.

In the end, if we want to avoid starving non-rt threads, we need to prioritize
them some of the time; but in that case, we are back to the DL server for
non-rt threads.
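
And just to make the admission test point from earlier concrete: treating the
IRQs as a top-priority "task" would mean adding their bandwidth to the
admission test, along the lines of this (hypothetical sketch; dl_admit() and
irq_bw are invented for illustration, and estimating irq_bw is precisely the
hard part):

	/* Admit a new deadline task only if the already reserved
	 * bandwidth, plus the (estimated) IRQ bandwidth, stays below
	 * the cap (sched_rt_runtime_us / sched_rt_period_us, 95% by
	 * default). */
	static bool dl_admit(u64 total_bw, u64 new_bw, u64 irq_bw, u64 max_bw)
	{
		return irq_bw + total_bw + new_bw <= max_bw;
	}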

Thoughts?

Thanks,
-- Daniel


> Thanks,
>
> - Juri
>