Re: [RFC PATCH] arm64: defconfig: Disable fine-grained task level IRQ time accounting

From: Dietmar Eggemann
Date: Wed Aug 05 2020 - 04:50:36 EST


On 04/08/2020 01:59, Valentin Schneider wrote:
>
> On 03/08/20 20:22, Thomas Gleixner wrote:
>> Valentin,
>>
>> Valentin Schneider <valentin.schneider@xxxxxxx> writes:
>>> On 03/08/20 16:13, Thomas Gleixner wrote:
>>>> Vladimir Oltean <olteanv@xxxxxxxxx> writes:
>>>>>> 1) When irq accounting is disabled, RT throttling kicks in as
>>>>>> expected.
>>>>>>
>>>>>> 2) With irq accounting the RT throttler does not kick in and the RCU
>>>>>> stall/lockups happen.
>>>>> What is this telling us?
>>>>
>>>> It seems that the fine grained irq time accounting affects the runtime
>>>> accounting in some way which I haven't figured out yet.
>>>>
>>>
>>> With IRQ_TIME_ACCOUNTING, rq_clock_task() will always be incremented by a
>>> lesser-or-equal value than when not having the option; you start with the
>>> same delta_exec but slice some for the IRQ accounting, and leave the rest
>>> for the rq_clock_task() (+paravirt).
>>>
>>> IIUC this means that if you spend e.g. 10% of the time in IRQ and 90% of
>>> the time running the stress-ng RT tasks, despite having RT tasks hogging
>>> the entirety of the "available time" it is still only 90% runtime, which is
>>> below the 95% default and the throttling doesn't happen.
>>
>> totaltime = irqtime + tasktime
>>
>> Ignoring irqtime and pretending that totaltime is what the scheduler
>> can control and deal with is naive at best.
>>
>
> Agreed, however AFAICT rt_time is only incremented by rq_clock_task()
> deltas, which don't include IRQ time with IRQ_TIME_ACCOUNTING=y. That would
> then be directly compared to the sysctl runtime.
>
> Adding some prints in sched_rt_runtime_exceeded() and running this test
> case on my Juno, I get:
> # IRQ_TIME_ACCOUNTING=y
> cpu=2 rt_time=713455220 runtime=950000000 rq->avg_irq.util_avg=265
> (rt_time oscillates between [70.1e7, 75.1e7]; avg_irq between [220, 270])
>
> # IRQ_TIME_ACCOUNTING=n
> cpu=2 rt_time=963035300 runtime=949951811
> (rt_time oscillates between [94.1e7, 96.1e7])
>
> Throttling happens for IRQ_TIME_ACCOUNTING=n and doesn't for
> IRQ_TIME_ACCOUNTING=y - clearly the accounted rt_time isn't high enough for
> that to happen, and it does look like what is missing in rt_time (or what
> should be subtracted from the available runtime) is there in the avg_irq.

I agree that w/ IRQ_TIME_ACCOUNTING=y rt_rq->rt_time isn't high enough
in this testcase.

stress-ng-hrtim-1655 [001] 462.897733: bprint: update_curr_rt:
rt_rq->rt_time=416716900 rt_rq->rt_runtime=950000000
rt_b->rt_runtime=950000000
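
For reference, this is where the irq time gets carved out of the task
clock; simplified from update_rq_clock_task() in kernel/sched/core.c
(details differ slightly between kernel versions):

static void update_rq_clock_task(struct rq *rq, s64 delta)
{
        s64 __maybe_unused steal = 0, irq_delta = 0;

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
        irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;

        /* irq time is clamped to the total delta ... */
        if (irq_delta > delta)
                irq_delta = delta;

        rq->prev_irq_time += irq_delta;
        /* ... and whatever went to irq is removed from the task clock */
        delta -= irq_delta;
#endif
        /* (paravirt steal time handling elided) */

        rq->clock_task += delta;

#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
        if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
                update_irq_load_avg(rq, irq_delta + steal);
#endif
}

So with a sizeable irq load, rq_clock_task() only advances by what is
left after the irq slice, and that is all update_curr_rt() ever adds to
rt_rq->rt_time.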

The 5% reservation (1 - sched_rt_runtime_us/sched_rt_period_us) for CFS
is massively eclipsed by irqtime.
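
To put rough numbers on it: the default budget is
sched_rt_runtime_us / sched_rt_period_us = 950000 / 1000000 = 95%,
leaving 5% for CFS. In your IRQ_TIME_ACCOUNTING=y trace,
avg_irq.util_avg ~= 265 out of SCHED_CAPACITY_SCALE (1024), i.e.
roughly 26% of the CPU in irq context, while rt_time sits at ~70-75%
of the 1s period. irq + rt roughly fills the period, but the
throttling check only ever sees the ~70-75%, which stays comfortably
below the 95% threshold.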

It's true that avg_irq tracks 'irq_delta + steal' time, but it is meant
to (potentially) reduce CPU capacity. It's also CPU- and
frequency-invariant (your CPU2 is a big CPU, so no issue here).
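
The consumer side is the capacity path in fair.c; roughly (simplified,
the mainline scale_rt_capacity() has changed shape a bit over versions
and also deals with thermal pressure):

static unsigned long scale_rt_capacity(int cpu)
{
        struct rq *rq = cpu_rq(cpu);
        unsigned long max = arch_scale_cpu_capacity(cpu);
        unsigned long used, free, irq;

        irq = cpu_util_irq(rq);         /* rq->avg_irq.util_avg */
        if (unlikely(irq >= max))
                return 1;

        used = READ_ONCE(rq->avg_rt.util_avg);
        used += READ_ONCE(rq->avg_dl.util_avg);
        if (unlikely(used >= max))
                return 1;

        free = max - used;

        /* scale_irq_capacity(): free * (max - irq) / max */
        return scale_irq_capacity(free, irq, max);
}

So irq time already discounts the capacity CFS sees; it just never
shows up in rt_rq->rt_time.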

Could an rq_clock(rq)-derived rt_rq signal be used to compare against
rt_runtime?
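
Something along these lines is what I mean - purely a hypothetical
sketch (not a tested patch), and the exec_start_wall stamp is made up
for illustration:

/*
 * Hypothetical sketch only: throttle on a wall-clock (rq_clock())
 * based delta so that irq time eats into the RT budget, while the
 * task's exec_runtime keeps using the rq_clock_task() delta as today.
 * 'exec_start_wall' does not exist in mainline.
 */
static void update_curr_rt_sketch(struct rq *rq)
{
        struct task_struct *curr = rq->curr;
        struct sched_rt_entity *rt_se = &curr->rt;
        u64 delta_task = rq_clock_task(rq) - curr->se.exec_start;
        u64 delta_wall = rq_clock(rq) - curr->se.exec_start_wall;

        if ((s64)delta_task <= 0)
                return;

        /* exec_runtime / cputime accounting stays task-clock based */
        curr->se.sum_exec_runtime += delta_task;
        /* exec_start and exec_start_wall would be re-stamped here */

        for_each_sched_rt_entity(rt_se) {
                struct rt_rq *rt_rq = rt_rq_of_se(rt_se);

                if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {
                        raw_spin_lock(&rt_rq->rt_runtime_lock);
                        /* compare wall time, incl. irq time, against rt_runtime */
                        rt_rq->rt_time += delta_wall;
                        if (sched_rt_runtime_exceeded(rt_rq))
                                resched_curr(rq);
                        raw_spin_unlock(&rt_rq->rt_runtime_lock);
                }
        }
}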

BTW, DL already influences rt_rq->rt_time.

[...]