Re: [PATCH v8 4/4] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING

From: Yafang Shao
Date: Thu Jan 09 2025 - 08:50:18 EST


On Thu, Jan 9, 2025 at 6:47 PM Michal Koutný <mkoutny@xxxxxxxx> wrote:
>
> Hello Yafang.
>
> I consider the runtimization you did in the first three patches
> sensible,

The first three patches can be considered as a separate series and are
not directly related to this issue.

Hello Peter,

Since the first three patches have already received many reviews,
could you please apply them first?

> however, this fourth patch is a hard sell.

Seems that way. I'll do my best to advocate for it.

>
> On Fri, Jan 03, 2025 at 10:24:09AM +0800, Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > However, despite adding more threads to handle an increased workload,
> > the CPU usage could not be raised.
>
> (Is that behavior same in both CONFIG_IRQ_TIME_ACCOUNTING and
> !CONFIG_IRQ_TIME_ACCOUNTING cases?)

No, in the case of !CONFIG_IRQ_TIME_ACCOUNTING, CPU usage will
increase as we add more workloads. In other words, this is a
user-visible behavior change, and we should aim to avoid it.

>
> > In other words, even though the container’s CPU usage appeared low, it
> > was unable to process more workloads to utilize additional CPU
> > resources, which caused issues.
>
> Hm, I think this would be worth documenting in the context of
> CONFIG_IRQ_TIME_ACCOUNTING and irq.pressure.

Document it as follows?

"Enabling CONFIG_IRQ_TIME_ACCOUNTING will exclude IRQ usage from the
CPU usage of your tasks. In other words, your task's CPU usage will
only reflect user time and system time."

If we document it clearly this way, I believe no one will try to enable it ;-)

>
> > The CPU usage of the cgroup is relatively low at around 55%, but this usage
> > doesn't increase, even with more netperf tasks. The reason is that CPU0 is
> > at 100% utilization, as confirmed by mpstat:
> >
> > 02:56:22 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> > 02:56:23 PM 0 0.99 0.00 55.45 0.00 0.99 42.57 0.00 0.00 0.00 0.00
> >
> > 02:56:23 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> > 02:56:24 PM 0 2.00 0.00 55.00 0.00 0.00 43.00 0.00 0.00 0.00 0.00
> >
> > It is clear that the %soft is excluded in the cgroup of the interrupted
> > task. This behavior is unexpected. We should include IRQ time in the
> > cgroup to reflect the pressure the group is under.
>
> What is irq.pressure shown in this case?

$ cat irq.pressure ; sleep 1; cat irq.pressure
full avg10=42.96 avg60=39.31 avg300=17.44 total=66518335
full avg10=42.96 avg60=39.31 avg300=17.44 total=66952730

It seems that after a certain point, avg10 is almost identical to
%irq. However, can we simply add avg10 to the CPU utilization? I don't
believe we can.

>
> > The system metric in cpuacct.stat is crucial in indicating whether a
> > container is under heavy system pressure, including IRQ/softirq activity.
> > Hence, IRQ/softirq time should be included in the cpuacct system usage,
> > which also applies to cgroup2’s rstat.
>
> But this only works for you where cgroup's workload induces IRQ on
> itself but generally it'd be confusing (showing some sys time that
> originates out of the cgroup). irq.pressure covers this universally (or
> it should).

It worked well before the introduction of CONFIG_IRQ_TIME_ACCOUNTING.
Why not just maintain the previous behavior, especially since it's not
difficult to do so?

>
> On Fri, Jan 03, 2025 at 10:24:05AM +0800, Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > The load balancer is malfunctioning due to the exclusion of IRQ time from
> > CPU utilization calculations. What's worse, there is no effective way to
> > add the irq time back into the CPU utilization based on current
> > available metrics. Therefore, we have to change the kernel code.
>
> That's IMO what irq.pressure (PSI) should be useful for. Adjusting
> cgroup's CPU usage with irq.pressue (perhaps not as simple as
> multiplication, Johannes may step in here) should tell you info for load
> balancer.

We’re unsure how to use this metric to guide us, and I don't think
there will be clear guidance on how irq.pressure relates to CPU
utilization. :(


--
Regards
Yafang