Re: [PATCH 1/2] softirq: fix integer overflow in function show_stat()

From: Matthew Wilcox
Date: Tue Jul 25 2023 - 11:27:26 EST


On Tue, Jul 25, 2023 at 05:09:05PM +0800, Leizhen (ThunderTown) wrote:
> On 2023/7/25 10:00, Leizhen (ThunderTown) wrote:
> > On 2023/7/24 21:50, Matthew Wilcox wrote:
> >> On Mon, Jul 24, 2023 at 09:22:23PM +0800, thunder.leizhen@xxxxxxxxxxxxxxx wrote:
> >>> From: Zhen Lei <thunder.leizhen@xxxxxxxxxx>
> >>>
> >>> The statistics function of softirq is supported by commit aa0ce5bbc2db
> >>> ("softirq: introduce statistics for softirq") in 2009. At that time,
> >>> 64-bit processors should not have many cores and would not face
> >>> significant count overflow problems. Now it's common for a processor to
> >>> have hundreds of cores. Assume that there are 100 cores and 10
> >>> TIMER_SOFTIRQ are generated per second, then the 32-bit sum will be
> >>> overflowed after 50 days.
> >>
> >> 50 days is long enough to take a snapshot. You should always be using
> >> difference between, not absolute values, and understand that they can
> >> wrap. We only tend to change the size of a counter when it can wrap
> >> sufficiently quickly that we might miss a wrap (eg tens of seconds).
>
> Sometimes it can take a long time to view it again. For example, it is
> possible to run a complete business test for hours or even days, and
> then calculate the average.

I've been part of teams which have done such multi-hour tests. That
isn't how monitoring was performed. Instead snapshots were taken every
minute or even more frequently, because we wanted to know how these
counters were fluctuating during the test -- were there time periods
when the number of sortirqs spiked, or was it constant during the test?

> > Yes, I think patch 2/2 can be dropped. I reduced the number of soft
> > interrupts generated in one second, and actually 100+ or 1000 is normal.
> > But I think patch 1/2 is necessary. The sum of the output scattered values
> > does not match the output sum. To solve this problem, we only need to
> > adjust the type of a local variable.
>
> However, it is important to consider that when the local variable is changed
> to u64, the output string becomes longer. It is not clear if the user-mode
> program parses it only by u32.

There's no need for the numbers to add up. They won't anyway, because
summing them is racy , so they'll always be a little off.