Re: [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead

From: Dave Chinner
Date: Tue Jan 08 2019 - 17:27:26 EST


On Tue, Jan 08, 2019 at 11:58:26AM -0500, Waiman Long wrote:
> On 01/07/2019 09:04 PM, Dave Chinner wrote:
> > On Mon, Jan 07, 2019 at 05:41:39PM -0500, Waiman Long wrote:
> >> On 01/07/2019 05:32 PM, Dave Chinner wrote:
> >>> On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
> > What I was suggesting is that you change the per-cpu counter
> > implementation to the /generic infrastructure/ that solves this
> > problem, and then determine if the extra update overhead is at all
> > measurable. If you can't measure any difference in update overhead,
> > then slapping complexity on the existing counter to attempt to
> > mitigate the summing overhead is the wrong solution.
> >
> > Indeed, it may be that you need o use a custom batch scaling curve
> > for the generic per-cpu coutner infrastructure to mitigate the
> > update overhead, but the fact is we already have generic
> > infrastructure that solves your problem and so the solution should
> > be "use the generic infrastructure" until it can be proven not to
> > work.
> >
> > i.e. prove the generic infrastructure is not fit for purpose and
> > cannot be improved sufficiently to work for this use case before
> > implementing a complex, one-off snowflake counter implementation...
>
> I see your point. I like the deferred summation approach that I am
> currently using. If I have to modify the current per-cpu counter
> implementation to support that

No! Stop that already. The "deferred counter summation" is exactly
the problem the current algorithm has and exactly the problem the
generic counters /don't have/. Changing the generic percpu counter
algorithm to match this specific hand rolled implementation is not
desirable as it will break implementations that rely on the
bound maximum summation deviation of the existing algorithm (e.g.
the ext4 and XFS ENOSPC accounting algorithms).

> and I probably need to add counter
> grouping support to amortize the overhead, that can be a major

The per-cpu counters already have configurable update batching
to amortise the summation update cost across multiple individual
per-cpu updates. You don't need to change the implementation at all,
just tweak the amortisation curve appropriately for the desired
requirements for update scalability (high) vs read accuracy (low).

Let's face it, on large systems where counters are frequently
updated, the resultant sum can be highly inaccurate by the time a
thousand CPU counters have been summed. The generic counters have a
predictable and bound "fast sum" maximum deviation (batch size *
nr_cpus), so over large machines are likely to be much more accurate
than "unlocked summing on demand" algorithms.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx