Re: [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings

From: Hugh Dickins
Date: Wed Aug 05 2020 - 23:02:00 EST

On Mon, 3 Aug 2020, Roman Gushchin wrote:
> On Fri, Jul 31, 2020 at 07:17:05PM -0700, Hugh Dickins wrote:
> > On Fri, 31 Jul 2020, Roman Gushchin wrote:
> > > On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote:
> > > >
> > > > Though another alternative did occur to me overnight: we could
> > > > scrap the logged warning, and show "nr_whatever -53" as output
> > > > from /proc/sys/vm/stat_refresh: that too would be acceptable
> > > > to me, and you redirect to /dev/null.
> > >
> > > It sounds like a good idea to me. Do you want me to prepare a patch?
> >
> > Yes, if you like that one best, please do prepare a patch - thanks!
> Hi Hugh,
> I mastered a patch (attached below), but honestly I can't say I like it.
> The resulting interface is confusing: we don't generally use sysctls to
> print debug data and/or warnings.

Since you confessed to not liking it yourself, I paid it very little
attention. Yes, when I made that suggestion, I wasn't really thinking
of how stat_refresh is a /proc/sys/vm sysctl thing; and I'm not at all
sure how issuing output from a /proc file intended for input works out
(perhaps there are plenty of good examples, and you followed one, but
it smells fishy to me now).

> I thought about treating a write to this sysctls as setting the threshold,
> so that "echo 0 > /proc/sys/vm/stat_refresh" would warn on all negative
> entries, and "cat /proc/sys/vm/stat_refresh" would use the default threshold
> as in my patch. But this breaks to some extent the current ABI, as passing
> an incorrect value will result in -EINVAL instead of passing (as now).

I expect we could handle that well enough, by more lenient validation
of the input; though my comment above on output versus input sheds doubt.

> Overall I still think we shouldn't warn on any values inside the possible
> range, as it's not an indication of any kind of error. The only reason
> why we see some values going negative and some not, is that some of them
> are updated more frequently than others, and some are bouncing around
> zero, while other can't reach zero too easily (like the number of free pages).

We continue to disagree on that (and it amuses me that you who are so
sure they can be ignored, cannot ignore them; whereas I who am so curious
to investigate them, have not actually found the time to do so in years).
It was looking as if nothing could satisfy us both, but...

> Actually, if someone wants to ensure that numbers are accurate,
> we have to temporarily set the threshold to 0, then flush the percpu data
> and only then check atomics. In the current design flushing percpu data
> matters for only slowly updated counters, as all others will run away while
> we're waiting for the flush. So if we're targeting some slowly updating
> counters, maybe we should warn only on them being negative, Idk.

I was going to look into that angle, though it would probably add a little
unjustifiable overhead to fast paths, and be rejected on that basis.

But in going to do so, came up against an earlier comment of yours, of
which I had misunderstood the significance. I had said and you replied:

> > nr_zone_write_pending: yes, I've looked at our machines, and see that
> > showing up for us too (-49 was the worst I saw). Not at all common,
> > but seen. And not followed by increasingly worse numbers, so a state
> > that corrects itself. nr_dirty too (fewer instances, bigger numbers);
> > but never nr_writeback, which you'd expect to go along with those.
> NR_DIRTY and NR_WRITEBACK are node counters, so we don't check them?

Wow. Now I see what you were pointing out: when v4.8's 75ef71840539
("mm, vmstat: add infrastructure for per-node vmstats") went in, it
missed updating vmstat_refresh() to check all the NR_VM_NODE_STAT items.

And I've never noticed, and have interpreted its silence on those items
as meaning they're all good (and the nr_dirty ones I mentioned above,
must have been from residual old kernels, hence the fewer instances).
I see the particularly tricky NR_ISOLATED ones are in that category.
Maybe they are all good, but I have been mistaken.

I shall certainly want to reintroduce those stats to checking for
negatives, even if it's in a patch that never earns your approval,
and just ends up kept internal for debugging. But equally certainly,
I must not suddenly reintroduce that checking without gaining some
experience of it (and perhaps getting as irritated as you by more
transient negatives).

I said earlier that I'd prefer you to rip out all that checking for
negatives, rather than retaining it with the uselessly over-generous
125 * nr_cpus leeway. Please, Roman, would you send Andrew a patch
doing that, to replace the patch in this thread? Or if you prefer,
I can do so.