Re: [PATCH v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings

From: Roman Gushchin
Date: Wed Aug 05 2020 - 23:52:01 EST


On Wed, Aug 05, 2020 at 08:01:33PM -0700, Hugh Dickins wrote:
> On Mon, 3 Aug 2020, Roman Gushchin wrote:
> > On Fri, Jul 31, 2020 at 07:17:05PM -0700, Hugh Dickins wrote:
> > > On Fri, 31 Jul 2020, Roman Gushchin wrote:
> > > > On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote:
> > > > >
> > > > > Though another alternative did occur to me overnight: we could
> > > > > scrap the logged warning, and show "nr_whatever -53" as output
> > > > > from /proc/sys/vm/stat_refresh: that too would be acceptable
> > > > > to me, and you redirect to /dev/null.
> > > >
> > > > It sounds like a good idea to me. Do you want me to prepare a patch?
> > >
> > > Yes, if you like that one best, please do prepare a patch - thanks!
> >
> > Hi Hugh,
> >
> > I drafted a patch (attached below), but honestly I can't say I like it.
> > The resulting interface is confusing: we don't generally use sysctls to
> > print debug data and/or warnings.
>
> Since you confessed to not liking it yourself, I paid it very little
> attention. Yes, when I made that suggestion, I wasn't really thinking
> of how stat_refresh is a /proc/sys/vm sysctl thing; and I'm not at all
> sure how issuing output from a /proc file intended for input works out
> (perhaps there are plenty of good examples, and you followed one, but
> it smells fishy to me now).
>
> >
> > I thought about treating a write to this sysctl as setting the threshold,
> > so that "echo 0 > /proc/sys/vm/stat_refresh" would warn on all negative
> > entries, and "cat /proc/sys/vm/stat_refresh" would use the default threshold
> > as in my patch. But this breaks the current ABI to some extent, as passing
> > an incorrect value would result in -EINVAL instead of being silently
> > accepted (as now).
>
> I expect we could handle that well enough, by more lenient validation
> of the input; though my comment above on output versus input sheds doubt.
>
> >
> > Overall I still think we shouldn't warn on any values inside the possible
> > range, as they're not an indication of any kind of error. The only reason
> > we see some values going negative and some not is that some of them are
> > updated more frequently than others, and some bounce around zero, while
> > others can't reach zero as easily (like the number of free pages).
>
> We continue to disagree on that (and it amuses me that you who are so
> sure they can be ignored, cannot ignore them; whereas I who am so curious
> to investigate them, have not actually found the time to do so in years).
> It was looking as if nothing could satisfy us both, but...

I can only repeat my understanding here: with the current implementation
the measured number can vary within the range
(true_value - zone_threshold * NR_CPUS,
 true_value + zone_threshold * NR_CPUS).
zone_threshold depends on the size of the zone and the number of CPUs,
but cannot exceed 125.
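
For reference, here is roughly where that bound comes from; a lightly
paraphrased version of calculate_normal_threshold() from mm/vmstat.c
(as of ~v5.8; details may differ between kernel versions):

        static int calculate_normal_threshold(struct zone *zone)
        {
                int threshold;
                int mem;        /* memory in 128 MB units */

                mem = zone_managed_pages(zone) >> (27 - PAGE_SHIFT);

                /* grows with the number of CPUs and the zone size */
                threshold = 2 * fls(num_online_cpus()) * (1 + fls(mem));

                /* maximum threshold is 125 */
                threshold = min(125, threshold);

                return threshold;
        }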

Of course, the measured values are most likely distributed close to the
true value, and reaching the distant ends of this range is unlikely.
But it's a matter of probability.

So if the true value is close to 0, there is a high chance of getting
negative measured values. The bigger the value, the lower this chance,
and once the value exceeds the maximal drift, the chance is zero.

So we can be sure that a measured value can't go negative only if we
know for sure that the true value is bigger than
zone_threshold * NR_CPUS.
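
For example, on a hypothetical 64-CPU machine with the maximal per-zone
threshold of 125, the worst-case drift is 125 * 64 = 8000 pages, about
31 MB with 4 KB pages; any counter whose true value can drop below that
can occasionally be observed as negative.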

You could argue that if the chances of getting a negative value are
really, really low, it's better to emit a warning than to miss a
potential error. I'd happily agree if we had a nice formula to
calculate the tolerance for a given probability. But if we treat all
negative numbers as warnings, we'll just end up with a lot of false
warnings.

>
> >
> > Actually, if someone wants to ensure that the numbers are accurate,
> > we have to temporarily set the threshold to 0, then flush the percpu data
> > and only then check the atomics. In the current design, flushing percpu
> > data matters only for slowly updated counters, as all the others will run
> > away while we're waiting for the flush. So if we're targeting some slowly
> > updating counters, maybe we should warn only when they go negative, I
> > don't know.
>
> I was going to look into that angle, though it would probably add a little
> unjustifiable overhead to fast paths, and be rejected on that basis.

I'd expect that too. What I think could be acceptable is a different
tolerance for different counters, if there is a good reason to have more
precise values for some of them.
I did a similar thing in the "new slab controller" patchset for memcg
slab statistics, which required a different threshold because they are
measured in bytes (all other metrics have historically been in pages).
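
If we went down that road, vmstat_refresh() could look up a per-item
tolerance instead of applying one constant to everything. A purely
hypothetical sketch (the helper and its policy are invented for
illustration, not taken from that patchset):

        /*
         * Hypothetical per-item tolerance: counters that realistically
         * never approach zero get zero tolerance (warn on any negative
         * value); everything else gets the maximal possible drift.
         */
        static long vmstat_tolerance(enum zone_stat_item item)
        {
                switch (item) {
                case NR_FREE_PAGES:
                        return 0;
                default:
                        return 125 * num_online_cpus();
                }
        }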

>
> But in going to do so, came up against an earlier comment of yours, of
> which I had misunderstood the significance. I had said and you replied:
>
> > > nr_zone_write_pending: yes, I've looked at our machines, and see that
> > > showing up for us too (-49 was the worst I saw). Not at all common,
> > > but seen. And not followed by increasingly worse numbers, so a state
> > > that corrects itself. nr_dirty too (fewer instances, bigger numbers);
> > > but never nr_writeback, which you'd expect to go along with those.
> >
> > NR_DIRTY and NR_WRITEBACK are node counters, so we don't check them?
>
> Wow. Now I see what you were pointing out: when v4.8's 75ef71840539
> ("mm, vmstat: add infrastructure for per-node vmstats") went in, it
> missed updating vmstat_refresh() to check all the NR_VM_NODE_STAT items.
>
> And I've never noticed, and have interpreted its silence on those items
> as meaning they're all good (and the nr_dirty ones I mentioned above
> must have been from residual old kernels, hence the fewer instances).
> I see the particularly tricky NR_ISOLATED ones are in that category.
> Maybe they are all good, but I have been mistaken.
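
For reference, the check in vmstat_refresh() indeed walks only the zone
array; roughly, as of v5.8 mm/vmstat.c:

        for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
                val = atomic_long_read(&vm_zone_stat[i]);
                if (val < 0) {
                        pr_warn("%s: %s %ld\n",
                                __func__, zone_stat_name(i), val);
                        err = -EINVAL;
                }
        }

There is no equivalent loop over NR_VM_NODE_STAT_ITEMS.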
>
> I shall certainly want to reintroduce those stats to checking for
> negatives, even if it's in a patch that never earns your approval,
> and just ends up kept internal for debugging. But equally certainly,
> I must not suddenly reintroduce that checking without gaining some
> experience of it (and perhaps getting as irritated as you by more
> transient negatives).
>
> I said earlier that I'd prefer you to rip out all that checking for
> negatives, rather than retaining it with the uselessly over-generous
> 125 * nr_cpus leeway. Please, Roman, would you send Andrew a patch
> doing that, to replace the patch in this thread? Or if you prefer,
> I can do so.

Sure, I can do it. But can you please explain why you want them
eliminated? Is it because you think they will never fire?

In my humble opinion, they might be quite useful: any systematic
undercount or overcount will eventually trigger this warning, and we'll
know for sure that something is wrong. So I'd add a similar check for
the node counters without any hesitation; see the sketch below.
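
An untested sketch, mirroring the existing zone loop in vmstat_refresh()
(node_stat_name() and vm_node_stat are the existing node-side
counterparts in mm/vmstat.c):

        for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
                val = atomic_long_read(&vm_node_stat[i]);
                if (val < 0) {
                        pr_warn("%s: %s %ld\n",
                                __func__, node_stat_name(i), val);
                        err = -EINVAL;
                }
        }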

I wouldn't oppose such a patch anyway; I just want to make sure we are
on the same page here.

Thanks!