Re: [PATCH] backing_dev_info: introduce min_bw/max_bw limits

From: Jan Kara
Date: Tue Jun 22 2021 - 08:12:23 EST

On Mon 21-06-21 11:20:10, Michael Stapelberg wrote:
> Hey Miklos
> On Fri, 18 Jun 2021 at 16:42, Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
> >
> > On Fri, 18 Jun 2021 at 10:31, Michael Stapelberg
> > <stapelberg+linux@xxxxxxxxxx> wrote:
> >
> > > Maybe, but I don’t have the expertise, motivation or time to
> > > investigate this any further, let alone commit to get it done.
> > > During our previous discussion I got the impression that nobody else
> > > had any cycles for this either:
> > >
> > >
> > > Have you had a look at the China LSF report at
> > >
> > > The author of the heuristic has spent significant effort and time
> > > coming up with what we currently have in the kernel:
> > >
> > > """
> > > Fengguang said he draw more than 10K performance graphs and read even
> > > more in the past year.
> > > """
> > >
> > > This implies that making changes to the heuristic will not be a quick fix.
> >
> > Having a piece of kernel code sitting there that nobody is willing to
> > fix is certainly not a great situation to be in.
> Agreed.
> >
> > And introducing band aids is not going to improve the above situation,
> > more likely it will prolong it even further.
> Sounds like “Perfect is the enemy of good” to me: you’re looking for a
> perfect hypothetical solution, whereas we have a known-working low risk
> fix for a real problem.
> Could we find a solution where medium-/long-term, the code in question
> is improved, perhaps via a Summer Of Code project or similar community
> efforts, but until then, we apply the patch at hand?
> As I mentioned, I think adding min/max limits can be useful regardless
> of how the heuristic itself changes.
> If that turns out to be incorrect or undesired, we can still turn the
> knobs into a no-op, if removal isn’t an option.

Well, removal of added knobs is more or less out of the question as it can
break some userspace. Similarly making them no-op is problematic unless we
are pretty certain it cannot break some existing setup. That's why we have
to think twice (or better three times ;) before adding any knobs. Also
honestly the knobs you suggest will be pretty hard to tune when there are
multiple cgroups with writeback control involved (which can be affected by
the same problems you observe as well). So I agree with Miklos that this is
not the right way to go. Speaking of tunables, did you try tuning
/sys/devices/virtual/bdi/<fuse-bdi>/min_ratio? I suspect that may
work around your problems...
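
For illustration, a minimal sketch of setting that tunable from a script.
The helper name set_bdi_min_ratio and the value used below are made up for
this example; min_ratio takes a percentage, and writing it on a real system
needs root. <fuse-bdi> stays a placeholder for the actual bdi directory.

```python
from pathlib import Path


def set_bdi_min_ratio(bdi_dir, percent):
    """Write `percent` to <bdi_dir>/min_ratio and return what reads back.

    Hypothetical helper: bdi_dir is the directory of one bdi, e.g. the
    /sys/devices/virtual/bdi/<fuse-bdi> path mentioned above.
    """
    if not 0 <= percent <= 100:
        raise ValueError("min_ratio is a percentage between 0 and 100")
    knob = Path(bdi_dir) / "min_ratio"
    knob.write_text(f"{percent}\n")
    # Read the value back so the caller can see what was stored.
    return knob.read_text().strip()
```

Usage would look like set_bdi_min_ratio("/sys/devices/virtual/bdi/<fuse-bdi>", 5),
where 5 is an arbitrary starting point, not a recommendation.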

Looking into your original report and the tracing you did (thanks for that,
really useful), it seems the problem is that the writeback bandwidth is
updated at most once every 200ms (more frequent calls are just ignored), and
updates are triggered only from balance_dirty_pages() (which runs when pages
are dirtied) and from the inode writeback code. So if the workload tends to
have short spikes of activity followed by extended periods of quiet time,
the writeback bandwidth may indeed be seriously miscomputed, because we just
won't update the writeback throughput after most of the writeback has
happened, as you observed.
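
The effect described above can be reproduced with a toy model. This is only
a sketch and not the kernel's algorithm: the class BandwidthEstimator, the
plain pages-per-second formula without any smoothing, and all the numbers
in the trace are assumptions made for illustration. Only the 200ms rate
limit comes from the description above.

```python
UPDATE_INTERVAL_MS = 200  # rate limit described in the text


class BandwidthEstimator:
    """Toy estimator: pages written since the last update / elapsed time."""

    def __init__(self):
        self.last_update_ms = 0
        self.written_at_last_update = 0
        self.bandwidth = 0.0  # pages per second

    def maybe_update(self, now_ms, total_written):
        elapsed = now_ms - self.last_update_ms
        if elapsed < UPDATE_INTERVAL_MS:
            return False  # called too soon: the update is ignored
        delta = total_written - self.written_at_last_update
        self.bandwidth = delta * 1000.0 / elapsed
        self.last_update_ms = now_ms
        self.written_at_last_update = total_written
        return True


# Hypothetical trace of (time in ms, total pages written so far).
# A burst writes 10000 pages between t=1000 and t=1100, then the
# workload is quiet until the next page is dirtied at t=11000.
calls = [(1000, 0), (1050, 5000), (1100, 10000), (11000, 10000)]

est = BandwidthEstimator()
accepted = [est.maybe_update(t, written) for t, written in calls]

# Throughput the device actually delivered during the burst.
actual_burst_bw = 10000 * 1000.0 / 100
```

In this trace the two calls made during the burst are dropped by the rate
limit, so the only sample that sees the written pages is taken after ten
seconds of idle time and averages the burst over the whole quiet period.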

I think the fix for this can be relatively simple. We just need to make
sure we update writeback bandwidth reasonably quickly after the IO
finishes. I'll write a patch and see if it helps.
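
To make the idea concrete, here is a toy comparison of the estimate with
and without an update when the IO finishes. It is a sketch under loud
assumptions: the function estimate(), the is_completion flag, and the
choice to let a completion bypass the 200ms limit are all invented for
this example and say nothing about how the actual patch will look.

```python
def estimate(calls, update_on_completion):
    """Return the last bandwidth estimate in pages per second.

    calls: iterable of (time_ms, total_pages_written, is_completion).
    Toy model only; the kernel's estimator also smooths its samples.
    """
    last_ms, last_written, bandwidth = 0, 0, 0.0
    for now_ms, written, is_completion in calls:
        elapsed = now_ms - last_ms
        # Assumption: an IO completion may force an update early.
        forced = update_on_completion and is_completion
        if elapsed <= 0 or (elapsed < 200 and not forced):
            continue  # rate limited: this call is ignored
        bandwidth = (written - last_written) * 1000.0 / elapsed
        last_ms, last_written = now_ms, written
    return bandwidth


# Hypothetical trace: pages are dirtied at t=1000, and the writeback of
# 10000 pages finishes at t=1100, after which the workload goes quiet.
trace = [(1000, 0, False), (1100, 10000, True)]

stale = estimate(trace, update_on_completion=False)
fresh = estimate(trace, update_on_completion=True)
```

Without the completion-triggered update the estimate never sees the burst;
with it, the sample is taken over the 100ms in which the pages were written.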

Jan Kara <jack@xxxxxxxx>