Re: [PATCH 00/14][V5] Introduce io.latency io controller for cgroups

From: Josef Bacik
Date: Mon Jul 02 2018 - 17:38:15 EST

On Mon, Jul 02, 2018 at 02:26:39PM -0700, Andrew Morton wrote:
> On Fri, 29 Jun 2018 15:25:28 -0400 Josef Bacik <josef@xxxxxxxxxxxxxx> wrote:
> > This series adds a latency based io controller for cgroups. It is based on the
> > same concept as the writeback throttling code, which is watching the overall
> > total latency of IO's in a given window and then adjusting the queue depth of
> > the group accordingly. This is meant to be a workload protection controller, so
> > whoever has the lowest latency target gets the preferential treatment with no
> > thought to fairness or proportionality. It is meant to be work conserving, so
> > as long as nobody is missing their latency targets the disk is fair game.
> >
> > We have been testing this in production for several months now to get the
> > behavior right and we are finally at the point that it is working well in all of
> > our test cases. With this patch we protect our main workload (the web server)
> > and isolate out the system services (chef/yum/etc). This works well in the
> > normal case, smoothing out weird request per second (RPS) dips that we would see
> > when one of the system services would run and compete for IO resources. This
> > also works incredibly well in the runaway task case.
> >
> > The runaway task usecase is where we have some task that slowly eats up all of
> > the memory on the system (think a memory leak). Previously this sort of
> > workload would push the box into a swapping/oom death spiral that was only
> > recovered by rebooting the box. With this patchset and proper configuration of
> > the memory.low and io.latency controllers we're able to survive this test with a
> > at most 20% dip in RPS.
> Is this purely useful for spinning disks, or is there some
> applicability to SSDs and perhaps other storage devices? Some
> discussion on this topic would be useful.

Yes, we're using this on both SSDs and spinning rust. It would work on any
storage device; you just have to adjust your latency targets accordingly.
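
To give a rough idea of what "adjusting your latency targets" looks like in
practice, here is a minimal sketch, not part of the series. It assumes the
"MAJOR:MINOR target=<microseconds>" format described in the cgroup v2
documentation for io.latency. The helper names, the cgroup path and the target
values are made up for illustration and would need tuning per device.

```python
import os


def io_latency_line(major, minor, target_us):
    """Build one io.latency config line: "MAJOR:MINOR target=<usec>"."""
    if target_us <= 0:
        raise ValueError("target must be a positive number of microseconds")
    return f"{major}:{minor} target={target_us}"


def set_io_latency(cgroup_dir, major, minor, target_us):
    """Write a latency target for one block device to <cgroup_dir>/io.latency."""
    path = os.path.join(cgroup_dir, "io.latency")
    with open(path, "w") as f:
        f.write(io_latency_line(major, minor, target_us))
    return path


# Hypothetical targets: tighter for an SSD, looser for a spinning disk.
EXAMPLE_TARGETS_US = {"ssd": 10_000, "hdd": 50_000}

# Example use (needs root and a cgroup v2 hierarchy; the path is an assumption):
#   set_io_latency("/sys/fs/cgroup/workload", 8, 0, EXAMPLE_TARGETS_US["ssd"])
```

The group with the lowest target is the one that gets protected, so the
protected workload gets the tight target and everything else gets a looser one
or none at all.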

> Patches 5, 7 & 14 look fine to me - go wild. #14 could do with a
> couple of why-we're-doing-this comments, but I say that about
> everything ;)

So that one was fun. Our test has the main workload running in the protected
group and all the system-specific stuff in an unprotected group, and then we run
a memory hog in the system group. Obviously this results in everybody dumping
all caches first, including pages for the binaries themselves. Then when the
applications go to run they incur a page fault, which trips readahead. If we're
throttling, this means we'll sit in the page fault handler for a good long while.
Who cares, right? Well, apparently the main workload cares, because it talks to
some daemon about the current memory on the system so it can make intelligent
adjustments to its allocation strategies. The daemon it talks to also gathers a
bunch of other statistics, and does things like 'ps', which goes and walks
/proc/<pid>, which has entries that wait on mmap_sem. So suddenly being blocked
in readahead means we have weird latency spikes, because we're holding the
mmap_sem the whole time. So instead we want to just skip readahead, so that we
get throttled as little as possible while holding our mmap_sem.
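
The decision itself boils down to something like the toy model below. This is
only an illustration of the idea, not the kernel code: the function name and
parameters are hypothetical, and real readahead (sync/async windows, ramp-up)
is considerably more involved.

```python
def pages_to_read(fault_page, ra_window, throttled):
    """Return the page indexes to read for a fault at fault_page.

    Toy model only: when the group is being throttled we skip readahead
    and read just the faulting page, so less IO is issued (and less time
    is spent blocked) while mmap_sem is held.
    """
    if ra_window < 1:
        raise ValueError("ra_window must be at least 1")
    if throttled:
        # Skip readahead entirely: only the page we actually faulted on.
        return [fault_page]
    # Normal case: read the faulting page plus the rest of the window.
    return list(range(fault_page, fault_page + ra_window))
```

In the unthrottled case nothing changes; the shortcut only kicks in when the
extra readahead IO would have been throttled anyway.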

The inflight READA bios also need to be aborted, and I have a patch for that as
well, but it depends on Jens' READA abort patches that he's still working on, so
that part will come after his stuff is ready. Thanks,