Re: [PATCH 7/8] wbt: add general throttling mechanism

From: Jan Kara
Date: Tue May 03 2016 - 05:34:22 EST


On Thu 28-04-16 12:53:50, Jens Axboe wrote:
> >2) As far as I can see in patch 8/8, you have plugged the throttling above
> > the IO scheduler. When there are e.g. multiple cgroups with different IO
> > limits operating, this throttling can lead to strange results (like a
> > cgroup with low limit using up all available background "slots" and thus
> > effectively stopping background writeback for other cgroups)? So won't
> > it make more sense to plug this below the IO scheduler? Now I understand
> > there may be other problems with this but I think we should put more
> > thought to that and provide some justification in changelogs.
>
> One complexity is that we have to do this early for blk-mq, since once you
> get a request, you're already sitting on the hw tag. CoDel should actually
> work fine at each hop, so hopefully this will as well.

OK, I see. But then this suggests that any IO scheduling and / or
cgroup-related throttling should happen before we get a request for blk-mq
as well? And then we can still do writeback throttling below that layer?

> But yes, fairness is something that we have to pay attention to. Right now
> the wait queue has no priority associated with it, that should probably be
> improved to be able to wakeup in a more appropriate order.
> Needs testing, but hopefully it works out since if you do run into
> starvation, then you'll go to the back of the queue for the next attempt.

Yeah, once I hunt down that regression with the old disk, I can have a look
into how writeback throttling plays together with the blkio controller.

> >>+static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
> >>+{
> >>+	u64 thislat;
> >>+
> >>+	/*
> >>+	 * If our stored sync issue exceeds the window size, or it
> >>+	 * exceeds our min target AND we haven't logged any entries,
> >>+	 * flag the latency as exceeded.
> >>+	 */
> >>+	thislat = rwb_sync_issue_lat(rwb);
> >>+	if (thislat > rwb->cur_win_nsec ||
> >>+	    (thislat > rwb->min_lat_nsec && !stat[0].nr_samples)) {
> >>+		trace_wbt_lat(rwb->bdi, thislat);
> >>+		return LAT_EXCEEDED;
> >>+	}
> >
> >So I'm trying to wrap my head around this. If I read the code right,
> >rwb_sync_issue_lat() returns the time that has passed since issuing a sync
> >request that is still running. We basically randomly pick which sync
> >request we track, as we always start tracking a sync request when one is
> >issued and we are not tracking any at that moment. This is to detect the
> >case when the latency of sync IO is very large compared to the measurement
> >window, so we would not get enough samples to make it valid?
>
> Right, that's pretty close. Since wbt uses the completion latencies to make
> decisions, if an IO hasn't completed, we don't know about it. If the device
> is flooded with writes, and we then issue a read, maybe that read won't
> complete for multiple monitoring windows. During that time, we keep thinking
> everything is fine. But in reality, it's not completing because of the write
> load. So this logic attempts to track the single sync IO request case. If
> that exceeds a monitoring window of time and we saw no other sync IO in that
> window, then treat that case as if it had completed but exceeded the min
> latency. And then scale back.
>
> We'll always treat a stat sample with 1 read as valuable, but for this
> case, we don't have that sample until it completes.
>
> Does that make more sense?

OK, makes sense. Thanks for the explanation.
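
To check my understanding of the stalled sync request case, here is a
minimal userspace sketch of the check. Only the condition itself is taken
from the hunk quoted above; the sketch_* names, the struct layouts, the
explicit "now" argument and the use of 0 for "nothing tracked" are my own
assumptions for illustration and not the patch's code.

```c
#include <assert.h>
#include <stdint.h>

/* Return values mirror the names used in the quoted hunk. */
enum { LAT_OK = 0, LAT_EXCEEDED = 1 };

/* Hypothetical stand-in for the read entry of struct blk_rq_stat. */
struct sketch_stat {
	uint64_t nr_samples;	/* sync completions seen in this window */
};

/* Hypothetical stand-in for struct rq_wb. */
struct sketch_rwb {
	uint64_t cur_win_nsec;	/* current monitoring window */
	uint64_t min_lat_nsec;	/* minimum latency target */
	uint64_t sync_issue;	/* issue time of the tracked sync request,
				 * 0 if none is tracked (assumption) */
};

/* Age of the tracked, still running sync request; 0 if none is tracked. */
static uint64_t sketch_sync_issue_lat(const struct sketch_rwb *rwb,
				      uint64_t now)
{
	assert(rwb);
	if (!rwb->sync_issue || now < rwb->sync_issue)
		return 0;
	return now - rwb->sync_issue;
}

/*
 * Flag the latency as exceeded if the tracked request is older than the
 * window, or older than the min target while no sync IO has completed.
 */
static int sketch_latency_exceeded(const struct sketch_rwb *rwb,
				   const struct sketch_stat *read_stat,
				   uint64_t now)
{
	uint64_t thislat;

	assert(read_stat);
	thislat = sketch_sync_issue_lat(rwb, now);
	if (thislat > rwb->cur_win_nsec ||
	    (thislat > rwb->min_lat_nsec && !read_stat->nr_samples))
		return LAT_EXCEEDED;
	return LAT_OK;
}
```

So a single read stuck behind a flood of writes trips the second half of
the condition even though no completion has been logged for it yet, which
is the case the completion statistics alone would miss.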

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR