Re: [RFC 2/3] blk-throttling: weight based throttling

From: Shaohua Li
Date: Thu Jan 21 2016 - 16:01:08 EST


On Thu, Jan 21, 2016 at 03:33:32PM -0500, Vivek Goyal wrote:
> On Wed, Jan 20, 2016 at 09:49:18AM -0800, Shaohua Li wrote:
> > We know total bandwidth of a disk and can calculate cgroup's bandwidth
> > percentage against disk bandwidth according to its weight. We can easily
> > calculate cgroup bandwidth.
> >
> > Signed-off-by: Shaohua Li <shli@xxxxxx>
> > ---
> > block/blk-throttle.c | 135 ++++++++++++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 134 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> > index 2149a1d..b3f847d 100644
> > --- a/block/blk-throttle.c
> > +++ b/block/blk-throttle.c
> > @@ -12,6 +12,9 @@
> > #include <linux/blk-cgroup.h>
> > #include "blk.h"
> >
> > +#define MAX_WEIGHT (1000)
> > +#define WEIGHT_RATIO_SHIFT (12)
> > +#define WEIGHT_RATIO (1 << WEIGHT_RATIO_SHIFT)
> > /* Max dispatch from a group in 1 round */
> > static int throtl_grp_quantum = 8;
> >
> > @@ -74,6 +77,10 @@ struct throtl_service_queue {
> > unsigned int nr_pending; /* # queued in the tree */
> > unsigned long first_pending_disptime; /* disptime of the first tg */
> > struct timer_list pending_timer; /* fires on first_pending_disptime */
> > +
> > + unsigned int weight;
> > + unsigned int children_weight;
> > + unsigned int ratio;
>
> Will it be better to call it "share" instead of "ratio". It is basically
> a measure of % disk share of the group and share seems more intuitive.

Ok
>
> [..]
> > +static void tg_update_bps(struct throtl_grp *tg)
> > +{
> > + struct throtl_service_queue *sq, *parent_sq;
> > +
> > + sq = &tg->service_queue;
> > + parent_sq = sq->parent_sq;
> > +
> > + if (!tg->td->weight_based || !parent_sq)
> > + return;
> > + sq->ratio = max_t(unsigned int,
> > + parent_sq->ratio * sq->weight / parent_sq->children_weight,
> > + 1);
> > +
>
> It might be good to decouple updation of "share/ratio" and updation of
> bps. Change of share can happen any time either weight is changed or
> an active group is queue/dequeued and we don't have to do it every time
> a bio is submitted.

Ok
> > + tg->bps[READ] = max_t(uint64_t,
> > + (queue_bandwidth(tg->td, READ) * sq->ratio) >>
> > + WEIGHT_RATIO_SHIFT,
> > + 1024);
> > + tg->bps[WRITE] = max_t(uint64_t,
> > + (queue_bandwidth(tg->td, WRITE) * sq->ratio) >>
> > + WEIGHT_RATIO_SHIFT,
> > + 1024);
> > +}
> > +
> > +static void tg_update_ratio(struct throtl_grp *tg)
> > +{
> > + struct throtl_data *td = tg->td;
> > + struct cgroup_subsys_state *pos_css;
> > + struct blkcg_gq *blkg;
> > +
> > + blkg_for_each_descendant_pre(blkg, pos_css, td->queue->root_blkg) {
>
> Is it possible to traverse only the affected subtree instead of whole
> tree of groups. Because if weight is updated on a group, then we just
> need to traverse the subtree under that group's parent.

makes sense
> [..]
> > @@ -1415,6 +1546,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
> >
> > sq = &tg->service_queue;
> >
> > + tg_update_bps(tg);
>
> Updating bps for every bio submitted sounds like a lot. We probably could
> do it when first bio gets queued in the group and then refresh it at
> some regular interval. Say when next set of dispatch happens from group
> we could update bandwidth of group after dispatch.

That calculation isn't very heavy. I'll revisit this if it's a problem.

Thanks,
Shaohua