Re: [RFC v5 PATCH 0/8] CFS Hard limits - v5

From: Bharata B Rao
Date: Mon Feb 01 2010 - 03:22:14 EST


On Thu, Jan 28, 2010 at 08:26:08PM -0800, Paul Turner wrote:
> On Thu, Jan 28, 2010 at 7:49 PM, Bharata B Rao <bharata.rao@xxxxxxxxx> wrote:
> > On Sat, Jan 9, 2010 at 2:15 AM, Paul Turner <pjt@xxxxxxxxxx> wrote:
> >>
> >> What are your thoughts on using a separate mechanism for the general case.  A
> >> draft proposal follows:
> >>
> >> - Maintain a global run-time pool for each tg.  The runtime specified by the
> >>  user represents the value that this pool will be refilled to each period.
> >> - We continue to maintain the local notion of runtime/period in each cfs_rq,
> >>  continue to accumulate locally here.
> >>
> >> Upon locally exceeding the period acquire new credit from the global pool
> >> (either under lock or more likely using atomic ops).  This can either be in
> >> fixed steppings (e.g. 10ms, could be tunable) or following some quasi-curve
> >> variant with historical demand.
> >>
> >> One caveat here is that there is some over-commit in the system: the local
> >> differences of runtime vs period represent additional runtime beyond the global
> >> pool.
> >> However it should not be possible to consistently exceed limits since the rate
> >> of refill is gated by the runtime being input into the system via the per-tg
> >> pool.
> >>
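
A minimal userspace sketch of the two-level scheme described above -- the
names, the fixed 10ms slice and the simplified serialization are illustrative
assumptions, not the actual patch:

#include <stdbool.h>

#define NSEC_PER_MSEC 1000000ULL

/* One per task group. */
struct tg_pool {
        unsigned long long quota;   /* runtime the pool is refilled to each period */
        unsigned long long runtime; /* runtime currently left in the global pool */
};

/* One per cfs_rq, i.e. per cpu. */
struct local_pool {
        unsigned long long runtime; /* locally accumulated runtime */
};

static const unsigned long long slice = 10 * NSEC_PER_MSEC;

/* A cfs_rq that has used up its local runtime tries to pull a fixed slice. */
static bool acquire_slice(struct tg_pool *tg, struct local_pool *local)
{
        /* Serialized by a tg-wide lock or atomic ops in a real implementation. */
        unsigned long long grant = tg->runtime < slice ? tg->runtime : slice;

        tg->runtime -= grant;
        local->runtime += grant;

        return grant > 0; /* false: global pool is empty, throttle this cfs_rq */
}

/* Period timer: refill the global pool back up to the configured quota. */
static void refill_period(struct tg_pool *tg)
{
        tg->runtime = tg->quota;
}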
> >
> > We borrow from what is actually available as spare (spare = unused or
> > remaining). With a global pool, I see that would be difficult. Is the
> > inability/difficulty of keeping the global pool in sync with the actual
> > available spare time the reason for the over-commit?
> >
>
> We maintain two pools, a global pool (new) and a per-cfs_rq pool
> (similar to existing rt_bw).
>
> When consuming time you charge vs your local bandwidth until it is
> expired, at this point you must either refill from the global pool, or
> throttle.
>
> The "slack" in the system is the sum of unconsumed time in local pools
> from the *previous* global pool refill. This is bounded above by the
> size of time you refill a local pool at each expiry. We call the size
> of refill a 'slice'.
>
> e.g.
>
> Task limit of 50ms, slice=10ms, 4cpus, period of 500ms
>
> Task A runs on cpus 0 and 1 for 5ms each, then blocks.
>
> When A first executes on each cpu we take slice=10ms from the global
> pool of 50ms and apply it to the local rq. Execution then proceeds vs
> local pool.
>
> Current state is: 5 ms in local pools on {0,1}, 30ms remaining in global pool
>
> Upon period expiration we issue a global pool refill. At this point we have:
> 5 ms in local pools on {0,1}, 50ms remaining in global pool.
>
> That 10ms of slack time is over-commit in the system. However, it
> should be clear that this can only be a local effect, since over any
> period of time the rate of input into the system is limited by the
> global pool refill rate.
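
Working through the bound in this example: each cpu that acquired runtime can
be left holding at most roughly one unconsumed slice, so the slack is at most
2 cpus * 10ms = 20ms; the actual slack is 2 * (10ms - 5ms) = 10ms, while the
long-run rate stays capped at 50ms per 500ms period.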

With the same setup as above, consider 5 such tasks which block after
consuming 5ms each. So now we have 25ms of slack time. If 5 cpu hogs start
running in the next bandwidth period, they would consume this 25ms plus the
50ms from that period. So we gave 50% extra to a group in a bandwidth period.
Just wondering how common such scenarios could be.
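(Worked out: 5 tasks x 5ms of unused slice = 25ms of slack; together with the
50ms refill that is 75ms consumable in a single 500ms period, i.e. 1.5x the
configured 50ms limit.)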

>
> There are also some strategies that we are exploring to improve
> behavior further here. One idea is that if we maintain a generation
> counter, then on voluntary dequeue (e.g. when tasks block) we can return
> local time to the global period pool, or expire it if the generations
> don't match. This greatly reduces the noise (slack vs the ideal limit)
> that a bursty application can induce.
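
As an illustrative userspace sketch of that generation idea (hypothetical
names, not anything from the patches):

/* Per task group. */
struct global_bw {
        unsigned long long runtime;
        unsigned long long generation; /* bumped on every period refill */
};

/* Per cfs_rq. */
struct local_bw {
        unsigned long long runtime;
        unsigned long long generation; /* generation the local runtime was taken in */
};

/* Called on voluntary dequeue, i.e. when the group's tasks block on this cpu. */
static void return_local_runtime(struct global_bw *g, struct local_bw *l)
{
        if (l->generation == g->generation)
                g->runtime += l->runtime; /* still this period's time: give it back */
        /* otherwise the time is from an old period and simply expires */
        l->runtime = 0;
}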

Why not clear the remaining runtime during bandwidth refresh?

>
> >> This would also naturally associate with an interface change that would mean the
> >> runtime limit for a group would be the effective cpurate within the period.
> >>
> >> e.g. by setting a runtime of 200000us on a 100000us period it would effectively
> >> allow you to use 2 cpus worth of wall-time on a multicore system.
> >>
> >> I feel this is slightly more natural than the current definition which due to
> >> being local means that values set will not result in consistent behavior across
> >> machines of different core counts.  It also has the benefit of being consistent
> >> with observed exports of time consumed, e.g. rusage, (indirectly) time, etc.
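
To make the proposed definition concrete: with a 100000us period, a runtime of
100000us is one cpu's worth of time, 200000us is two, and saturating a 64-core
machine would take 6400000us; the meaning of a given runtime value does not
change with the machine's core count.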
> >
> > Though runtimes are enforced locally per-cpu, that's only the
> > implementation. The definition of runtime and period is still
> > system-wide/global. A runtime/period of 0.25/0.5 will mean 0.25s of
> > system-wide runtime within a period of 0.5s. Talking about a consistent
> > definition, I would say this consistently defines half of the
> > system-wide wall-time on all configurations :)
>
> This feels non-intuitive: suppose you have a non-homogeneous fleet of
> systems. It is also difficult to express limits in terms of cores:
> suppose I'm an admin trying to jail my users (maybe I rent out virtual
> time a la EC2, for example). The fractions I have to use to represent
> integer core amounts are going to become quite small on large systems.
> For example, 1 core on a 64 core system would mean 3.905ms/250ms
> period. Also, what's the dependency here between your time and the
> current cpuset topology? If I'm only allowed on half the system, does
> this fraction then refer to global resources or to what I'm constrained
> to? This seems a difficult data dependency to manage.
>
> My (personal) ideology is that we are metering at the cpu level as
> opposed to the system level -- which means N seconds of cpu-time makes
> more sense to me. I feel it has advantages in that it can be
> specified more directly relative to the period and is independent of
> system partitioning.
>
> I'd be interested to hear other opinions on this.

We need a consensus here; I will wait to see what others think about this.

>
> > If it means 2 CPUs' worth of wall-time
> > on a 4-core machine, it would mean 4 CPUs on an 8-CPU machine.  At this
> > point, I am inclined to go with this and let the admins/tools work out
> > the actual CPUs part of it. However I would like to hear what others
> > think about this interface.
> >
> >>
> >> For future scalability as machine size grows this could potentially be
> >> partitioned below the tg level along the boundaries of sched_domains (or
> >> something similar).  However for an initial draft given current machine sizes
> >> the contention on the global pool should hopefully be fairly low.
> >
> > One of the alternatives I have in mind is to be more aggressive while
> > borrowing. While keeping the current algorithm (of iterating through all
> > CPUs when borrowing) intact, we could potentially borrow more from
> > those CPUs which don't have any running task from the given group. I
> > just experimented with borrowing half of the available runtime from
> > such CPUs and found that the number of iterations is greatly reduced and
> > the source runtime quickly converges to its max possible value. Do you
> > see any issues with this?
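
A rough userspace model of that more aggressive borrowing -- the per-cpu
array, the fixed NR_CPUS and the 1/NR_CPUS share taken from busy cpus are
illustrative assumptions, not the code from the patchset:

#include <stdbool.h>

#define NR_CPUS 64

struct cpu_runtime {
        unsigned long long runtime; /* spare runtime left on this cpu */
        bool group_running;         /* does this cpu run a task of the group? */
};

/*
 * Borrow up to 'want' runtime for cpu 'dst': take half of the spare time from
 * cpus that run nothing from this group, and only a small share from the rest.
 */
static unsigned long long borrow_runtime(struct cpu_runtime cpus[], int dst,
                                         unsigned long long want)
{
        unsigned long long got = 0;

        for (int i = 0; i < NR_CPUS && got < want; i++) {
                unsigned long long diff;

                if (i == dst)
                        continue;

                diff = cpus[i].group_running ? cpus[i].runtime / NR_CPUS
                                             : cpus[i].runtime / 2;
                if (diff > want - got)
                        diff = want - got;

                cpus[i].runtime -= diff;
                got += diff;
        }

        cpus[dst].runtime += got;
        return got;
}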
> >
>
> I strongly believe that this is going to induce significant lock
> contention and is not a scalable solution over time. While using a
> faster-converging series for time may help, I think there are
> confounding factors that limit the effect here. Consider the 1 core on a
> 64 core system example above. With only 3.905ms per pool we are going
> to quickly hit the case where we are borrowing not-useful periods of
> time while thrashing locks.
>
> We are in the midst of an implementation of the proposal above which
> we'll have ready to post here for consideration next week. We have
> maintained your existing approach with respect to handling throttled
> entities and layered on top of that the proposed alternate
> local/global bandwidth scheme. Initial tests show very promising
> results!

Nice. Looking forward to your patches.

Regards,
Bharata.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/