Re: [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision
From: Phil Auld
Date: Mon Oct 07 2019 - 09:02:44 EST
Hi Xuewei,
On Fri, Oct 04, 2019 at 05:28:15PM -0700 Xuewei Zhang wrote:
> On Fri, Oct 4, 2019 at 6:14 AM Phil Auld <pauld@xxxxxxxxxx> wrote:
> >
> > On Thu, Oct 03, 2019 at 07:05:56PM -0700 Xuewei Zhang wrote:
> > > +cc neelnatu@xxxxxxxxxx and haoluo@xxxxxxxxxx, they helped a lot
> > > for this issue. Sorry I forgot to include them when sending out the patch.
> > >
> > > On Thu, Oct 3, 2019 at 5:55 PM Phil Auld <pauld@xxxxxxxxxx> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Thu, Oct 03, 2019 at 05:12:43PM -0700 Xuewei Zhang wrote:
> > > > > quota/period ratio is used to ensure a child task group won't get more
> > > > > bandwidth than the parent task group, and is calculated as:
> > > > > normalized_cfs_quota() = [(quota_us << 20) / period_us]
> > > > >
> > > > > If the quota/period ratio was changed during this scaling due to
> > > > > precision loss, it will cause inconsistency between parent and child
> > > > > task groups. See below example:
> > > > > A userspace container manager (kubelet) does three operations:
> > > > > 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
> > > > > 2) Create a few children cgroups.
> > > > > 3) Set quota to 1,000us and period to 10,000us on a child cgroup.
> > > > >
> > > > > These operations are expected to succeed. However, if the scaling of
> > > > > 147/128 happens before step 3), quota and period of the parent cgroup
> > > > > will be changed:
> > > > > new_quota: 1148437ns, 1148us
> > > > > new_period: 11484375ns, 11484us
> > > > >
> > > > > And when step 3) comes in, the ratio of the child cgroup will be 104857,
> > > > > which will be larger than the parent cgroup ratio (104821), and will
> > > > > fail.
> > > > >
> > > > > Scaling them by a factor of 2 will fix the problem.
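
For anyone following the arithmetic above, here is a minimal sketch in plain userspace C. It is not the kernel code: the helper names are made up for illustration, and it assumes the ns to us conversion simply truncates, as the numbers in the description suggest.

```c
#include <stdint.h>

/* Fixed-point ratio in the form quoted above: (quota_us << 20) / period_us. */
uint64_t normalized_ratio(uint64_t quota_us, uint64_t period_us)
{
	return (quota_us << 20) / period_us;
}

/*
 * Hypothetical helper: mimic the old 147/128 scaling on nanosecond values,
 * then return the ratio computed from the truncated microsecond values.
 */
uint64_t ratio_after_147_128(uint64_t quota_ns, uint64_t period_ns)
{
	uint64_t new_period = (period_ns * 147) / 128;          /* 11484375 for 10ms */
	uint64_t new_quota = (quota_ns * new_period) / period_ns; /* 1148437 for 1ms */

	return normalized_ratio(new_quota / 1000, new_period / 1000);
}

/* Hypothetical helper: same, but doubling both values as the patch proposes. */
uint64_t ratio_after_doubling(uint64_t quota_ns, uint64_t period_ns)
{
	return normalized_ratio((quota_ns * 2) / 1000, (period_ns * 2) / 1000);
}
```

With quota 1,000,000ns and period 10,000,000ns, the first helper yields 104821 for the scaled parent while an unscaled child at 1,000us/10,000us yields 104857, so the child looks larger than the parent. Doubling leaves the ratio at 104857.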
> > > >
> > > > I have no issues with the concept. We went around a bit about the actual
> > > > numbers and made it an approximation.
> > > >
> > > > >
> > > > > Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
> > > > > Signed-off-by: Xuewei Zhang <xueweiz@xxxxxxxxxx>
> > > > > ---
> > > > > kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
> > > > > 1 file changed, 22 insertions(+), 14 deletions(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index 83ab35e2374f..b3d3d0a231cd 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -4926,20 +4926,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> > > > > if (++count > 3) {
> > > > > u64 new, old = ktime_to_ns(cfs_b->period);
> > > > >
> > > > > - new = (old * 147) / 128; /* ~115% */
> > > > > - new = min(new, max_cfs_quota_period);
> > > > > -
> > > > > - cfs_b->period = ns_to_ktime(new);
> > > > > -
> > > > > - /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
> > > > > - cfs_b->quota *= new;
> > > > > - cfs_b->quota = div64_u64(cfs_b->quota, old);
> > > > > -
> > > > > - pr_warn_ratelimited(
> > > > > - "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> > > > > - smp_processor_id(),
> > > > > - div_u64(new, NSEC_PER_USEC),
> > > > > - div_u64(cfs_b->quota, NSEC_PER_USEC));
> > > > > + /*
> > > > > > + * Grow period by a factor of 2 to avoid losing precision.
> > > > > + * Precision loss in the quota/period ratio can cause __cfs_schedulable
> > > > > + * to fail.
> > > > > + */
> > > > > + new = old * 2;
> > > > > + if (new < max_cfs_quota_period) {
> > > >
> > > > I don't like this part as much. There may be a value between
> > > > max_cfs_quota_period/2 and max_cfs_quota_period that would get us out of
> > > > the loop. Possibly in practice it won't matter but here you trigger the
> > > > warning and take no action to keep it from continuing.
> > > >
> > > > Also, if you are actually hitting this then you might want to just start at
> > > > a higher but proportional quota and period.
> > >
> > > I'd like to do what you suggested. A quick idea would be to scale period to
> > > max_cfs_quota_period, and scale quota proportionally. However the naive
> > > implementation won't work under this edge case:
> > > original:
> > > quota: 500,000us period: 570,000us
> > > after scaling:
> > > quota: 877,192us period: 1,000,000us
> > > original ratio: 919803
> > > new ratio: 919802
> > >
> > > To do this right, the code would have to keep an eye out on the precision loss,
> > > and increase quota by 1us sometimes to cancel out the precision loss.
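
A rough sketch of that edge case, in userspace C and working in microseconds like the example above. The function names are hypothetical, and the "add 1us when the ratio dropped" step is only one possible reading of the compensation described; note it may round the ratio slightly up in other cases.

```c
#include <stdint.h>

#define MAX_PERIOD_US 1000000ULL /* the 1s cap discussed in this thread */

/* Same fixed-point form as normalized_cfs_quota(): (quota_us << 20) / period_us. */
uint64_t ratio(uint64_t quota_us, uint64_t period_us)
{
	return (quota_us << 20) / period_us;
}

/* Naive approach: jump the period to the max and scale quota with truncation. */
uint64_t naive_scaled_quota_us(uint64_t quota_us, uint64_t period_us)
{
	return (quota_us * MAX_PERIOD_US) / period_us;
}

/* Assumed compensation: bump quota by 1us if truncation lowered the ratio. */
uint64_t compensated_quota_us(uint64_t quota_us, uint64_t period_us)
{
	uint64_t q = naive_scaled_quota_us(quota_us, period_us);

	if (ratio(q, MAX_PERIOD_US) < ratio(quota_us, period_us))
		q += 1;
	return q;
}
```

For 500,000us/570,000us the naive version gives 877,192us and the ratio drops from 919803 to 919802; the compensated version gives 877,193us, which brings the ratio back to 919803.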
> > >
> > > Also, I think this case is not that important. Because if we are hitting
> > > this case, that suggests the period is already >0.5s. And if we are still
> > > hitting timeouts with a 0.5s period, scaling it to 1s probably won't help
> > > much.
> > > When this happens, I'd imagine the parent cgroup would have a LOT of child
> > > cgroups. It might make sense for the userspace to create the parent cgroup with
> > > 1s period.
> > >
> > > If you think automatically scaling 0.5s+ to 1s is still important, I'm happy
> > > to stash this patch, and send in another one that handles the 0.5+s -> 1s
> > > scaling the right way. :) Thanks!
> >
> > First let me understand your use case better. I was thinking about this more last
> > night and it doesn't make sense.
> >
> > You are setting a small quota and period on the parent cgroup and then setting the
> > same small quota and period on the child. As you say to keep the child from getting
> > more quota than the parent. But that should already be the case simply by setting
> > it on the parent. The child can't get more quota than the parent. All this does
> > is make the kernel do more work handling more period timers and such.
>
> Sorry for not being clear enough. Let me provide a bit more context:
>
> kubelet [1] is the userspace program setting the cfs quota and period.
> kubelet is essentially a container manager for the end user. The end user
> can specify any attainable configurations for a pod (which contains multiple
> containers).
>
> The user interface of kubelet allows end user to specify the amount of CPU
> granted to any pod or container (in the form of mCPU). And then kubelet will
> convert the spec to quota/period accepted by cgroup fs, using this rule:
> the period of any pod/container will be set to 100000us
> the quota of the pod/container will be calculated using the allowed mCPU
>
> And kubelet simply then writes the calculated period and quota to cgroup fs.
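
To make the conversion rule above concrete, a small sketch in C. The thread only says the quota is "calculated using the allowed mCPU"; the formula quota = mCPU * period / 1000 and the function name below are assumptions for illustration, not taken from kubelet's source.

```c
#include <stdint.h>

#define POD_PERIOD_US 100000ULL /* fixed period stated above */

/*
 * Assumed conversion: 1000 mCPU corresponds to one full period of CPU time,
 * so the quota is the proportional share of the period.
 */
uint64_t milli_cpu_to_quota_us(uint64_t milli_cpu)
{
	return (milli_cpu * POD_PERIOD_US) / 1000;
}
```

Under that assumption a pod and a container that are both granted 100 mCPU end up with the same 10,000us quota and 100,000us period, which is how identical parent and child settings arise.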
>
> It's very common to specify a pod with multiple containers, and setting
> different quota for the child containers: some granted with 5-50% of the
> bandwidth available to the parent, while some other granted with 100%. For
> simplicity, kubelet writes quota/period to cgroup fs for all pods and
> containers.
>
Thanks for the details :)
> ----
> Now back to our discussion. :)
>
> You see, the reason that kubelet writes identical quota and period to parent and
> child cgroup is not because it wants to enforce that the child doesn't get more
> quota than the parent. It is simply because kubelet needs to manage the quota for
> all containers and pods, and it's more convenient to just set the quota and
> period for all of them (because in many cases, child cgroups actually get less
> bandwidth than the parent, and have to be set specifically).
>
> I agree that your suggestion would work. If a child cgroup is set to the same
> bandwidth of the parent cgroup, we could change the userspace program, and ask
> it to skip setting the child cgroup bandwidth.
> However, this logic would be a special case, and will require significant logic
> change to the userspace container managers.
>
>
> This issue is affecting many Kubernetes users, see this open issue:
> https://github.com/kubernetes/kubernetes/issues/72878
> kubelet on their machines is doing the three operations mentioned in the patch.
> I also explained them in more detail in this doc:
> https://docs.google.com/document/d/13KLD__6A935igLXpTFFomqfclATC89nAkhPOxsuKA0I/edit?usp=sharing
>
> Basically, Kubernetes is operating on the below assumption of kernel today:
> Setting the cpu quota/period of a child cgroup should not be rejected unless
> the bandwidth is exceeding what the quota/period set for the parent cgroup.
>
> I think this assumption is fair. Please let me know if you think otherwise. And
> if so, since the kernel broke this assumption today, I don't think it's the
> responsibility for the userspace to deal with the problem that kernel may change
> the quota/period ratio at any time.
>
> [1] https://github.com/kubernetes/kubernetes/tree/master/pkg/kubelet
>
Okay. I'm on board with this. At your starting values you'll get 100, 200, 400, 800ms before
hitting max. That should be enough. I'm a little surprised you're hitting it even
at 100ms but it sounds like you have a lot of children. And if they have their own
settings that could be taking longer. I suspect contention on the cfs_b->lock could
be adding to it.
I do think that setup is wasting kernel cpu cycles but that's a somewhat orthogonal
discussion :)
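
The doubling sequence can be checked with a small userspace sketch of the rule in the patch (double only while the result stays below the max). The 1s cap comes from the comment in the old code; the function names are made up for this example.

```c
#include <stdint.h>

#define MAX_CFS_QUOTA_PERIOD_NS 1000000000ULL /* 1s */

/* One step of the patch's rule: returns 1 if period and quota were doubled. */
int scale_once(uint64_t *period_ns, uint64_t *quota_ns)
{
	uint64_t new_period = *period_ns * 2;

	if (new_period < MAX_CFS_QUOTA_PERIOD_NS) {
		*period_ns = new_period;
		*quota_ns *= 2;
		return 1;
	}
	return 0; /* would only warn here, no scaling */
}

/* Count how many doublings are possible and report the final period. */
int count_doublings(uint64_t period_ns, uint64_t quota_ns, uint64_t *final_period_ns)
{
	int steps = 0;

	while (scale_once(&period_ns, &quota_ns))
		steps++;
	*final_period_ns = period_ns;
	return steps;
}
```

Starting from a 100ms period this allows three doublings (200, 400, 800ms); the next step would be 1600ms, which is not below the max, so the period stays at 800ms.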
> >
> > Setting the child quota/period only makes sense when setting it smaller than
> > the parent.
>
> As mentioned above, in the use case of kubelet, it's much easier to always
> set the child quota/period than to only set it when it is different
> (i.e. smaller) than the parent.
>
> >
> > Also, in order to hit this problem you need to have many hundreds of children, in
> > my experience. In that case it makes even less sense to write the same quota/period
> > as the parent into each of the children.
>
> Here is a problematic scenario:
> The parent cgroup has 1000 children with a small quota/period, and after a
> few minutes, kubelet wants to add one additional child with the same
> quota/period.
> This bug could prevent kubelet from setting that one additional child
> successfully.
>
>
> Thanks a lot for taking the time to review and respond to the patch, Phil!
> Really appreciate it.
>
Sure thing. Thanks for tracking it down. I'll try to test this on my original
reproducer when I have a chance. I don't foresee any issues though, so for now:
Acked-by: Phil Auld <pauld@xxxxxxxxxx>
Cheers,
Phil
> Best regards,
> Xuewei
>
> >
> > Or there is something else causing the timer to take too long to run...
> >
> >
> > I agree that if we are taking > 1/2s to run do_sched_cfs_period_timer() it may
> > not matter, as I said above.
> >
> >
> > Cheers,
> > Phil
> >
> > >
> > > Best regards,
> > > Xuewei
> > >
> > > >
> > > >
> > > > Cheers,
> > > > Phil
> > > >
> > > > > + cfs_b->period = ns_to_ktime(new);
> > > > > + cfs_b->quota *= 2;
> > > > > +
> > > > > + pr_warn_ratelimited(
> > > > > + "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> > > > > + smp_processor_id(),
> > > > > + div_u64(new, NSEC_PER_USEC),
> > > > > + div_u64(cfs_b->quota, NSEC_PER_USEC));
> > > > > + } else {
> > > > > + pr_warn_ratelimited(
> > > > > + "cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> > > > > + smp_processor_id(),
> > > > > + div_u64(old, NSEC_PER_USEC),
> > > > > + div_u64(cfs_b->quota, NSEC_PER_USEC));
> > > > > + }
> > > > >
> > > > > /* reset count so we don't come right back in here */
> > > > > count = 0;
> > > > > --
> > > > > 2.23.0.581.g78d2f28ef7-goog
> > > > >