Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

From: Andrew Jones
Date: Mon Sep 17 2012 - 09:49:04 EST


On Sat, Sep 15, 2012 at 09:38:54PM +0530, Raghavendra K T wrote:
> On 09/14/2012 10:40 PM, Andrew Jones wrote:
> >On Thu, Sep 13, 2012 at 04:30:58PM -0500, Andrew Theurer wrote:
> >>On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote:
> >>>* Andrew Theurer <habanero@xxxxxxxxxxxxxxxxxx> [2012-09-11 13:27:41]:
> >>>
> [...]
> >>
> >>On picking a better vcpu to yield to: I really hesitate to rely on a
> >>paravirt hint [telling us which vcpu is holding a lock], but I am not
> >>sure how else to reduce the candidate vcpus to yield to. I suspect we
> >>are yielding to way more vcpus than are preempted lock-holders, and that
> >>IMO is just work accomplishing nothing. Trying to think of ways to
> >>further reduce candidate vcpus....
> >>
> >
> >Wrt yielding to vcpus on the same cpu, I recently noticed that
> >there's a bug in yield_to_task_fair(): yield_task_fair() calls
> >clear_buddies(), so if we're yielding to a task that has been running on
> >the same cpu that we're currently running on, and thus is also on the
> >current cfs runqueue, then our 'who to pick next' hint is getting cleared
> >right after we set it.
> >
> >I had hoped that the patch below would show a general improvement in the
> >vcpu overcommit performance; however, the results were variable - no worse,
> >no better. Based on your results above showing good improvement from
> >interleaving vcpus across the cpus, there must have been a decent
> >percentage of these types of yields going on. So since the patch didn't
> >change much, that indicates the next hinting isn't generally taken
> >too seriously by the scheduler. Anyway, the patch should correct the
> >code per its design, and testing shows that it didn't make anything worse,
> >so I'll post it soon. Also, in order to try to improve how far set-next
> >can jump ahead in the queue, I tested a kernel with group scheduling
> >compiled out (libvirt uses cgroups, and I'm not sure whether autogroups
> >affect things). I did get a slight improvement with that, but nothing to
> >write home to mom about.
> >
> >diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >index c219bf8..7d8a21d 100644
> >--- a/kernel/sched/fair.c
> >+++ b/kernel/sched/fair.c
> >@@ -3037,11 +3037,12 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
> > 	if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
> > 		return false;
> >
> >+	/* We're yielding, so tell the scheduler we don't want to be picked */
> >+	yield_task_fair(rq);
> >+
> > 	/* Tell the scheduler that we'd really like pse to run next. */
> > 	set_next_buddy(se);
> >
> >-	yield_task_fair(rq);
> >-
> > 	return true;
> > }
> >
>
> Hi Drew, I agree with your fix and have tested the patch too... the
> results are pretty much the same. Puzzled why that is.

Looking at the code, I see that the next hint might be used more frequently
if we bump up the kernel.sched_wakeup_granularity_ns sysctl. I also just
found out that some virt tuned profiles do that, so maybe I should try
running with one of those profiles.
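
For context on why the granularity matters: as far as I can tell,
pick_next_entity() only honors the next buddy when running it wouldn't be
too unfair to the leftmost entity, and "too unfair" is bounded by the
wakeup granularity. Roughly (paraphrased from kernel/sched/fair.c, so
treat it as a sketch rather than verbatim code):

	/* in pick_next_entity(): prefer the next buddy (set by yield_to()),
	 * but only if it isn't too far ahead of the leftmost entity */
	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
		se = cfs_rq->next;

	/* wakeup_preempt_entity() returns 1 ("don't") when the buddy's
	 * vruntime is ahead of the leftmost entity's by more than the wakeup
	 * granularity, so bumping kernel.sched_wakeup_granularity_ns widens
	 * the window in which the hint is actually honored. */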

>
> thinking ... maybe we hit this when #vcpu (of a VM) > #pcpu?
> (pigeonhole principle ;)).

Not sure, but I haven't done any experiments where a single VM has more
vcpus than the system has pcpus. For my vcpu overcommit testing I increase
the VM count, where each VM has #vcpus <= #pcpus.

Drew