Re: [RFC PATCH v2 00/17] Core scheduling v2

From: Aaron Lu
Date: Wed May 08 2019 - 22:13:51 EST


On Wed, May 08, 2019 at 01:49:09PM -0400, Julien Desfossez wrote:
> On 08-May-2019 10:30:09 AM, Aaron Lu wrote:
> > On Mon, May 06, 2019 at 03:39:37PM -0400, Julien Desfossez wrote:
> > > On 29-Apr-2019 11:53:21 AM, Aaron Lu wrote:
> > > > This is what I have used to make sure no two unmatched tasks are
> > > > scheduled on the same core (on top of v1; I think it's easier to just
> > > > show the diff instead of commenting on various places of the patches :-)
> > >
> > > We imported this fix in v2 and made some small changes and optimizations
> > > (with and without Peter's fix from https://lkml.org/lkml/2019/4/26/658)
> > > and in both cases, the performance problem where the core can end up
> >
> > By 'core', do you mean a logical CPU (hyperthread) or the entire core?
> No I really meant the entire core.
>
> I'm sorry, I should have added a little bit more context. This relates
> to a performance issue we saw in v1 and discussed here:
> https://lore.kernel.org/lkml/20190410150116.GI2490@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#mb9f1f54a99bac468fc5c55b06a9da306ff48e90b
>
> We proposed a fix that solved this, Peter came up with a better one
> (https://lkml.org/lkml/2019/4/26/658), but if we add your isolation fix
> as posted above, the same problem reappears. Hope this clarifies your
> ask.

It's clear now, thanks.
I don't immediately see how my isolation fix would make your fix stop
working; I will need to check. But I'm busy with other stuff, so it will
take a while.
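
For reference, the invariant my isolation diff is meant to enforce is
roughly the following (a minimal sketch using the v1 naming such as
p->core_cookie; this is an illustration, not the exact code from the diff):

	static inline bool cookie_match(struct task_struct *a,
					struct task_struct *b)
	{
		/* The idle task is compatible with everything. */
		if (is_idle_task(a) || is_idle_task(b))
			return true;

		/*
		 * Otherwise, two tasks may share a core only if their
		 * cookies are equal.
		 */
		return a->core_cookie == b->core_cookie;
	}

Every candidate picked for a sibling is checked against the task already
selected for the core, so no two unmatched tasks run on the same core at
the same time.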

>
> I hope that we did not miss anything crucial while integrating your fix
> on top of v2 + Peter's fix. The changes are conceptually similar, but we
> refactored it slightly to make the logic clearer. Please have a look and
> let us know.

I suppose you already have a branch that has all the bits there? I
wonder if you can share that branch somewhere so I can start working on
top of it, to make sure we are on the same page?

Also, it would be good if you could share the workload, cmdline options,
how many workers need to be started, etc., so I can reproduce this issue.

Thanks.