Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning

From: Heiko Carstens
Date: Mon Aug 15 2016 - 07:19:23 EST


On Mon, Aug 08, 2016 at 03:45:05PM +0800, Ming Lei wrote:
> On Sat, Jul 30, 2016 at 7:25 PM, Heiko Carstens
> <heiko.carstens@xxxxxxxxxx> wrote:
> > On Wed, Jul 27, 2016 at 05:23:05PM +0200, Thomas Gleixner wrote:
> >> On Wed, 27 Jul 2016, Heiko Carstens wrote:
> >> > [ 3.162961] ([<0000000000176c30>] select_task_rq+0xc0/0x1a8)
> >> > [ 3.162963] ([<0000000000177d64>] try_to_wake_up+0x2e4/0x478)
> >> > [ 3.162968] ([<000000000015d46c>] create_worker+0x174/0x1c0)
> >> > [ 3.162971] ([<0000000000161a98>] alloc_unbound_pwq+0x360/0x438)
> >>
> >> > For some unknown reason select_task_rq() gets called with a task that has
> >> > nr_cpus_allowed == 0. Hence "cpu = cpumask_any(tsk_cpus_allowed(p));"
> >> > within select_task_rq() will set cpu to nr_cpu_ids which in turn causes the
> >> > warning later on.
> >> >
> >> > It only happens with more than one node, otherwise it seems to work fine.
> >> >
> >> > Any idea what could be wrong here?
> >>
> >> create_worker()
> >>   tsk = kthread_create_on_node();
> >>   kthread_bind_mask(tsk, pool->attrs->cpumask);
> >>     do_set_cpus_allowed(tsk, mask);
> >>       set_cpus_allowed_common(tsk, mask);
> >>         cpumask_copy(&tsk->cpus_allowed, mask);
> >>         tsk->nr_cpus_allowed = cpumask_weight(mask);
> >>   wake_up_process(tsk);
> >>
> >> So this looks like pool->attrs->cpumask is simply empty.....
> >
> > Just had some time to look into this a bit more. Looks like we initialize
> > the cpu_to_node_masks (way) too late on s390 for fake numa. So Peter's
> > patch just revealed that problem.
> >
> > I'll see if initializing the masks earlier will fix this, but I think it
> > will.
>
> Hello,
>
> Is there any fix for this issue? I can see the issue on arm64 running
> a v4.7 kernel too, and the oops can be avoided by reverting commit
> e9d867a ("sched: Allow per-cpu kernel threads to run on online && !active").

I don't know about the arm64 issue. The s390 problem results from
initializing the cpu_to_node mapping too late.

However, the workqueue code seems to assume that we know the cpu_to_node
mapping for all _possible_ cpus very early, and apparently also that this
mapping is stable and never changes afterwards.
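
For reference, wq_numa_init() walks all _possible_ cpus very early and
bakes the mapping into per-node tables. Condensed from memory (details
elided), it does roughly:

	for_each_possible_cpu(cpu) {
		node = cpu_to_node(cpu);
		if (WARN_ON(node == NUMA_NO_NODE))
			return;		/* NUMA support disabled */
		cpumask_set_cpu(cpu, tbl[node]);  /* mapping is baked in here */
	}

If cpu_to_node() still returns 0 for every cpu at that point, all cpus end
up in the node 0 table and the tables of all other nodes stay empty, which
would explain the empty pool->attrs->cpumask above.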

This assumption, however, contradicts the purpose of commit 346404682434
("numa, cpu hotplug: change links of CPU and node when changing node number
by onlining CPU").

So something is wrong here...

On s390 with fake numa we wouldn't even know the mapping of all _possible_
cpus at boot time. When establishing the node mapping we try hard to map
our existing cpu topology into a sane node mapping. However, we simply don't
know where non-present cpus are located topology-wise. Even for present
cpus the answer is not always known, since present cpus can be in either the
state "configured" (topology location known, cpu online possible) or
"deconfigured" (topology location unknown, cpu online not possible).

I can imagine several ways to fix this for s390, but before doing that I'm
wondering if the workqueue code is correct in

a) assuming that the cpu_to_node() mapping is valid for all _possible_ cpus
that early

and

b) assuming that the cpu_to_node() mapping never changes

Tejun?