On Tue, May 01, 2018 at 01:20:26PM +0530, Kohli, Gaurav wrote:
But In our older case, where we have seen failure below is the wake up path
and ftraces, Wakeup occured and completed before schedule call only.
So final state of CPUHP is running not parked. I have also pasted debug
ftraces that we got during issue reproduction.
Here wakeup for cpuhp is below:
takedown_cpu-> kthread_park-> wake_up_process
39,034,311,742,395 apps (10240) Trace Printk cpuhp/0 (16) [000]
39015.625000: <debug> __kthread_parkme state=512 task=ffffffcc7458e680
flags: 0x5 -> state 5 -> state is parked inside parkme function
39,034,311,846,510 apps (10240) Trace Printk cpuhp/0 (16) [000]
39015.625000: <debug> before schedule __kthread_parkme state=0
task=ffffffcc7458e680 flags: 0xd -> just before schedule call, state is
running
tatic void __kthread_parkme(struct kthread *self)
{
__set_current_state(TASK_PARKED);
while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
complete(&self->parked);
schedule();
__set_current_state(TASK_PARKED);
}
clear_bit(KTHREAD_IS_PARKED, &self->flags);
__set_current_state(TASK_RUNNING);
}
So my point is here also, if it is reschedule then it can set TASK_PARKED,
but it seems after takedown_cpu call this thread never get a chance to run,
So final state is TASK_RUNNING.
In our current fix also can't we observe same scenario where final state is
TASK_RUNNING.
I'm not sure I understand your concern. Loosing the TASK_PARKED store
with the above code is obviously bad. But with the loop as proposed I
don't see a problem.
takedown_cpu() can proceed beyond smpboot_park_threads() and kill the
CPU before any of the threads are parked -- per having the complete()
before hitting schedule().
And, afaict, that is harmless. When we go offline, sched_cpu_dying() ->
migrate_tasks() will migrate any still runnable threads off the cpu.
But because at this point the thread must be in the PARKED wait-loop, it
will hit schedule() and go to sleep eventually.
Also note that kthread_unpark() does __kthread_bind() to rebind the
threads.
Aaaah... I think I've spotted a problem there. We clear SHOULD_PARK
before we rebind, so if the thread lost the first PARKED store,
does the completion, gets migrated, cycles through the loop and now
observes !SHOULD_PARK and bails the wait-loop, then __kthread_bind()
will forever wait.
Is that what you had in mind?
--
To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html