Re: BUG: HANG_DETECT waiting for migration_cpu_stop() complete

From: Mukesh Ojha
Date: Fri Sep 23 2022 - 10:35:08 EST


Hi Peter,


On 9/7/2022 2:20 AM, Peter Zijlstra wrote:
On Tue, Sep 06, 2022 at 04:40:03PM -0400, Waiman Long wrote:

I've not followed the earlier stuff due to being unreadable; just
reacting to this..

We are able to reproduce the issue described at this link:

https://lore.kernel.org/lkml/88b2910181bda955ac46011b695c53f7da39ac47.camel@xxxxxxxxxxxx/



diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 838623b68031..5d9ea1553ec0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2794,9 +2794,9 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
                if (cpumask_equal(&p->cpus_mask, new_mask))
                        goto out;

-               if (WARN_ON_ONCE(p == current &&
-                                is_migration_disabled(p) &&
-                                !cpumask_test_cpu(task_cpu(p), new_mask))) {
+               if (is_migration_disabled(p) &&
+                   !cpumask_test_cpu(task_cpu(p), new_mask)) {
+                       WARN_ON_ONCE(p == current);
                        ret = -EBUSY;
                        goto out;
                }
@@ -2818,7 +2818,11 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
        if (flags & SCA_USER)
                user_mask = clear_user_cpus_ptr(p);

-       ret = affine_move_task(rq, p, rf, dest_cpu, flags);
+       if (!is_migration_disabled(p) || (flags & SCA_MIGRATE_ENABLE)) {
+               ret = affine_move_task(rq, p, rf, dest_cpu, flags);
+       } else {
+               task_rq_unlock(rq, p, rf);
+       }

This cannot be right. There might be previous set_cpus_allowed_ptr()
callers that are blocked and waiting for the task to land on a valid
CPU.


I was wondering whether simply skipping, as below, would help here, though I am not sure.

The idea is to leave the task on its current CPU and wait for migration to be re-enabled on the task, so that the move is taken care of later.

------------------->O------------------------------------------

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d90d37c..7717733 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2390,8 +2390,10 @@ static int migration_cpu_stop(void *data)
          * we're holding p->pi_lock.
          */
         if (task_rq(p) == rq) {
-                if (is_migration_disabled(p))
+                if (is_migration_disabled(p)) {
+                        complete = true;
                         goto out;
+                }

if (pending) {
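
To make the intent of the change above concrete, here is a rough userspace model of it. This is not kernel code: every toy_* name is made up for illustration, the "done" flag only stands in for the completion that the stopper signals when "complete" is true, and everything except the migration-disabled branch is heavily simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; none of these exist in the kernel. */
struct toy_task {
	int cpu;                  /* CPU the task currently sits on */
	unsigned int cpus_mask;   /* bit n set => CPU n is allowed */
	bool migration_disabled;  /* models is_migration_disabled(p) */
};

struct toy_pending {
	int dest_cpu;             /* CPU the waiter wants the task on */
	bool done;                /* stand-in for signalling the waiter */
};

/*
 * Models only the "task is still on this rq" branch of the stopper.
 * complete_when_disabled == false: behaviour before the change.
 * complete_when_disabled == true:  behaviour with the change.
 */
static void toy_migration_stop(struct toy_task *p, struct toy_pending *pending,
			       bool complete_when_disabled)
{
	bool complete = false;

	assert(p != NULL && pending != NULL);

	if (p->migration_disabled) {
		/* The task cannot be moved now; optionally wake the waiter. */
		complete = complete_when_disabled;
		goto out;
	}

	/* Simplified: move the task and report success. */
	p->cpu = pending->dest_cpu;
	complete = true;
out:
	if (complete)
		pending->done = true;
}

/* True if the task's current CPU is inside its allowed mask. */
static bool toy_on_allowed_cpu(const struct toy_task *p)
{
	return (p->cpus_mask >> p->cpu) & 1u;
}
```

In this model, a migration-disabled task on CPU 0 with a new mask of only CPU 1 leaves the waiter blocked before the change. With the change the waiter is released while toy_on_allowed_cpu() is still false. If the model is faithful, that looks like the same concern as the one raised above about blocked set_cpus_allowed_ptr() callers, so the later fix-up when migration is re-enabled would have to be enough on its own.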


-Mukesh