Re: [PATCH] sched/deadline: Skip overflow check if 0 capacity

From: Waiman Long
Date: Fri Nov 08 2024 - 13:42:04 EST


On 11/8/24 11:31 AM, Juri Lelli wrote:
> On 08/11/24 08:39, Waiman Long wrote:
>> On 11/8/24 8:25 AM, Juri Lelli wrote:
>>> On 07/11/24 23:29, Waiman Long wrote:
>>>> After properly setting up a 1-CPU sched domain (partition) with no
>>>> tasks, it was found that offlining that particular CPU failed because
>>>> dl_bw_check_overflow() in cpuset_cpu_inactive() returned -EBUSY. This
>>>> is due to the fact that dl_bw_capacity() returns 0 as the sched domain
>>>> has no active CPU, causing a false positive in the overflow check.
>>>>
>>>> Fix this corner case by skipping the __dl_overflow() check in
>>>> dl_bw_manage() when the returned capacity is 0.
>>>>
>>>> Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
>>>> ---
>>>>  kernel/sched/deadline.c | 8 +++++++-
>>>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>>>> index be1b917dc8ce..0195f350d6d3 100644
>>>> --- a/kernel/sched/deadline.c
>>>> +++ b/kernel/sched/deadline.c
>>>> @@ -3479,7 +3479,13 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
>>>>  	} else {
>>>>  		unsigned long cap = dl_bw_capacity(cpu);
>>>>
>>>> -		overflow = __dl_overflow(dl_b, cap, 0, dl_bw);
>>>> +		/*
>>>> +		 * In the unlikely case of 0 capacity (e.g. a sched domain
>>>> +		 * with no active CPUs), skip the overflow check as it will
>>>> +		 * always return a false positive.
>>>> +		 */
>>>> +		if (likely(cap))
>>>> +			overflow = __dl_overflow(dl_b, cap, 0, dl_bw);
>>> The remaining total_bw that makes this check fail should be the one
>>> relative to the dl_server on the cpu that is going offline. Wonder if we
>>> shouldn't rather clean that up (remove the dl_server contribution) before
>>> we get to this point during a hotplug operation. Need to think about it a
>>> little more.
>> static inline bool
>> __dl_overflow(struct dl_bw *dl_b, unsigned long cap, u64 old_bw, u64 new_bw)
>> {
>> 	return dl_b->bw != -1 &&
>> 	       cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
>> }
>>
>> With a 0 cap, cap_scale(dl_b->bw, cap) will always be 0. As long as
>> total_bw isn't 0 and bw isn't -1, the condition will be true.
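
To make that concrete, here is a standalone userspace sketch of the
failing condition (the constants are made up; cap_scale() and BW_SHIFT
mirror their kernel definitions):

#include <stdio.h>

#define BW_SHIFT		20
#define SCHED_CAPACITY_SHIFT	10
#define cap_scale(v, cap)	(((v) * (cap)) >> SCHED_CAPACITY_SHIFT)

int main(void)
{
	unsigned long bw = (95UL << BW_SHIFT) / 100; /* 95% bw limit */
	unsigned long total_bw = 1UL << 15;	/* leftover dl_server bw */
	unsigned long cap = 0;			/* no active CPUs */

	/* cap_scale(bw, 0) == 0, so any total_bw > 0 reads as overflow */
	printf("overflow = %d\n", cap_scale(bw, cap) < total_bw);
	return 0;
}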
> Right, but I fear that by hiding this special corner case we would also
> miss the cases where we have DEADLINE tasks with bandwidth on that
> single CPU and end up ignoring it.
>
> So, maybe something like the below?
>
> ---
>  kernel/sched/deadline.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index be1b917dc8ce..7884e566557c 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -130,11 +130,24 @@ static inline int dl_bw_cpus(int i)
>  	if (cpumask_subset(rd->span, cpu_active_mask))
>  		return cpumask_weight(rd->span);
>
> +	/*
> +	 * Hotplug extreme case in which the last remaining online CPU in a
> +	 * root domain is going offline. We get here after that CPU has been
> +	 * cleared from cpu_active_mask, so the loop below would return 0
> +	 * CPUs, which of course doesn't make much sense. Return at least 1
> +	 * CPU.
> +	 */
> +
> +	if (cpumask_weight(rd->span) == 1)
> +		return 1;
> +
>  	cpus = 0;
>  	for_each_cpu_and(i, rd->span, cpu_active_mask)
>  		cpus++;
>
> +	WARN_ON(!cpus);
> +
>  	return cpus;
>  }
> ---

This patch looks good to me.

With this patch and my cpuset patches applied, the test_cpuset_prs.sh
test passed without any errors. You can add the following tags when you
send this patch out.

Reported-by: Waiman Long <longman@xxxxxxxxxx>
Tested-by: Waiman Long <longman@xxxxxxxxxx>
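
For reference, a minimal reproducer sketch along the lines of the commit
message; the cgroup path and CPU number are illustrative, the exact
cpuset-v2 steps can vary by kernel version, and error handling is
trimmed:

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF || fclose(f) == EOF) {
		perror(path);
		exit(1);
	}
}

int main(void)
{
	/* Enable the cpuset controller for children of the root cgroup. */
	write_str("/sys/fs/cgroup/cgroup.subtree_control", "+cpuset");

	/* Create a child cpuset holding only CPU 3 (arbitrary choice). */
	mkdir("/sys/fs/cgroup/test", 0755);
	write_str("/sys/fs/cgroup/test/cpuset.cpus", "3");

	/* Turn it into an isolated partition: a 1-CPU sched domain. */
	write_str("/sys/fs/cgroup/test/cpuset.cpus.partition", "isolated");

	/*
	 * No tasks were placed in the partition, so offlining CPU 3
	 * should succeed; without the fix it fails with -EBUSY.
	 */
	write_str("/sys/devices/system/cpu/cpu3/online", "0");
	return 0;
}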

> That said though, I believe I just found an additional issue. With the
> above the system doesn't crash (it did w/o it), but happily moves
> DEADLINE tasks out of a domain with a single CPU going offline. Last
> time I looked at this we were properly checking and failing the hotplug
> operation, but it was indeed a while ago, so not sure yet what changed.
> More staring.

Oh, so broken, yay. :)

The cpuset code will move the tasks in a partition with no effective
CPUs left to its parent cpuset.

Cheers,
Longman