[PATCH] [BUG_FIX] sched/fair: scan period increases to max when migration fails

From: Binwon Song
Date: Fri Apr 04 2025 - 05:54:21 EST


Signed-off-by: Binwon Song <qlsdnjs236@xxxxxxxxxxxxxx>
Co-developed-by: Heesn Jo <heesn.jo@xxxxxxxxx>

We observe that the NUMA scan period is not adjusted properly. The dynamic adjustment of the NUMA scan period is meant to converge on an appropriate value, but the current implementation does not behave as intended.

Once a migration fails, the failure count in numa_faults_locality[2] is never cleared, so the if condition is satisfied on every subsequent call, causing an immediate return. This happens because the code that resets the counter is placed below the conditional block that checks for migration failure. As a result, the NUMA scan period (numa_scan_period) keeps doubling on every placement evaluation until it reaches its maximum value. This can significantly delay subsequent NUMA balancing attempts, even in cases where retries would be beneficial.

To prevent this, the counter must be reset properly. However, upon reviewing numa_faults_locality[2], we found that it is only ever incremented and is never consumed anywhere else. Given this, it is better to remove it entirely.

To address this issue, we reuse the task->numa_pages_migrated variable to track migration failures. Previously, this variable was only accumulated when a migration succeeded; we now explicitly reset it to 0 when TNF_MIGRATE_FAIL is set. This makes migration failures reliably detectable (numa_pages_migrated == 0 means the last migration failed), so the numa_faults_locality[2] slot is no longer necessary. Since numa_pages_migrated is not consumed anywhere else, repurposing it this way causes no issues.

Additionally, we confirmed that vmstat tracks only the number of migrated pages, which is unrelated to numa_faults_locality or numa_pages_migrated.

Ultimately, this bug fix improves the responsiveness of NUMA migration and shrinks task_struct by removing an unused unsigned long.
---
include/linux/sched.h | 2 +-
kernel/sched/fair.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac1982893..cd0aa51d85ae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1394,7 +1394,7 @@ struct task_struct {
* period is adapted based on the locality of the faults with different
* weights depending on whether they were shared or private faults
*/
- unsigned long numa_faults_locality[3];
+ unsigned long numa_faults_locality[2];

unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e580..5576a784d1b4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2717,7 +2717,7 @@ static void update_task_scan_period(struct task_struct *p,
* migration then it implies we are migrating too quickly or the local
* node is overloaded. In either case, scan slower
*/
- if (local + shared == 0 || p->numa_faults_locality[2]) {
+ if (local + shared == 0 || !p->numa_pages_migrated) {
p->numa_scan_period = min(p->numa_scan_period_max,
p->numa_scan_period << 1);

@@ -3237,7 +3237,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
if (migrated)
p->numa_pages_migrated += pages;
if (flags & TNF_MIGRATE_FAIL)
- p->numa_faults_locality[2] += pages;
+ p->numa_pages_migrated = 0;

p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
--
2.34.1