Re: [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue
From: Chen, Yu C
Date: Wed Oct 29 2025 - 08:48:32 EST
On 10/29/2025 12:32 PM, K Prateek Nayak wrote:
Hello Tim,
On 10/28/2025 9:16 PM, Tim Chen wrote:
On Tue, 2025-10-28 at 23:15 +0800, Chen, Yu C wrote:
On 10/27/2025 2:04 PM, K Prateek Nayak wrote:
Hello Tim,
On 10/11/2025 11:54 PM, Tim Chen wrote:
@@ -3999,6 +4038,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
struct rq *rq = rq_of(cfs_rq);
account_numa_enqueue(rq, task_of(se));
+ account_llc_enqueue(rq, task_of(se));
list_add(&se->group_node, &rq->cfs_tasks);
}
cfs_rq->nr_queued++;
@@ -4010,9 +4050,14 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
update_load_sub(&cfs_rq->load, se->load.weight);
if (entity_is_task(se)) {
account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+ account_llc_dequeue(rq_of(cfs_rq), task_of(se));
list_del_init(&se->group_node);
}
cfs_rq->nr_queued--;
+
+ /* safeguard to clear the cache aware data */
+ if (!parent_entity(se) && !cfs_rq->nr_queued)
+ reset_llc_stats(rq_of(cfs_rq));
Instead of relying on the reset_llc_stats() hack, I think a better approach
would be to have a "p->se.llc_sched_active" flag similar to uclamp's
"uc_se->active": set it in account_llc_enqueue(), which still checks
sched_cache_enabled(), while account_llc_dequeue() only checks
"p->se.llc_sched_active" before decrementing the stats and then unsets
the flag.
That way, the accounting can never become imbalanced. Thoughts?
I suppose what you mean is avoiding the race condition between
enabling sched_cache and the LLC enqueue/dequeue accounting, similar to uclamp:
enqueue(taskA)
// sched_cache gets enabled
enqueue(taskB)
dequeue(taskA)
// Must not decrement rq->llc_pref for taskA
For this case, task A is already on the rq when sched_cache gets
enabled, but task A's preferred_llc is still -1.
If we dequeue it while its preferred_llc is still -1, it won't
affect rq->llc_pref.
If we change its preferred_llc to llc_i before we dequeue it,
then rq->llc_pref[llc_i] will be incremented first.
Then when we dequeue task A, we will decrement it, so
rq->llc_pref[llc_i] is still accounted correctly with the
current code.
So what I really disliked was having reset_llc_stats() reset
the stat, but looking at it again, that too is guarded by the
sched_cache_enabled() check, so I think the counters can
still go out of balance if:
/* Cache aware scheduling enabled */
enqueue(TaskA) /* nr_llc_running = 1 */
enqueue(TaskB) /* nr_llc_running = 2 */
enqueue(TaskC) /* nr_llc_running = 3 */
dequeue(TaskA) /* nr_llc_running = 2 */
/* Cache aware scheduling disabled */
dequeue(TaskB) /* skipped: nr_llc_running stays at 2 */
If we introduce the mechanism you suggested previously (enable
p->llc_sched_active in account_llc_enqueue(), which still checks
sched_cache_enabled(), but have account_llc_dequeue() check only
p->llc_sched_active before decrementing the stats), then the above
scenario is covered: dequeue(TaskB) will decrease nr_llc_running
even if cache-aware scheduling is disabled. Another idea is to reset
all per-CPU statistics when cache-aware scheduling is disabled at
runtime; this might also avoid several other race conditions, for
example CPU hotplug vs. cache-aware scheduling.
thanks,
Chenyu