Re: [PATCH v3 0/2] governor: Fix races and stale baseline on prev_cpu_nice

From: Zhongqiu Han

Date: Wed May 27 2026 - 10:30:04 EST


On 5/27/2026 1:09 AM, Rafael J. Wysocki wrote:
On Sun, May 24, 2026 at 7:57 AM Zhongqiu Han
<zhongqiu.han@xxxxxxxxxxxxxxxx> wrote:

On 5/22/2026 11:14 PM, Rafael J. Wysocki wrote:
On Sun, Apr 19, 2026 at 3:27 PM Zhongqiu Han
<zhongqiu.han@xxxxxxxxxxxxxxxx> wrote:

Patch 1 fixes two baseline-consistency issues in the DBS governor
framework.

gov_update_cpu_data() writes prev_cpu_idle and prev_cpu_nice while
holding only attr_set->update_lock, whereas dbs_update() reads and
writes the same fields while holding only policy_dbs->update_mutex.
Because these are independent locks, the two paths are not mutually
exclusive. The fix acquires policy_dbs->update_mutex inside
gov_update_cpu_data() for each policy.

Additionally, cpufreq_dbs_governor_start() reads io_is_busy before
acquiring policy_dbs->update_mutex. A concurrent io_is_busy_store()
can update io_is_busy and call gov_update_cpu_data(), which writes
prev_cpu_idle with the new value under the mutex.
cpufreq_dbs_governor_start() then overwrites prev_cpu_idle with the
stale value. The fix reads io_is_busy inside the mutex.

Patch 2 fixes a stale-baseline race on prev_cpu_nice when
ignore_nice_load is enabled via sysfs. There is a race window
between setting ignore_nice_load and gov_update_cpu_data() acquiring
update_mutex: a concurrent dbs_update() can see the new ignore_nice_load
value but read a stale prev_cpu_nice, producing an inflated idle_time
for that sample.

The fix unconditionally advances prev_cpu_nice on every dbs_update()
call. With prev_cpu_nice always reflecting the most recent sample,
enabling ignore_nice_load can never produce a stale-baseline spike.
To keep prev_cpu_nice handling consistent, gov_update_cpu_data()
resets prev_cpu_nice unconditionally alongside prev_cpu_idle, and
cpufreq_dbs_governor_start() initializes prev_cpu_nice unconditionally.

v3:
- Update linux-next base
- Fix two additional issues found during sashiko.dev review:
- Patch 1: move io_busy read inside policy_dbs->update_mutex in
cpufreq_dbs_governor_start() to close a TOCTOU window against
concurrent io_is_busy_store().
- Patch 2: restore unconditional prev_cpu_nice reset in
gov_update_cpu_data() to keep both baselines consistent when
io_is_busy changes with ignore_nice_load enabled; update comment.
- Patch 2: keep gov_update_cpu_data() in ignore_nice_load_store() to
prevent a stale-baseline spike in the governor-start window (when
ignore_nice_load is set before the first dbs_update() call); only
cpufreq_governor.c is modified. Update commit message and add one
more fixes tag '326c86deaed54a ("[CPUFREQ] Remove unneeded locks").'
- Update the commit messages.
- Link to v2: https://lore.kernel.org/all/20260409111407.9775-1-zhongqiu.han@xxxxxxxxxxxxxxxx/
- Sashiko v2 comments: https://sashiko.dev/#/patchset/20260409111407.9775-1-zhongqiu.han%40oss.qualcomm.com

So sashiko.dev still has some comments on the first patch in v3:

https://sashiko.dev/#/patchset/20260419132655.3800673-1-zhongqiu.han%40oss.qualcomm.com

Can you please have a look at that and let me know what you think?



Hi Rafael,
Thanks for pointing this out. I've looked at both comments:

Comment 1 (io_is_busy transition window):

The race is real but pre-existing. After Patch 1 the only remaining
window is the gap between the io_is_busy write and gov_update_cpu_data()
acquiring the lock:

CPU0 (io_is_busy_store): CPU1 (dbs_update):
io_is_busy = 1 mutex_lock()
/* waiting for mutex */ io_busy = 1 /* sees new value */
cur_idle = get_cpu_idle_time(io_busy=1)
/* prev_cpu_idle still set with io_busy=0 */
/* idle_time may be clamped to 0 */
/* load inflated for this sample */
mutex_unlock()
mutex_lock()
prev_cpu_idle = ../* reset */
mutex_unlock()

The mismatch between the two metrics causes a one-sample anomaly in
either direction: 0→1 tends to inflate the load estimate (frequency
pushed upward), while 1→0 tends to deflate it. The AI's "drives to
minimum" conclusion is wrong for the 0→1 case.

This window is inherent to the design. One possible fix is to add a per
policy io_is_busy copy inside policy_dbs_info and update it under
update_mutex in gov_update_cpu_data(), so dbs_update() always reads a
value consistent with the current prev_cpu_idle baseline. The impact
should be a single-sample frequency spike that self-corrects.


Comment 2 (sampling_rate_store vs. gov_set_update_util):

Also pre-existing. The race:

cpufreq_dbs_governor_start(): sampling_rate_store():
mutex_unlock() mutex_lock()
gov->start() sample_delay_ns = 0
gov_set_update_util() mutex_unlock()
sample_delay_ns = rate <--->

If needed, e possible approach would be to move both gov->start() and
gov_set_update_util() inside the mutex would work.

Yeah.

So I've applied both patches in the series as 7.2 material.

Please feel free to send a follow-up patch making the change mentioned above.

Thanks!


Thanks Rafael,

Sure, I'll look into that and follow up with a patch accordingly.


--
Thx and BRs,
Zhongqiu Han