[PATCH AUTOSEL 6.19-5.15] tracing: Fix false sharing in hwlat get_sample()
From: Sasha Levin
Date: Sun Feb 15 2026 - 10:05:11 EST
From: Colin Lord <clord@xxxxxxxxxxx>
[ Upstream commit f743435f988cb0cf1f521035aee857851b25e06d ]
The get_sample() function in the hwlat tracer assumes the caller holds
hwlat_data.lock, but this is not actually the case. The result is
unprotected access to hwlat_data, which in per-cpu mode can lead to
false sharing that may show up as false positive latency events.
The specific case of false sharing observed was primarily between
hwlat_data.sample_width and hwlat_data.count. These fields are separated
by just 8 bytes and are therefore likely to share a cache line. When one
thread modifies count, the cache line enters the modified state, so when
other threads read sample_width in the main latency detection loop they
must fetch the modified cache line. On some systems, that fetch alone
may be slow enough to count as a latency event, setting up a
self-reinforcing cycle: each event increments count, which invalidates
the line again and triggers further latency events.
The other result of the unprotected data access is that hwlat_data.count
can end up with duplicate or missed values, which was observed on some
systems in testing.
Convert hwlat_data.count to atomic64_t so it can be safely modified
without locking, and prevent false sharing by pulling sample_width into
a local variable.
One system this was tested on was a dual socket server with 32 CPUs on
each NUMA node. With settings of 1us threshold, 1000us width, and
2000us window, this change reduced the number of latency events from
500 per second down to approximately 1 event per minute. Some machines
tested did not exhibit measurable latency from the false sharing.
Cc: Masami Hiramatsu <mhiramat@xxxxxxxxxx>
Cc: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Link: https://patch.msgid.link/20260210074810.6328-1-clord@xxxxxxxxxxx
Signed-off-by: Colin Lord <clord@xxxxxxxxxxx>
Signed-off-by: Steven Rostedt (Google) <rostedt@xxxxxxxxxxx>
Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
---
LLM Generated explanations, may be completely bogus:
## Analysis of the Commit
### 1. Commit Message Analysis
The commit message is very detailed and clearly describes:
- **The bug**: `get_sample()` assumes `hwlat_data.lock` is held, but
it's not actually held. This leads to unprotected data access and
false sharing in per-cpu mode.
- **The specific false sharing**: `hwlat_data.sample_width` and
`hwlat_data.count` are 8 bytes apart, likely sharing a cache line.
When one thread modifies `count`, other threads reading `sample_width`
in the latency detection loop fetch the modified cache line, causing
measurable latency.
- **The self-reinforcing cycle**: The latency from cache line fetch
triggers a latency event, which increments `count`, which causes more
cache line invalidation, which causes more latency events — a vicious
cycle.
- **The data race**: `hwlat_data.count` can end up with duplicate or
missed values due to unprotected concurrent access.
- **Concrete test results**: On a dual-socket 32-CPUs-per-node server,
latency events dropped from 500/sec to ~1/min.
Keywords: "false sharing", "false positive latency events", "unprotected
data access", "duplicate or missed values."
### 2. Code Change Analysis
The changes are minimal and surgical:
1. **`hwlat_data.count` converted from `u64` to `atomic64_t`**: This
fixes a real data race where multiple threads could concurrently
increment `count` without synchronization, leading to lost updates
(duplicate/missed values). The `atomic64_inc_return()` replaces
`hwlat_data.count++; s.seqnum = hwlat_data.count;` with a single
atomic operation.
2. **`sample_width` pulled into a local variable with `READ_ONCE()`**:
`u64 sample_width = READ_ONCE(hwlat_data.sample_width);` is used
instead of reading `hwlat_data.sample_width` in the hot loop (`do {
... } while (total <= sample_width)`). This:
- Prevents the CPU from repeatedly fetching a cache line that may be
bouncing between cores
- Eliminates the false sharing between `sample_width` and `count`
- Uses `READ_ONCE()` for proper load semantics
3. **Init path updated**: `hwlat_data.count = 0` →
`atomic64_set(&hwlat_data.count, 0)` — consistent with the type
change.
4. **Comment fix**: Removes the incorrect claim that `get_sample()` is
"called with hwlat_data.lock held."
### 3. Bug Classification
This commit fixes **multiple real bugs**:
1. **Data race on `hwlat_data.count`**: Concurrent unsynchronized access
to a shared counter. This is a real correctness bug — sequence
numbers can be duplicated or skipped.
2. **False sharing causing false positive latency detection**: The hwlat
tracer is producing bogus results (500 events/sec vs 1/min). This is
a **functional correctness bug** — the tracer is reporting phantom
hardware latency that doesn't exist. Users relying on hwlat tracer
results would be misled.
3. **Self-reinforcing feedback loop**: The false sharing creates a
pathological cycle that makes the tracer practically unusable on some
multi-socket systems.
### 4. Scope and Risk Assessment
- **Lines changed**: ~15 lines of actual code changes across 4 hunks in
a single file
- **Files touched**: 1 (`kernel/trace/trace_hwlat.c`)
- **Risk**: Very low. The changes are:
- Converting a counter to atomic (well-understood primitive)
- Caching a value in a local variable (safe, the value doesn't need to
be re-read)
- Using `READ_ONCE()` (standard kernel pattern)
- **Subsystem**: Tracing — self-contained, well-maintained by Steven
Rostedt
- **Could it break something?**: Extremely unlikely. The atomic64
operations are well-tested primitives, and caching sample_width is
semantically equivalent (the width doesn't change during a sample).
### 5. User Impact
- **Who is affected**: Anyone using the hwlat tracer in per-cpu mode on
multi-socket systems
- **Severity**: The tracer produces wildly incorrect results on
  affected systems, reporting hundreds of false positive latency events
  per second
- **Real-world impact**: Users trying to validate hardware latency
characteristics for real-time workloads would get completely
misleading data
- **The data race on count**: Could cause sequence number issues that
affect trace analysis tooling
### 6. Stability Indicators
- **Signed-off by Steven Rostedt** (tracing subsystem maintainer) — high
confidence
- **Concrete test data** provided showing dramatic improvement
- **Link to mailing list** discussion provided
- The fix uses standard kernel primitives (atomic64_t, READ_ONCE) —
well-understood patterns
### 7. Dependency Check
- No dependencies on other commits
- `atomic64_t` and `READ_ONCE()` are available in all stable kernel
versions
- The hwlat tracer exists in all recent stable trees (introduced in
v4.9)
### Stable Kernel Rules Assessment
1. **Obviously correct and tested**: Yes — tested on real hardware with
measurable improvement
2. **Fixes a real bug**: Yes — data race + false sharing causing
incorrect tracer output
3. **Important issue**: Yes — tracer flooding users with false positive
   latency events (hundreds per second on affected systems), plus a
   data race on the counter
4. **Small and contained**: Yes — ~15 lines in one file
5. **No new features**: Correct — this is a pure bug fix
6. **Applies cleanly**: Should apply cleanly to recent stable trees
### Risk vs Benefit
- **Risk**: Near zero — atomic counter and local variable caching are
trivially safe changes
- **Benefit**: High — fixes a data race and makes the hwlat tracer
produce correct results on multi-socket systems
**YES**
kernel/trace/trace_hwlat.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index 2f7b94e98317c..3fe274b84f1c2 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -102,9 +102,9 @@ struct hwlat_sample {
/* keep the global state somewhere. */
static struct hwlat_data {
- struct mutex lock; /* protect changes */
+ struct mutex lock; /* protect changes */
- u64 count; /* total since reset */
+ atomic64_t count; /* total since reset */
u64 sample_window; /* total sampling window (on+off) */
u64 sample_width; /* active sampling portion of window */
@@ -193,8 +193,7 @@ void trace_hwlat_callback(bool enter)
* get_sample - sample the CPU TSC and look for likely hardware latencies
*
* Used to repeatedly capture the CPU TSC (or similar), looking for potential
- * hardware-induced latency. Called with interrupts disabled and with
- * hwlat_data.lock held.
+ * hardware-induced latency. Called with interrupts disabled.
*/
static int get_sample(void)
{
@@ -204,6 +203,7 @@ static int get_sample(void)
time_type start, t1, t2, last_t2;
s64 diff, outer_diff, total, last_total = 0;
u64 sample = 0;
+ u64 sample_width = READ_ONCE(hwlat_data.sample_width);
u64 thresh = tracing_thresh;
u64 outer_sample = 0;
int ret = -1;
@@ -267,7 +267,7 @@ static int get_sample(void)
if (diff > sample)
sample = diff; /* only want highest value */
- } while (total <= hwlat_data.sample_width);
+ } while (total <= sample_width);
barrier(); /* finish the above in the view for NMIs */
trace_hwlat_callback_enabled = false;
@@ -285,8 +285,7 @@ static int get_sample(void)
if (kdata->nmi_total_ts)
do_div(kdata->nmi_total_ts, NSEC_PER_USEC);
- hwlat_data.count++;
- s.seqnum = hwlat_data.count;
+ s.seqnum = atomic64_inc_return(&hwlat_data.count);
s.duration = sample;
s.outer_duration = outer_sample;
s.nmi_total_ts = kdata->nmi_total_ts;
@@ -832,7 +831,7 @@ static int hwlat_tracer_init(struct trace_array *tr)
hwlat_trace = tr;
- hwlat_data.count = 0;
+ atomic64_set(&hwlat_data.count, 0);
tr->max_latency = 0;
save_tracing_thresh = tracing_thresh;
--
2.51.0