x86/fpu: Inaccurate AVX-512 Usage Tracking via arch_status

From: chuang

Date: Mon Oct 27 2025 - 03:51:34 EST


Dear FPU/x86 Maintainers,

I am writing to report an issue concerning the accuracy of AVX-512
usage tracking, specifically when querying the information via
'/proc/<pid>/arch_status' on systems supporting the instruction set.

This report references the mechanism introduced by the following
patch: https://lore.kernel.org/all/20190117183822.31333-1-aubrey.li@xxxxxxxxx/T/#u

I have validated the patch's effect in modern environments supporting
AVX-512 (e.g., Intel Xeon Gold, AMD Zen4) and found that the tracking
mechanism does not accurately reflect the actual AVX-512 instruction
usage by the process.

Test Environment:
- CPU: Intel Xeon Gold (AVX-512 supported)
- Test Program: periodic_wake.c (verified via objdump to contain no
  AVX-512 instructions)
- Test Goal: compare the AVX-512 execution status reported by the perf
  PMU with the status reported by procfs arch_status.

perf PMU:

$ perf stat -e instructions,cycles,fp_arith_inst_retired.512b_packed_double,fp_arith_inst_retired.512b_packed_single,fp_arith_inst_retired.8_flops,fp_arith_inst_retired2.128bit_packed_bf16,fp_arith_inst_retired2.256bit_packed_bf16,fp_arith_inst_retired2.512bit_packed_bf16 \
    ./periodic_wake > /dev/null
^C./periodic_wake: Interrupt

Performance counter stats for './periodic_wake':

        2,329,116      instructions                                #    2.86  insn per cycle   (33.57%)
          814,040      cycles                                                                  (56.61%)
                0      fp_arith_inst_retired.512b_packed_double                                (9.82%)
    <not counted>      fp_arith_inst_retired.512b_packed_single                                (0.00%)
    <not counted>      fp_arith_inst_retired.8_flops                                           (0.00%)
    <not counted>      fp_arith_inst_retired2.128bit_packed_bf16                               (0.00%)
    <not counted>      fp_arith_inst_retired2.256bit_packed_bf16                               (0.00%)
    <not counted>      fp_arith_inst_retired2.512bit_packed_bf16                               (0.00%)

1.366220977 seconds time elapsed

0.000000000 seconds user
0.002253000 seconds sys


procfs arch_status:

$ cat /proc/$(pgrep -f "^./periodic_wake")/arch_status
AVX512_elapsed_ms: 44
$ cat /proc/$(pgrep -f "^./periodic_wake")/arch_status
AVX512_elapsed_ms: 64
$ cat /proc/$(pgrep -f "^./periodic_wake")/arch_status
AVX512_elapsed_ms: 91
$ cat /proc/$(pgrep -f "^./periodic_wake")/arch_status
AVX512_elapsed_ms: 50

Based on the observed behavior and a review of the referenced patch,
my hypothesis is:

On AVX-512 capable systems, the implementation appears to record the
current timestamp into 'task->thread.fpu.avx512_timestamp' upon any
task switch, irrespective of whether the task has actually executed an
AVX-512 instruction.

This continuous updating of the timestamp, even for non-AVX-512 tasks,
results in misleading non-zero values for AVX512_elapsed_ms, rendering
the mechanism ineffective for accurately determining if a task is
actively utilizing AVX-512.

Could you please confirm if this analysis is correct and advise on the
appropriate next steps to resolve this discrepancy?

'periodic_wake.c':

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <errno.h>

// Define wakeup interval as 100 milliseconds
#define INTERVAL_MS 100

int main(void) {
    // Convert milliseconds to nanoseconds
    long interval_ns = (long)INTERVAL_MS * 1000000L;

    // timespec structs used for nanosleep
    struct timespec requested;
    struct timespec remaining;

    // Initialize the requested time structure
    requested.tv_sec = 0;
    requested.tv_nsec = interval_ns;

    printf("C Periodic Wakeup Program started (Interval: %dms, "
           "%.9ldns). Press Ctrl+C to stop.\n",
           INTERVAL_MS, interval_ns);

    long long counter = 0;

    while (1) {
        counter++;

        // Print current wakeup information
        printf("Wakeup #%lld: Continuing execution.\n", counter);

        // Use nanosleep for high-precision sleep.
        // If nanosleep is interrupted by a handled signal, it returns -1
        // with errno set to EINTR and stores the remaining time in
        // 'remaining'. To maintain accurate periodicity, we re-sleep for
        // the remaining time if an interruption occurs.
        remaining.tv_sec = requested.tv_sec;
        remaining.tv_nsec = requested.tv_nsec;

        int result;

        do {
            // Sleep
            result = nanosleep(&remaining, &remaining);

            // Check return value
            if (result == -1) {
                if (errno != EINTR) {
                    // Other error, print error and exit
                    perror("nanosleep error");
                    return 1;
                }
                // Interrupted by a signal (e.g., a debugger attach or a
                // caught signal); continue sleeping for the remaining time
                printf("[Interrupted] nanosleep was interrupted by "
                       "a signal, sleeping for remaining %.3fms\n",
                       (double)remaining.tv_sec * 1000.0 +
                       (double)remaining.tv_nsec / 1000000.0);
            }
            // Only EINTR reaches this point with result == -1, so errno
            // need not be re-read after printf may have changed it
        } while (result == -1);

        // nanosleep returned 0, continue to the next loop iteration
    }

    return 0; // Theoretically unreachable
}


Thank you for your time and assistance.

Best regards,