Re: [PATCH v7 1/3] x86/fpu: Measure the Latency of XSAVE and XRSTOR

From: Yi Sun
Date: Thu Sep 21 2023 - 15:44:08 EST


On 21.09.2023 09:03, Ingo Molnar wrote:

* Yi Sun <yi.sun@xxxxxxxxx> wrote:

-#define XSTATE_XRESTORE(st, lmask, hmask) \
+#define __XSTATE_XRESTORE(st, lmask, hmask) \
asm volatile(ALTERNATIVE(XRSTOR, \
XRSTORS, X86_FEATURE_XSAVES) \
"\n" \
@@ -140,6 +143,35 @@ static inline u64 xfeatures_mask_independent(void)
: "D" (st), "m" (*st), "a" (lmask), "d" (hmask) \
: "memory")

+#if defined(CONFIG_X86_DEBUG_FPU)
+#define XSTATE_XSAVE(fps, lmask, hmask, err) \
+ do { \
+ struct fpstate *f = fps; \
+ u64 tc = -1; \
+ if (tracepoint_enabled(x86_fpu_latency_xsave)) \
+ tc = trace_clock(); \
+ __XSTATE_XSAVE(&f->regs.xsave, lmask, hmask, err); \
+ if (tracepoint_enabled(x86_fpu_latency_xsave)) \
+ trace_x86_fpu_latency_xsave(f, trace_clock() - tc);\
+ } while (0)
+
+#define XSTATE_XRESTORE(fps, lmask, hmask) \
+ do { \
+ struct fpstate *f = fps; \
+ u64 tc = -1; \
+ if (tracepoint_enabled(x86_fpu_latency_xrstor)) \
+ tc = trace_clock(); \
+ __XSTATE_XRESTORE(&f->regs.xsave, lmask, hmask); \
+ if (tracepoint_enabled(x86_fpu_latency_xrstor)) \
+ trace_x86_fpu_latency_xrstor(f, trace_clock() - tc);\
+ } while (0)

This v7 version does not adequately address the review feedback I gave for
v6: it adds tracing overhead to potential hot paths, and putting it behind
CONFIG_X86_DEBUG_FPU is not a solution either: it's default-y, so de-facto
enabled on all major distros...

Hi Ingo,

Appreciate your comments!

What do you think of adding a new config option here instead of
CONFIG_X86_DEBUG_FPU, defaulting to n?
That way, all of the overhead would be removed from the hot paths:
```
#else
#define XSTATE_XSAVE(fps, lmask, hmask, err) \
	__XSTATE_XSAVE(&(fps)->regs.xsave, lmask, hmask, err)
#define XSTATE_XRESTORE(fps, lmask, hmask) \
	__XSTATE_XRESTORE(&(fps)->regs.xsave, lmask, hmask)
#endif
```
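
For reference, a minimal Kconfig sketch of such an option (the name
X86_FPU_TRACE_LATENCY and its wording are hypothetical, just to show the
default-n shape):

```
config X86_FPU_TRACE_LATENCY
	bool "Trace XSAVE/XRSTOR latency"
	depends on X86 && TRACEPOINTS
	default n
	help
	  Add tracepoints that measure the latency of the XSAVE and
	  XRSTOR instruction families. This adds overhead to FPU
	  context switching; say N unless you are profiling FPU
	  state handling.
```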

It seems unnecessarily complex: why does it have to measure latency
directly? Tracepoints *by default* come with event timestamps. A latency
measurement tool should be able to subtract two timestamps to extract the
latency between two tracepoints...

In fact, function tracing is enabled on all major Linux distros:

kepler:~/tip> grep FUNCTION_TRACER /boot/config-6.2.0-33-generic
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_FUNCTION_TRACER=y

Why not just enable function tracing for the affected FPU context switching
functions?
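
As a sketch of that approach (tracefs knobs only; needs root, and assumes
the default /sys/kernel/tracing mount point), the function_graph tracer
already reports a per-call duration for filtered functions:

```shell
# Trace only the FPU context-switch functions and record how long
# each invocation takes (function_graph prints per-call duration).
cd /sys/kernel/tracing
echo save_fpregs_to_fpstate restore_fpregs_from_fpstate > set_ftrace_filter
echo function_graph > current_tracer
echo 1 > tracing_on
# ... run the workload of interest, then inspect the durations:
cat trace
echo 0 > tracing_on
```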

The relevant functions are already standalone in a typical kernel:

xsave: # ffffffff8103cfe0 T save_fpregs_to_fpstate
xrstor: # ffffffff8103d160 T restore_fpregs_from_fpstate
xrstor_supervisor: # ffffffff8103dc50 T fpu__clear_user_states

... and FPU context switching overhead dominates the cost of these
functions.

I am aware of these tracepoints, but we added new ones for two reasons:
1. As described in the patch comments, it is more accurate to calculate
the latency within a single trace event, which does NOT include the
latency of the trace functions themselves. In our experiments, the cost
of the trace functions is much higher than the latency of XSAVES itself.

Calculating the latency by subtracting two timestamps necessarily
includes the latency of the tracepoints themselves, which would drown
out the actual latency data.

2. We want to measure at a finer granularity. The newly added tracepoints
dump XINUSE and RFBM each time they are called. By doing so, we can
collect the latency data and group the analysis by each FPU feature
indicated in the XCR0 bits.
Thanks
--Sun, Yi