Calculate the latency of instructions xsave and xrstor with new trace
points x86_fpu_latency_xsave and x86_fpu_latency_xrstor.
The delta TSC can be calculated within a single trace event.
option considered was to have 2 separated trace events marking the
start and finish of the xsave/xrstor instructions. The delta TSC was
calculated from the 2 trace points in user space, but there was
significant overhead added by the trace function itself.
To enable it, CONFIG_X86_DEBUG_FPU and CONFIG_TRACEPOINTS are required.
The compiler can get rid of all the extra crust when either of the two
configs is disabled.
If both of the two configs are enabled, xsave/xrstor_tracing_enabled
would be reduced to a static check for tracing enabled. Thus, in the
fast path there would be only 2 additional static checks.
Leave the noise here instead of additional
conditions before calling the x86_fpu_latency_* because that makes the
latency more accurate and it's easy to filer the noise out by the
following consuming script.
+DECLARE_EVENT_CLASS(x86_fpu_latency,
+ TP_PROTO(struct fpstate *fpstate, u64 dtsc),
+ TP_ARGS(fpstate, dtsc),
+
+ TP_STRUCT__entry(
+ __field(struct fpstate *, fpstate)
+ __field(u64, dtsc)
+ __field(u64, rfbm)
+ __field(u64, xinuse)
+ ),
+
+ TP_fast_assign(
+ __entry->fpstate = fpstate;
+ __entry->dtsc = dtsc;
+ __entry->rfbm = fpstate->xfeatures;
+ __entry->xinuse = fpstate->regs.xsave.header.xfeatures;
+ ),
+
+ TP_printk("x86/fpu: latency:%lld RFBM:0x%llx XINUSE:0x%llx",
+ __entry->dtsc,
+ __entry->rfbm,
+ __entry->xinuse
+ )
+);
+