Re: [PATCH v3 1/2] x86, msr: allow rdmsr_safe_on_cpu() to schedule

From: Ingo Molnar
Date: Mon Mar 26 2018 - 02:40:51 EST



* Borislav Petkov <bp@xxxxxxxxx> wrote:

> On Sat, Mar 24, 2018 at 07:29:48AM -0700, Eric Dumazet wrote:
> > It is named gsysd, "Google System Tool", a daemon+cli that is run
> > on all machines in production to provide a generic interface
> > for interacting with the system hardware.
>
> So I'm wondering if poking at the hardware like that is a really optimal
> design. Maybe it would be cleaner if the OS would provide properly
> abstracted sysfs interfaces instead of raw MSRs.

It's not ideal to read /dev/msr.

A SysFS interface to enumerate 'system statistics' MSRs already exists via perf,
under:

/sys/bus/event_source/devices/msr/events/

This is for simple platform specific hardware statistics counters.

This is implemented in arch/x86/events/msr.c, with a number of MSRs already
abstracted out:

enum perf_msr_id {
PERF_MSR_TSC = 0,
PERF_MSR_APERF = 1,
PERF_MSR_MPERF = 2,
PERF_MSR_PPERF = 3,
PERF_MSR_SMI = 4,
PERF_MSR_PTSC = 5,
PERF_MSR_IRPERF = 6,
PERF_MSR_THERM = 7,
PERF_MSR_THERM_SNAP = 8,
PERF_MSR_THERM_UNIT = 9,
PERF_MSR_EVENT_MAX,
};

It's very easy to add new MSRs:

9ae21dd66b97: perf/x86/msr: Add support for MSR_IA32_THERM_STATUS
aaf248848db5: perf/x86/msr: Add AMD IRPERF (Instructions Retired) performance counter
8a2242618477: perf/x86/msr: Add AMD PTSC (Performance Time-Stamp Counter) support

This commit added an MSR to msr-index and added MSR event support for it as well
with proper CPU-ID dependent enumeration:

8a2242618477: perf/x86/msr: Add AMD PTSC (Performance Time-Stamp Counter) support

arch/x86/events/msr.c | 8 ++++++++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 1 +
3 files changed, 10 insertions(+)

More complex MSR value encodings are supported as well - see for example
MSR_IA32_THERM_STATUS, which is encoded as:

/* if valid, extract digital readout, other set to -1 */
now = now & (1ULL << 31) ? (now >> 16) & 0x3f : -1;
local64_set(&event->count, now);

If an MSR counter is a plain integer counter then no such code has to be added.

Only those MSRs that are valid on a system are 'live', and thus tooling can
enumerate them programmatically and has easy symbolic access to them:

galatea:~> perf list | grep msr/

msr/aperf/ [Kernel PMU event]
msr/cpu_thermal_margin/ [Kernel PMU event]
msr/mperf/ [Kernel PMU event]
msr/smi/ [Kernel PMU event]
msr/tsc/ [Kernel PMU event]

These can be used as simple counters that are read - and it's using perf not
/dev/msr so they are fast and scalable - but the full set of perf features is also
available, so we can do:

galatea:~> perf stat -a -e msr/aperf/ sleep 1

Performance counter stats for 'system wide':

354,442,500 msr/aperf/

1.001918252 seconds time elapsed

This interface is vastly superior to /dev/msr: it does not have the various
usability, scalability and security limitations of /dev/msr.

/dev/msr is basically for debugging.

If performance matters then the relevant MSR events should be added and gsysd
should be updated to support MSR events.

Thanks,

Ingo