Re: [PATCH 2/2] x86/msr: Allow unprivileged read access to some MSRs

From: H. Peter Anvin
Date: Tue May 23 2023 - 16:33:02 EST


On May 23, 2023 12:49:49 PM PDT, Tim Wiederhake <twiederh@xxxxxxxxxx> wrote:
>Delaying access control allows unprivileged processes to
>read specific MSRs, such as IA32_CORE_CAPABILITIES and
>IA32_ARCH_CAPABILITIES. This is helpful for e.g. qemu and
>libvirt who require the raw MSR content to calculate host
>CPU capabilities. Other programs might be interested in
>IA32_EFER for x86-64-v1 detection.
>
>Signed-off-by: Tim Wiederhake <twiederh@xxxxxxxxxx>
>---
> arch/x86/kernel/msr.c | 38 +++++++++++++++++++++++++++++++++-----
> 1 file changed, 33 insertions(+), 5 deletions(-)
>
>diff --git a/arch/x86/kernel/msr.c b/arch/x86/kernel/msr.c
>index 058f2b67d0c7..9485aa7f8161 100644
>--- a/arch/x86/kernel/msr.c
>+++ b/arch/x86/kernel/msr.c
>@@ -50,6 +50,23 @@ enum allow_write_msrs {
>
> static enum allow_write_msrs allow_writes = MSR_WRITES_DEFAULT;
>
>+static int filter_read(struct file *file, u32 reg)
>+{
>+ if (file->private_data)
>+ return 0;
>+
>+ switch (reg) {
>+ case MSR_IA32_CORE_CAPS:
>+ case MSR_IA32_ARCH_CAPABILITIES:
>+ case MSR_EFER:
>+ return 0;
>+ default:
>+ break;
>+ }
>+
>+ return -EPERM;
>+}
>+
> static ssize_t msr_read(struct file *file, char __user *buf,
> size_t count, loff_t *ppos)
> {
>@@ -59,6 +76,10 @@ static ssize_t msr_read(struct file *file, char __user *buf,
> int cpu = iminor(file_inode(file));
> int err = 0;
>
>+ err = filter_read(file, reg);
>+ if (err)
>+ return err;
>+
> if (count < 8)
> return -EINVAL; /* Invalid chunk size */
>
>@@ -71,7 +92,7 @@ static ssize_t msr_read(struct file *file, char __user *buf,
> return 8;
> }
>
>-static int filter_write(u32 reg)
>+static int filter_write(struct file *file, u32 reg)
> {
> /*
> * MSRs writes usually happen all at once, and can easily saturate kmsg.
>@@ -83,6 +104,9 @@ static int filter_write(u32 reg)
> */
> static DEFINE_RATELIMIT_STATE(fw_rs, 30 * HZ, 1);
>
>+ if (!file->private_data)
>+ return -EPERM;
>+
> switch (allow_writes) {
> case MSR_WRITES_ON: return 0;
> case MSR_WRITES_OFF: return -EPERM;
>@@ -113,7 +137,7 @@ static ssize_t msr_write(struct file *file, const char __user *buf,
> if (err)
> return err;
>
>- err = filter_write(reg);
>+ err = filter_write(file, reg);
> if (err)
> return err;
>
>@@ -156,6 +180,9 @@ static long msr_ioctl(struct file *file, unsigned int ioc, unsigned long arg)
> err = -EFAULT;
> break;
> }
>+ err = filter_read(file, regs[1]);
>+ if (err)
>+ return err;
> err = rdmsr_safe_regs_on_cpu(cpu, regs);
> if (err)
> break;
>@@ -176,7 +203,7 @@ static long msr_ioctl(struct file *file, unsigned int ioc, unsigned long arg)
> if (err)
> break;
>
>- err = filter_write(regs[1]);
>+ err = filter_write(file, regs[1]);
> if (err)
> return err;
>
>@@ -202,8 +229,7 @@ static int msr_open(struct inode *inode, struct file *file)
> unsigned int cpu = iminor(file_inode(file));
> struct cpuinfo_x86 *c;
>
>- if (!capable(CAP_SYS_RAWIO))
>- return -EPERM;
>+ file->private_data = (void *)(capable(CAP_SYS_RAWIO));
>
> if (cpu >= nr_cpu_ids || !cpu_online(cpu))
> return -ENXIO; /* No such CPU */
>@@ -245,6 +271,8 @@ static int msr_device_destroy(unsigned int cpu)
>
> static char *msr_devnode(const struct device *dev, umode_t *mode)
> {
>+ if (mode)
>+ *mode = 0644;
> return kasprintf(GFP_KERNEL, "cpu/%u/msr", MINOR(dev->devt));
> }
>

I believe the preferred way to do this is to export "cooked" information in sysfs. /dev/msr really should be considered a legacy, very low-level interface suitable only for restricted environments.

When I wrote this driver, it was a maxim that "root owns the system"; there was no way to restrict privileged interfaces like /dev/(k)mem, /dev/ioports, or kernel module installation from root; any of these interfaces can be trivially used to take full command of the hardware.

This is no longer the case, and it is better for the kernel to only export known good information at the semantic level. There is no inherent guarantee that any of these registers may not contain security sensitive information in the future.

At this point, /dev/msr should be considered a debugging-only interface (for which it is quite useful still.)

So NAK on allowing any kind of access without CAP_SYS_RAWIO.

NAK on your other patch, too, because it is based on an invalid assumption: read/write to /dev/msr does *not* advance the file pointer. This is analogous to /dev/ioports, rather than /dev/mem.

Incidentally, you have touched on a potential issue here: I suspect that we should check CAP_SYS_RAWIO for each access – the old policy of "if you can open it you can give someone else a file descriptor" probably no longer makes any sense, at least without some kind of mechanism to filter access control for a specific fd. (Then again, the idiom of "open then drop privileges" might justify caching the capability.)