Re: [patch 04/12] clockevent unbind: use smp_call_func_single_fail

From: Marcelo Tosatti
Date: Wed Feb 07 2024 - 08:13:10 EST


On Wed, Feb 07, 2024 at 12:55:59PM +0100, Thomas Gleixner wrote:
> On Tue, Feb 06 2024 at 15:49, Marcelo Tosatti wrote:
> > Convert clockevents_unbind from smp_call_function_single
> > to smp_call_func_single_fail, which will fail in case
> > the target CPU is tagged as block interference CPU.
>
> Seriously? This is a root only operation. So if root wants to interfere
> then so be it.

Hi Thomas!

OK, so the problem is the following: due to software complexity, one is
often not aware of all operations that might take place.

For example:

https://lore.kernel.org/lkml/Y6mXDUZkII5OnuE8@tpad/T/

[PATCH] hwmon: coretemp: avoid RDMSR interruptions to isolated CPUs

The coretemp driver uses rdmsr_on_cpu calls to read
MSR_IA32_PACKAGE_THERM_STATUS/MSR_IA32_THERM_STATUS registers,
which contain information about current core temperature.

For certain low latency applications, the RDMSR interruption exceeds
the applications requirements.

So disable reading of crit_alarm and temp files via /sys, in case
CPU isolation is enabled.

Temperature information from the housekeeping cores should be
sufficient to infer die temperature.

Signed-off-by: Marcelo Tosatti <mtosatti@xxxxxxxxxx>

diff --git a/drivers/hwmon/coretemp.c b/drivers/hwmon/coretemp.c
index 9bee4d33fbdf..30a35f4130d5 100644
--- a/drivers/hwmon/coretemp.c
+++ b/drivers/hwmon/coretemp.c
@@ -27,6 +27,7 @@
#include <asm/msr.h>
#include <asm/processor.h>
#include <asm/cpu_device_id.h>
+#include <linux/sched/isolation.h>

#define DRVNAME "coretemp"

@@ -121,6 +122,10 @@ static ssize_t show_crit_alarm(struct device *dev,
struct platform_data *pdata = dev_get_drvdata(dev);
struct temp_data *tdata = pdata->core_data[attr->index];

+
+ if (!housekeeping_cpu(tdata->cpu, HK_TYPE_MISC))
+ return -EINVAL;
+
mutex_lock(&tdata->update_lock);
rdmsr_on_cpu(tdata->cpu, tdata->status_reg, &eax, &edx);
mutex_unlock(&tdata->update_lock);
@@ -158,6 +163,8 @@ static ssize_t show_temp(struct device *dev,

/* Check whether the time interval has elapsed */
if (!tdata->valid || time_after(jiffies, tdata->last_updated + HZ)) {
+ if (!housekeeping_cpu(tdata->cpu, HK_TYPE_MISC))
+ return -EINVAL;
rdmsr_on_cpu(tdata->cpu, tdata->status_reg, &eax, &edx);
/*
* Ignore the valid bit. In all observed cases the register


In this case, a userspace application (collecting telemetry data), was
responsible for reading the sysfs files.

Now think of all possible paths, from userspace, that lead to kernel
code that ends up in smp_call_function_* variants (or other functions
that cause IPIs to isolated CPUs).

The alternative, from blocking this in the kernel, would be to validate all
userspace software involved in your application, to ensure it won't end
up in the kernel sending IPIs. Which is impractical, isnt it ?
(or rather, with such option in the kernel, it would be possible to run
applications which have not been validated, since the kernel would fail
the operation that results in IPI to isolated CPU).

So the idea would be an additional "isolation mode", which when enabled,
would disallow the IPIs. Its still possible for root user to disable
this mode, and retry the operation.

So lets say i want to read MSRs on a given CPU, as root.

You'd have to:

1) readmsr on given CPU (returns -EPERM or whatever), since the
"block interference" mode is enabled for that CPU.

2) Disable that CPU in the block interference cpumask.

3) readmsr on the given CPU (success).

4) Re-enable CPU in block interference cpumask, if desired.


(BTW, better ideas than the cpumask are welcome, but it seems the
realibility aspect of something similar to this is necessary).

Thanks!