Re: [PATCH] hwmon: (ntc_thermistor) return error instead of clipping on OOB

From: Guenter Roeck
Date: Tue Mar 04 2025 - 10:29:11 EST


On 3/4/25 06:10, Maud Spierings via B4 Relay wrote:
From: Maud Spierings <maudspierings@xxxxxxxxxxxxxx>

When the ntc is reading Out Of Bounds instead of clipping to the nearest
limit (min/max) return -ENODATA. This prevents malfunctioning sensors
from sending a device into a shutdown loop due to a critical trip.

This implementation will only work for ntc type thermistors if a ptc
type is to be implemented the min/max ohm calculation must be adjusted
to take that into account.

Signed-off-by: Maud Spierings <maudspierings@xxxxxxxxxxxxxx>
---
This patch is a continuation of another discussion [1]. I felt like it
should be a new patch, not a v2 as this is a very different change.

I have left the clamping of n to INT_MAX in the code with a comment, but
it may be possible to find a better solution to it. One I thought of is
to make the ohm field of the ntc_compensation struct a signed int as
well. It would get rid of this weird edge case, but it doesn't make
sense to allow for negative resistances to be entered into the sensor
table.

Currently the only feedback this provides to the user is when they
manually try to read the temperature and it returns the error. I have
added a simple printk to these error points to see how spammy it gets
and this is the result:

dmesg | grep hwmon
[ 4.982682] hwmon: sensor out of bounds
[ 5.249758] hwmon: sensor out of bounds
[ 5.633729] hwmon: sensor out of bounds
[ 6.215285] hwmon: sensor out of bounds
[ 7.073882] hwmon: sensor out of bounds
[ 7.486620] hwmon: sensor out of bounds
[ 8.833765] hwmon: sensor out of bounds
[ 10.785969] hwmon: sensor out of bounds
[ 13.793722] hwmon: sensor out of bounds
[ 16.761124] hwmon: sensor out of bounds
[ 17.889706] hwmon: sensor out of bounds
[ 25.057715] hwmon: sensor out of bounds
[ 35.041725] hwmon: sensor out of bounds
[ 50.110346] hwmon: sensor out of bounds
[ 72.945283] hwmon: sensor out of bounds
[ 105.712619] hwmon: sensor out of bounds
[ 154.863976] hwmon: sensor out of bounds
[ 164.937104] hwmon: sensor out of bounds
[ 228.590909] hwmon: sensor out of bounds
[ 315.365777] hwmon: sensor out of bounds
[ 464.718403] hwmon: sensor out of bounds
[ 615.079123] hwmon: sensor out of bounds
[ 764.496780] hwmon: sensor out of bounds

This is with polling-delay set to 1000, it seems to rate-limit itself?
But I feel there should be a better way to communicate the potential
sensor failure to the user, but I can't think of anything.


Hmm ... we could add "fault" attributes and report a sensor fault
if uv == 0 or uv >= puv, together with the -ENODATA temperature reading.
That could distinguish the fault from the "resistor value out of range" case.

[1]: https://lore.kernel.org/all/20250304-ntc_min_max-v1-1-b08e70e56459@xxxxxxxxxxxxxx/

Most of the above should be after "---" since it is irrelevant for the commit log.

---
drivers/hwmon/ntc_thermistor.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/hwmon/ntc_thermistor.c b/drivers/hwmon/ntc_thermistor.c
index 0d29c8f97ba7c2f264588b6309b91ca494012ad6..311a60498d409ea068a18590415b2d5b43e73eb1 100644
--- a/drivers/hwmon/ntc_thermistor.c
+++ b/drivers/hwmon/ntc_thermistor.c
@@ -387,12 +387,9 @@ static int get_ohm_of_thermistor(struct ntc_data *data, unsigned int uv)
puo = data->pullup_ohm;
pdo = data->pulldown_ohm;
- if (uv == 0)
- return (data->connect == NTC_CONNECTED_POSITIVE) ?
- INT_MAX : 0;
- if (uv >= puv)
- return (data->connect == NTC_CONNECTED_POSITIVE) ?
- 0 : INT_MAX;
+ /* faulty adc value */
+ if (uv == 0 || uv >= puv)
+ return -ENODATA;
if (data->connect == NTC_CONNECTED_POSITIVE && puo == 0)
n = div_u64(pdo * (puv - uv), uv);
@@ -404,6 +401,16 @@ static int get_ohm_of_thermistor(struct ntc_data *data, unsigned int uv)
else
n = div64_u64_safe(pdo * puo * uv, pdo * (puv - uv) - puo * uv);
+ /* sensor out of bounds */
+ if (n > data->comp[0].ohm || n < data->comp[data->n_comp-1].ohm)
+ return -ENODATA;
+
+ /*
+ * theoretically data->comp[0].ohm can be greater than INT_MAX as it is an
+ * unsigned integer, but it doesn't make any sense for it to be so as the
+ * maximum return value of this function is INT_MAX, so it will never be
+ * able to properly calculate that temperature.
+ */
if (n > INT_MAX)
n = INT_MAX;

I am not a friend of theoretic code or comments like this. Please drop.
The original code was intended to catch out-of-bounds values which would
otherwise have been reported as error, not to catch unrealistic resistor
values in the compensation tables.

Thanks,
Guenter