Re: [PATCH] hwmon: (ntc_thermistor) return error instead of clipping on OOB

From: Maud Spierings | GOcontroll
Date: Thu Mar 06 2025 - 08:36:09 EST


From: Guenter Roeck <groeck7@xxxxxxxxx> on behalf of Guenter Roeck <linux@xxxxxxxxxxxx>
Sent: Tuesday, March 4, 2025 4:04 PM
 
>On 3/4/25 06:10, Maud Spierings via B4 Relay wrote:
>> From: Maud Spierings <maudspierings@xxxxxxxxxxxxxx>
>>
>> When the ntc is reading Out Of Bounds instead of clipping to the nearest
>> limit (min/max) return -ENODATA. This prevents malfunctioning sensors
>> from sending a device into a shutdown loop due to a critical trip.
>>
>> This implementation will only work for ntc type thermistors if a ptc
>> type is to be implemented the min/max ohm calculation must be adjusted
>> to take that into account.
>>
>> Signed-off-by: Maud Spierings <maudspierings@xxxxxxxxxxxxxx>
>> ---
>> This patch is a continuation of another discussion [1]. I felt like it
>> should be a new patch, not a v2 as this is a very different change.
thoughts about the theoretic code comment here.
>> I have left the clamping of n to INT_MAX in the code with a comment, but
>> it may be possible to find a better solution to it. One I thought of is
>> to make the ohm field of the ntc_compensation struct a signed int as
>> well. It would get rid of this weird edge case, but it doesn't make
>> sense to allow for negative resistances to be entered into the sensor
>> table.
>>
>> Currently the only feedback this provides to the user is when they
>> manually try to read the temperature and it returns the error. I have
>> added a simple printk to these error points to see how spammy it gets
>> and this is the result:
>>
>> dmesg | grep hwmon
>> [    4.982682] hwmon: sensor out of bounds
>> [    5.249758] hwmon: sensor out of bounds
>> [    5.633729] hwmon: sensor out of bounds
>> [    6.215285] hwmon: sensor out of bounds
>> [    7.073882] hwmon: sensor out of bounds
>> [    7.486620] hwmon: sensor out of bounds
>> [    8.833765] hwmon: sensor out of bounds
>> [   10.785969] hwmon: sensor out of bounds
>> [   13.793722] hwmon: sensor out of bounds
>> [   16.761124] hwmon: sensor out of bounds
>> [   17.889706] hwmon: sensor out of bounds
>> [   25.057715] hwmon: sensor out of bounds
>> [   35.041725] hwmon: sensor out of bounds
>> [   50.110346] hwmon: sensor out of bounds
>> [   72.945283] hwmon: sensor out of bounds
>> [  105.712619] hwmon: sensor out of bounds
>> [  154.863976] hwmon: sensor out of bounds
>> [  164.937104] hwmon: sensor out of bounds
>> [  228.590909] hwmon: sensor out of bounds
>> [  315.365777] hwmon: sensor out of bounds
>> [  464.718403] hwmon: sensor out of bounds
>> [  615.079123] hwmon: sensor out of bounds
>> [  764.496780] hwmon: sensor out of bounds
>>
>> This is with polling-delay set to 1000, it seems to rate-limit itself?
>> But I feel there should be a better way to communicate the potential
>> sensor failure to the user, but I can't think of anything.
>>
>
>Hmm ... we could add "fault" attributes and report a sensor fault
>if uv == 0 or uv >= puv, together with the -ENODATA temperature reading.
>That could distinguish the fault from the "resistor value out of range" case.

I've done some working around that fault attribute, but I can't seem to
find a satisfactory way to implement it. This fault also is not triggered
if anything is disconnected, maybe the 0 case but I can't even get that.

I think leaving it at this current solution is good for now.

>> [1]: https://lore.kernel.org/all/20250304-ntc_min_max-v1-1-b08e70e56459@xxxxxxxxxxxxxx/
>
>Most of the above should be after "---" since it is irrelevant for the commit log.

I believe my cover letter seperated it correctly after my signed-off-by
tag.

>> ---
>>   drivers/hwmon/ntc_thermistor.c | 19 +++++++++++++------
>>   1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/hwmon/ntc_thermistor.c b/drivers/hwmon/ntc_thermistor.c
>> index 0d29c8f97ba7c2f264588b6309b91ca494012ad6..311a60498d409ea068a18590415b2d5b43e73eb1 100644
>> --- a/drivers/hwmon/ntc_thermistor.c
>> +++ b/drivers/hwmon/ntc_thermistor.c
>> @@ -387,12 +387,9 @@ static int get_ohm_of_thermistor(struct ntc_data *data, unsigned int uv)
>>        puo = data->pullup_ohm;
>>        pdo = data->pulldown_ohm;
>>  
>> -     if (uv == 0)
>> -             return (data->connect == NTC_CONNECTED_POSITIVE) ?
>> -                     INT_MAX : 0;
>> -     if (uv >= puv)
>> -             return (data->connect == NTC_CONNECTED_POSITIVE) ?
>> -                     0 : INT_MAX;
>> +     /* faulty adc value */
>> +     if (uv == 0 || uv >= puv)
>> +             return -ENODATA;
>>  
>>        if (data->connect == NTC_CONNECTED_POSITIVE && puo == 0)
>>                n = div_u64(pdo * (puv - uv), uv);
>> @@ -404,6 +401,16 @@ static int get_ohm_of_thermistor(struct ntc_data *data, unsigned int uv)
>>        else
>>                n = div64_u64_safe(pdo * puo * uv, pdo * (puv - uv) - puo * uv);
>>  
>> +     /* sensor out of bounds */
>> +     if (n > data->comp[0].ohm || n < data->comp[data->n_comp-1].ohm)
>> +             return -ENODATA;
>> +
>> +     /*
>> +      * theoretically data->comp[0].ohm can be greater than INT_MAX as it is an
>> +      * unsigned integer, but it doesn't make any sense for it to be so as the
>> +      * maximum return value of this function is INT_MAX, so it will never be
>> +      * able to properly calculate that temperature.
>> +      */
>>        if (n > INT_MAX)
>>                n = INT_MAX;
>
>I am not a friend of theoretic code or comments like this. Please drop.
>The original code was intended to catch out-of-bounds values which would
>otherwise have been reported as error, not to catch unrealistic resistor
>values in the compensation tables.

So, drop the check and comment? Or just drop the comment? I was thinking
to fully remove that check in an earlier comment in my cover letter.

Kind regards,
Maud