Re: [PATCH] iio: dht11: set debug log level for parsing error messages

From: George Stark
Date: Fri Apr 05 2024 - 16:07:00 EST


Hello Harald

On 3/26/24 17:03, Harald Geyer wrote:
Hi George!

Am Montag, dem 25.03.2024 um 23:18 +0300 schrieb George Stark:
On 3/25/24 21:48, Harald Geyer wrote:
Am Montag, dem 25.03.2024 um 19:54 +0300 schrieb George Stark:
Protocol parsing errors could happen due to several reasons like
noise
environment, heavy load on system etc. If to poll the sensor
frequently
and/or for a long period kernel log will become polluted with
error
messages if their log level is err (i.e. on by default).

Yes, these error are often recoverable. (As are many other HW
errors,
that typically are logged. Eg USB bus resets due to EMI)

[...]

The idea is, that these messages help users understand issues with
their HW (like too long cables, broken cables etc). But it is true,
that they will slowly accumulate in many real world scenarios
without
anything being truly wrong.
I agree with you that it's very convenient to just take a look to
dmesg
and see device connection problems at once. But unlike e.g. usb user
has
to actually start reading sensor to perform communication and read
errors will be propagated to the userspace and could be noticed \
handled.

Not really. The log lines contain additional information useful for
understanding the problem with the setup.

Anyway I believe we should use uniform approach for read errors -
currently in the driver there're already dbg messages:

"lost synchronisation at edge %d\n"
"invalid checksum\n"

These errors are usually caused by EMI and there isn't much to do aside
from trying again until we find a time window with less interference.
They are not logged, because in some cases they might be very frequent
and can be handled by the user space client programatically anyway.

I changed log level from err to dbg for the messages:

"Only %d signal edges detected\n"

This mostly indicates a problem with the setup. Long cable, dead
sensor, high (interrupt) load etc.

Its true that this can happen during normal operation. - Usually when
the system takes too long to enter the irq handler.
Usually there's nothing to do about it. And several late-handled
interrupts can lead to any of 4 errors.


But the primary causes are:
1) Your wiring is broken. In this case, the message is immediately
helpful and points you in the right direction. (Only if you understand
the protocol though.)
2) Your sensor is dead or "crashed", which also warrants an error msg
IMO.

My sensor works around 3 years 24\7 (with one reboot per month) and still precise (for home usage at least) - I compared it with several other home temperature sensors.

I've just checked another sensor in my setup (3.3v, cable 15cm) the
error rate is the same. So it's ether EMI or single-core slow cpu.


The "crashed" case is a bit special. Some chips seem to randomly stop
working after a couple of hours and the only remedy is to power cycle
them. This could be done automatically. - I have the sensor power
supply pin on a GPIO and reset it from userspace in my setup. I tried
to work on a version of the driver some years ago, that would
optionally register with a regulator and manage sensor resets from
within the kernel driver. If this was actually implemented, we could
reduce the logging to cases, where the reset didn't solve the problem.

I stopped working on this, because it would have required changes to
the regulator framework, to be actually useful, and the regulator
maintainers didn't seem to keen about them. However, if you want to
pick this up in an effort to reduce unnecessary error conditions and
messages, I certainly would be happy.

"Don't know how to decode data: %d %d %d %d\n"

This would indicate a sensor, that uses the same protocol but an
unsupported data format. This is a permanent error and therefor should
be logged IMO.

I guess, if you have a bad readout due to EMI but the checksum
accidentally matches, then you might get this message too. But this
should be a very rare case.

It's not very rare but sure not permanent. It can happen once per several hundreds / a thousand requests.


They all are from a single callback and say the same thing -
communication problem.

Not really. See above.

If we make all those messages as errors it'd be great to have
mechanism
to disable them e.g. thru module parameter or somehow without
rebuilding
kernel.

No. What you try to change is cosmetic at best. It certainly doesn't
justify adding any complexity.

Since Jonathan deferred to my judgment:

As you can see, I did consider the trade-off between useful diagnostics
and spamming the log carefully. So naturally I'm inclined to reject
your proposal unless it solves an actual problem.

Also people still mail me directly with bogus bug reports about the
driver when really they have some issue with their setup. I fear, if we
reduce diagnostics, it will increase that noise.

If you don't ask those people to send dmesg with CONFIG_DEBUG is on you lost 2 of 4 possible errors and don't see the whole picture. Other dev_dbg messages in this driver are rather helpful.

Anyway thanks for that very interesting information. It's not obvious at all if you're unfamiliar with protocol and the driver. Probably it should be put in some documentation. Without diving into driver's source code those messages are just noise.
Still I consider all those cases are purely debug. When there no hardware support for sensor's protocol and the kernel is not realtime then errors are probable. But often a temp sensor application is
not required to struggle for every measurement. And it's not desirable
to add logrotate to embedded system just because of temp sensor diagnostics.


So I reject your proposed changes, if they are for the sake of
unification. I'm willing to discuss, what the most sensible trade-off
is, but it would need to actually add to the considerations I already
did.

Best regards,
Harald


Those errors can be bypassed by increasing read rate.


I don't consider the dmesg buffer being rotated after a month or
two a
bug. But I suppose this is a corner case. I'll happily accept
whatever
Jonathan thinks is reasonable.

Best regards,
Harald


Signed-off-by: George Stark <gnstark@xxxxxxxxxxxxxxxxx>
---
I use DHT22 sensor with Raspberry Pi Zero W as a simple home
meteo
station.
Even if to poll the sensor once per tens of seconds after month
or
two dmesg
may become full of useless parsing error messages. Anyway those
errors are caught
in the user software thru return values.

  drivers/iio/humidity/dht11.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iio/humidity/dht11.c
b/drivers/iio/humidity/dht11.c
index c97e25448772..e2cbc442177b 100644
--- a/drivers/iio/humidity/dht11.c
+++ b/drivers/iio/humidity/dht11.c
@@ -156,7 +156,7 @@ static int dht11_decode(struct dht11 *dht11,
int
offset)
                 dht11->temperature = temp_int * 1000;
                 dht11->humidity = hum_int * 1000;
         } else {
-               dev_err(dht11->dev,
+               dev_dbg(dht11->dev,
                         "Don't know how to decode data: %d %d %d
%d\n",
                         hum_int, hum_dec, temp_int, temp_dec);
                 return -EIO;
@@ -239,7 +239,7 @@ static int dht11_read_raw(struct iio_dev
*iio_dev,
  #endif

                 if (ret == 0 && dht11->num_edges <
DHT11_EDGES_PER_READ - 1) {
-                       dev_err(dht11->dev, "Only %d signal edges
detected\n",
+                       dev_dbg(dht11->dev, "Only %d signal edges
detected\n",
                                 dht11->num_edges);
                         ret = -ETIMEDOUT;
                 }
--
2.25.1





--
Best regards
George