Re: [PATCH v2] hwmon: (pmbus/ibm-cffps) Add clear_faults debugfs entry

From: Guenter Roeck
Date: Thu Mar 17 2022 - 15:14:36 EST


On 3/17/22 09:12, Brandon Wyman wrote:
On Wed, Mar 16, 2022 at 3:14 PM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:

On 3/16/22 13:03, Brandon Wyman wrote:
On Sun, Mar 13, 2022 at 11:36 PM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:

On 3/11/22 10:10, Brandon Wyman wrote:
Add a clear_faults write-only debugfs entry for the ibm-cffps device
driver.

Certain IBM power supplies require clearing some latched faults in order
to indicate that the fault has indeed been observed/noticed.


That is insufficient, sorry. Please provide the affected power supplies as
well as the affected faults, and confirm that the problem still exists
in v5.17-rc6 or later kernels - or, more specifically, in any kernel which
includes commit 35f165f08950 ("hwmon: (pmbus) Clear pmbus fault/warning
bits after read").

Thanks,
Guenter

Sorry for the delay in responding. I did some testing with commit
35f165f08950. I could not get that code to send the CLEAR_FAULTS
command to the power supplies.

I can update the commit message to be more specific about which power
supplies need this CLEAR_FAULTS sent, and which faults. It is observed
with the 1600W power supplies (2B1E model). The faults that latch are
the VIN_UV and INPUT faults in the STATUS_WORD. The corresponding
STATUS_INPUT fault bits are VIN_UV_FAULT and Unit is Off.


The point is that the respective fault bits should be reset when the
corresponding alarm attributes are read. This isn't about executing
a CLEAR_FAULTS command, but about selectively resetting fault bits
while ensuring that faults are reported at least once. Executing
CLEAR_FAULTS is a big hammer.

With the patch I pointed to in place, input (and other) faults should
be reset after the corresponding alarm attributes are read, assuming
that the condition no longer exists. If that does not happen, we should
fix the problem instead of deploying the big hammer.

Thanks,
Guenter

Okay, I see what you are pointing out there. I had been mostly looking
at the "files" in the debugfs paths. Those do not end up running
through that pmbus_get_boolean() function, so the individual fault
clearing was not being attempted. The fault I was interested in
appears to be associated with in1_lcrti_alarm. Reading that will give
me a 1 if there is a VIN_UV fault, and then it sends 0x10 to
STATUS_INPUT. That clears out VIN_UV, but the STATUS_INPUT command was
returning 0x18. Nothing appears to handle clearing BIT(3), that 0x08
mask.

Should there be some kind of define for BIT(3) over in pmbus.h?
Something like PB_VOLTAGE_OFF? Somehow we need something using that in
sbit of the attributes. I had a quick hack that just OR'ed BIT(3) with
BIT(4) for that PB_VOLTAGE_UV_FAULT. That resulted in a clear of both
bits in STATUS_INPUT, and the faults clearing in STATUS_WORD.

It is not clear if there should be a separate alarm for that "Unit Off
For Insufficient Input Voltage", or if the one for in1_lcrit_alarm
could just be the two bits OR'ed into one mask. I can send a patch
with a proposal on how to fix this one bit not getting cleared.


We don't have a separate standard attribute. I think the best approach
would be to add a mask for bit 3 and or that mask for lcrit in
vin_limit_attrs with PB_VOLTAGE_UV_FAULT. I'd suggest to name the
define something like PB_VOLTAGE_VIN_OFF or PB_VOLTAGE_VIN_FAULT
to clarify that the bit applies to the input.

Thanks,
Guenter