Re: [PATCH v4 2/9] Documentation: hwmon: Add OCC documentation

From: Eddie James
Date: Thu Aug 30 2018 - 17:29:39 EST




On 07/25/2018 11:36 AM, Guenter Roeck wrote:
On Wed, Jul 11, 2018 at 04:01:31PM -0500, Eddie James wrote:
Document the hwmon interface for the OCC.

Signed-off-by: Eddie James <eajames@xxxxxxxxxxxxxxxxxx>
---
Documentation/hwmon/occ | 73 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)
create mode 100644 Documentation/hwmon/occ

diff --git a/Documentation/hwmon/occ b/Documentation/hwmon/occ
new file mode 100644
index 0000000..465fa1a
--- /dev/null
+++ b/Documentation/hwmon/occ
@@ -0,0 +1,73 @@
+Kernel driver occ-hwmon
+=======================
+
+Supported chips:
+ * POWER8
+ * POWER9
+
+Author: Eddie James <eajames@xxxxxxxxxxxxxxxxxx>
+
+Description
+-----------
+
+This driver supports hardware monitoring for the On-Chip Controller (OCC)
+embedded on POWER processors. The OCC is a device that collects and aggregates
+sensor data from the processor and the system. The OCC can provide the raw
+sensor data as well as perform thermal and power management on the system.
+
+The P8 version of this driver is a client driver of I2C. It may be probed
+manually if an "ibm,p8-occ-hwmon" compatible device is found under the
+appropriate I2C bus node in the device-tree.
+
+The P9 version of this driver is a client driver of the FSI-based OCC driver.
+It will be probed automatically by the FSI-based OCC driver.
+
+Sysfs entries
+-------------
+
+The following attributes are supported. All attributes are read-only unless
+specified.
+
+temp[1-n]_label OCC sensor id.
+temp[1-n]_input Measured temperature in millidegrees C.
+[with temperature sensor version 2+]
+ temp[1-n]_fru_type Given FRU (Field Replaceable Unit) type.
What is this ? An integer ? A string ?

+ temp[1-n]_fault Temperature sensor fault.
+
+freq[1-n]_label OCC sensor id.
+freq[1-n]_input Measured frequency.
What does that have to do with hardware monitoring, and what exactly does it
measure ? AC voltage frequency ? Frequency of rainstorms in the surrounding
area ?

+
+power[1-n]_label OCC sensor id.
+power[1-n]_input Measured power in microwatts.
+power[1-n]_update_tag Number of 250us samples represented in accumulator.
update_tag to represent number of samples ? Odd choice for
an attribute name. Why not "_samples" ? Also, if each sample
represents a specific amount of time, why not report a time ?

+power[1-n]_accumulator Accumulation of 250us power readings.
There is no explanation of "accumulation". Is this the energy ?
If so, why not use energy attributes ? And what is the unit of
this measurement ?
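
To make the question concrete, here is a minimal sketch of one possible reading of the two attributes. It assumes that power[1-n]_accumulator is a running sum of power readings taken once per 250us and that power[1-n]_update_tag counts those readings. The patch does not state the unit of the accumulator, so the sketch leaves it open. The name summarize_accumulator is hypothetical and is not part of the driver.

```python
# 250us per sample, as stated in the attribute descriptions.
SAMPLE_PERIOD_S = 250e-6


def summarize_accumulator(accumulator, update_tag):
    """Return (average_reading, elapsed_seconds, reading_seconds).

    Assumption: accumulator is the sum of update_tag power readings,
    one reading per 250us. The unit of a reading is not documented.
    """
    if update_tag <= 0:
        raise ValueError("update_tag must be positive")
    # Mean of the individual readings, in the (unknown) reading unit.
    average = accumulator / update_tag
    # Time span covered by the accumulated samples.
    elapsed_s = update_tag * SAMPLE_PERIOD_S
    # Sum of readings times the sample period. This is energy in joules
    # only if a single reading is in watts, which is an assumption.
    integral = accumulator * SAMPLE_PERIOD_S
    return average, elapsed_s, integral
```

Under that reading, the sample count is equivalent to a time span, and the accumulator is equivalent to an energy value once the reading unit is known.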

+[with power sensor version 2+]
+ power[1-n]_function_id Identifies what the power reading is for.
String ? Number ? Slot index ? Bitmap ? And why isn't that reported
in the label ? After all, that is what the label is supposed to be
used for.

+ power[1-n]_apss_channel Indicates APSS channel.
+
Does that provide any value to the user ?

+[power version 0xa0 only]
+power1_id OCC sensor id.
This is inconsistent with the other attributes and even with itself.

+power[1-n]_label Sensor type, "system", "proc", "vdd", or "vdn".
+power[1-n]_input Most recent power reading in microwatts.
Overall I am left with no idea what
_id
_label
_function_id
_apss_channel
are and how they relate to each other, except that it all looks quite
inconsistent. You might want to consider merging all those attributes into
the label in some consistent way.

+power[1-n]_update_tag Number of samples in the accumulator.
+power[1-n]_accumulator Accumulation of power readings.
Same as above.

+[with sensor type "system" and "proc" only]
+ power[1-n]_update_time Time in us that the power value is read.
+
+caps1_current Current OCC power cap in watts.
+caps1_reading Current system output power in watts.
+caps1_norm Power cap without redundant power.
+caps1_max Maximum power cap.
Why do those have to be non-standard attributes ? Please explain why you can not
use power[1-n]_cap attributes.

+[caps version 1 and 2 only]
+ caps1_min Minimum power cap.
+[caps version 3+]
+ caps1_min_hard Hard minimum cap that can be set and held.
+ caps1_min_soft Soft minimum cap below hard, not guaranteed.
+caps1_user The powercap specified by the user. Will be 0 if no
+ user powercap exists. This attribute is read-write.
+[caps version 1+]
+ caps1_user_source Indicates how the user power limit was set.
+
+extn[1-n]_label ASCII id or sensor id.
+extn[1-n]_flags Indicates type of label attribute.
+extn[1-n]_input Data.
Great non-explanation.

Not reviewing the series further. I am sure I asked that each non-standard
attribute is explained. There is neither an explanation why the attributes
are needed nor, in many cases, why non-standard attributes were chosen
instead of standard ones. On top of that, the non-standard attributes are
not even documented properly, leaving the reader wondering not only why
they are needed, but what they are used for in the first place.

Hi,

Thanks for the feedback, Guenter. I am about to post a new patch set that fixes many of the issues you pointed out and adds better descriptions. All attributes except two should now conform to the standard hwmon attributes. The exceptions are the user power cap and the user power cap source, which are needed to make power management decisions while controlling the system. Please look at their documentation to see whether they are acceptable.
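
For illustration only, here is a minimal sketch of how a userspace tool might consume standard hwmon attributes. It assumes the units given in the documentation quoted above: temp[1-n]_input in millidegrees C and power[1-n]_input in microwatts. The function name read_hwmon and the directory passed to it are hypothetical and are not part of the patch set.

```python
import os
import re

# Divisors that convert raw sysfs integers to degrees C and watts.
# temp*_input is in millidegrees C, power*_input is in microwatts.
_DIVISOR = {"temp": 1000, "power": 1000000}
_ATTR_RE = re.compile(r"^(temp|power)(\d+)_input$")


def read_hwmon(hwmon_dir):
    """Return {(kind, index): {"label": str or None, "value": float}}.

    hwmon_dir is a directory that contains hwmon attribute files,
    for example one of the entries under /sys/class/hwmon/.
    """
    readings = {}
    for name in sorted(os.listdir(hwmon_dir)):
        match = _ATTR_RE.match(name)
        if not match:
            continue
        kind, index = match.group(1), int(match.group(2))
        with open(os.path.join(hwmon_dir, name)) as f:
            raw = int(f.read().strip())
        # The label is optional; fall back to None when it is absent.
        label = None
        label_path = os.path.join(hwmon_dir, "%s%d_label" % (kind, index))
        if os.path.exists(label_path):
            with open(label_path) as f:
                label = f.read().strip()
        readings[(kind, index)] = {
            "label": label,
            "value": raw / _DIVISOR[kind],
        }
    return readings
```

A tool written this way needs no driver-specific knowledge as long as the attributes follow the standard names and units.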

Let me know what you think!
Thanks,
Eddie


Guenter