Re: [PATCH v9,4/7] thermal: mediatek: Add LVTS driver for mt8192 thermal zones

From: Balsam CHIHI
Date: Mon Sep 19 2022 - 12:31:43 EST


Hi Daniel,

Thank you very much for your time to do this review.
You will find below answers to your questions. And some questions from
my side too.
The comments that are not present in this email, are already implemented.
Please let me know if you have any additional questions.

> > Add LVTS v4 (Low Voltage Thermal Sensor) driver to report junction
> > temperatures in MediaTek SoC mt8192 and register the maximum temperature
> > of sensors and each sensor as a thermal zone.
>
> As it is a new driver submission, a more detailed hardware description
> of the sensors may help to have a better understanding of the driver
> implementation
>
> I think also the patch size is too big and the driver should be
> submitted with a reduced number of changes.
>
> The header file should be simplified as much as possible. As well as the
> structures.
>
> The driver initialization should be lvts_v4_init -> lvts_init, that
> means one function to init and one to exit. No other functions being
> called from lvts_v4 to lvts
>

Would you please explain this point further?

> Overall the driver is very complex and hard to review. I tried to review
> as much as possible but there is too much code. So I gave up the review
> after digging into too many details to be reworked.
>
> See below
>

[...]

> > diff --git a/drivers/thermal/mediatek/lvts_thermal.c b/drivers/thermal/mediatek/lvts_thermal.c
> > new file mode 100644
> > index 000000000000..a1681b914c69
> > --- /dev/null
> > +++ b/drivers/thermal/mediatek/lvts_thermal.c
> > @@ -0,0 +1,861 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (c) 2022 MediaTek Inc.
> > + */

[...]

> > +static void lvts_reset(struct lvts_data *lvts_data)
>
> error handling: static int ...
>
> > +{
> > + if (lvts_data->reset)
> > + reset_control_assert(lvts_data->reset);
>
> error handling
>
> > +
> > + if (lvts_data->reset)
> > + reset_control_deassert(lvts_data->reset);
>
> error handling
>
> It is the same condition, both calls can go under the same block.
>
>
> BTW, why assert + deassert ?
>

For more than one reset signal.

> > +}
> > +
> > +static void device_identification(struct lvts_data *lvts_data)
>
> error handling: static int ...
>
> > +{
> > + void __iomem *base;
> > + struct device *dev = lvts_data->dev;
> > + unsigned int i, data;
> > +
> > + for (i = 0; i < lvts_data->num_tc; i++) {
> > + base = GET_BASE_ADDR(lvts_data, i);
> > + writel(ENABLE_LVTS_CTRL_CLK, LVTSCLKEN_0 + base);
>
> Isn't it already done with the clk framework ?
>

No. this is the LVTS hardware internal clock :
The clocks of the Thermal Controllers.

> > + lvts_write_device(lvts_data, RESET_ALL_DEVICES(lvts_data), i);
>
> Why a call to RESET_*ALL*_DEVICES in the loop? I'm confused, it is
> resetting all the devices but with a 'tc_id' passed as parameter
>
> Is it possible to use the assert reset for that ?
>

“ALL” means all sensors (TS) in this Thermal Controller (TC).

> > + writel(READ_BACK_DEVICE_ID(lvts_data), LVTS_CONFIG_0 + base);
>
> What means READ_BACK?
>

Read and update the sensor in the TC.

> > + usleep_range(20, 30);
>
> Why is that needed ?
>

Wait for the hardware to be ready.
it has been reduced to the lowest possible value "usleep_range(2, 4)",
but it can't be removed : lower than that the hardware malfunctions.

> > + /* Check LVTS device ID */
> > + data = (readl(LVTS_ID_0 + base) & DEVICE_REG_DATA);
> > + if (data != (lvts_data->tc->dev_id + i))
> > + dev_err(dev, "LVTS_TC_%d, Device ID should be 0x%x, but 0x%x\n",
> > + i, (lvts_data->tc->dev_id + i), data);
>
> I'm confused, what is the purpose of the device_identification()
> function if you already know the id? When this situation can happen?
>

It’s a hardware initialization flow to identify if the correct
hardware module exists and is ready for initialization.

> > + }
> > +}

[...]

> > +static void set_hw_filter(struct lvts_data *lvts_data, int tc_id)
> > +{
> > + void __iomem *base = GET_BASE_ADDR(lvts_data, tc_id);
> > + struct device *dev = lvts_data->dev;
> > + const struct lvts_tc_settings *tc = lvts_data->tc;
> > + unsigned int option = tc[tc_id].hw_filter & 0x7;
> > +
> > + /*
> > + * hw filter
> > + * 000: Get one sample
> > + * 001: Get 2 samples and average them
> > + * 010: Get 4 samples, drop max and min, then average the rest of 2 samples
> > + * 011: Get 6 samples, drop max and min, then average the rest of 4 samples
> > + * 100: Get 10 samples, drop max and min, then average the rest of 8 samples
> > + * 101: Get 18 samples, drop max and min, then average the rest of 16 samples
> > + */
> > + option = (option << 9) | (option << 6) | (option << 3) | option;
>
> I'm missing to understand, can you explain this bit changes?
>

The 3 bits are one set defined by hardware usage.

> > + writel(option, LVTSMSRCTL0_0 + base);
> > + dev_dbg(dev, "lvts_tc_%d, LVTSMSRCTL0_0= 0x%x\n", tc_id, readl(LVTSMSRCTL0_0 + base));
> > +}
> > +
>
> This function deserves an explanation. Why a dominator index exists?
>

The dominator index is the Thermal Sensor that will be used as a
sensing point to reboot the HW in critical temperature.
This function sets a thermal sensor for that purpose.

[...]

> > +static void disable_hw_reboot_interrupt(struct lvts_data *lvts_data, int tc_id)
> > +{
> > + void __iomem *base = GET_BASE_ADDR(lvts_data, tc_id);
> > + unsigned int temp;
> > +
> > + /*
> > + * LVTS thermal controller has two interrupts for thermal HW reboot.
> > + * One is for AP SW and the other is for RGU.
>
> May be reword "thermal controller" to something like "thermal domain" or
> whatever. It is confusing with the thermal controller contained in
> lvts_data.
>
> > + * The interrupt of AP SW can be turned off by a bit of a register,
> > + * but the other for RGU cannot.
> > + * To prevent rebooting device accidentally, we are going to add
> > + * a huge offset 0x3FFF to LVTS and make it always report extremely low temperature.
> > + * LVTS always adds the offset 0x3FFF to MSR_RAW.
> > + * When MSR_RAW is larger, SW will convert lower temperature.
>
> I'm not sure to fully understand the explanation and the goal.
>

Hardware constraint.

> > + */ > + temp = readl(LVTSPROTCTL_0 + base);
> > + writel(temp | 0x3FFF, LVTSPROTCTL_0 + base);
> > +
> > + /* Disable the interrupt of AP SW */
> > + temp = readl(LVTSMONINT_0 + base);
> > + writel(temp & ~(STAGE3_INT_EN), LVTSMONINT_0 + base);
> > +}

[...]

> > +static void set_tc_hw_reboot_threshold(struct lvts_data *lvts_data, int trip_point, int tc_id)
> > +{
> > + void __iomem *base = GET_BASE_ADDR(lvts_data, tc_id);
> > + struct device *dev = lvts_data->dev;
> > + unsigned int msr_raw, temp, config, d_index;
> > +
> > + d_index = get_dominator_index(lvts_data, tc_id);
> > +
> > + dev_dbg(dev, "lvts_tc_%d: dominator sensing point = %d\n", tc_id, d_index);
> > +
> > + disable_hw_reboot_interrupt(lvts_data, tc_id);
> > + temp = readl(LVTSPROTCTL_0 + base);
> > + if (d_index == ALL_SENSING_POINTS) {
> > + /* Maximum of 4 sensing points */
> > + config = (0x1 << 16);
> > + writel(config | temp, LVTSPROTCTL_0 + base);
> > +
> > + } else {
> > + /* Select protection sensor */
> > + config = ((d_index << 2) + 0x2) << 16;
> > + writel(config | temp, LVTSPROTCTL_0 + base);
> > + }
>
> So if ALL_SENSING_POINTS is set, the hardware is able to reboot on the
> first sensors reaching the hw reboot temp, otherwise it is the specified
> sensor which triggers the hw reboot? Do I read it correctly?
>

Yes. Will reboot on the first sensors reaching the hardware reboot
temp. But tc need have 4 sensors.

> Why not use the ALL_SENSING_POINTS for all the cases?
>

Not all TCs have 4 sensors.
And ALL_SENSING_POINTS need more hardware resource/time to check 4
sensors. So we use one sensor for usage.

> > + msr_raw = lvts_temp_to_raw(&lvts_data->coeff, trip_point);
> > + writel(msr_raw, LVTSPROTTC_0 + base);
> > + enable_hw_reboot_interrupt(lvts_data, tc_id);
> > +}

[...]

> > +static void tc_irq_handler(struct lvts_data *lvts_data, int tc_id)
> > +{
> > + void __iomem *base = GET_BASE_ADDR(lvts_data, tc_id);
> > + const struct device *dev = lvts_data->dev;
> > + unsigned int ret = readl(LVTSMONINTSTS_0 + base);
> > +
> > + /* Write back to clear interrupt status */
> > + writel(ret, LVTSMONINTSTS_0 + base);
> > + dev_dbg(dev, "LVTS thermal controller %d, LVTSMONINTSTS=0x%08x\n", tc_id, ret);
> > + if (ret & THERMAL_PROTECTION_STAGE_3)
>
> That means ?
>

It’s hardware design for clearing the interrupt status. (write
register to clear interrupt status)

> > + dev_dbg(dev, "Thermal protection stage 3 interrupt triggered\n");
>
> When do we notify the thermal zone an interrupt happened because of a
> trip point?
>

Just to print log to let the software know that the highest/reboot
temperature has happened.

> > +}
> > +
> > +static irqreturn_t irq_handler(int irq, void *dev_id)
> > +{
> > + void __iomem *base;
> > + struct lvts_data *lvts_data = (struct lvts_data *)dev_id;
> > + struct device *dev = lvts_data->dev;
> > + const struct lvts_tc_settings *tc = lvts_data->tc;
> > + unsigned int i, irq_bitmap;
> > +
> > + base = lvts_data->base;
> > + irq_bitmap = readl(THERMINTST + base);
> > + dev_dbg(dev, "THERMINTST = 0x%x\n", irq_bitmap);
>
> IIUC, there is one interrupt happening when a temperature threshold is
> crossed but we don't know which sensor it is. So we retrieve the bit
> mask of interrupt pending, right ? ...
>

The interrupt belongs to TC.

> > + for (i = 0; i < lvts_data->num_tc; i++) {
> > + if (tc[i].irq_bit == 0)
>
> ... but then what do we do with this bitmap ?
>

The bitmap is used to store values of AP and MCU.

> > + tc_irq_handler(lvts_data, i);
> > + }
> > +
> > + return IRQ_HANDLED;
> > +}

[...]

> > diff --git a/drivers/thermal/mediatek/lvts_thermal.h b/drivers/thermal/mediatek/lvts_thermal.h
> > new file mode 100644
> > index 000000000000..a94ce46acccd
> > --- /dev/null
> > +++ b/drivers/thermal/mediatek/lvts_thermal.h
> > @@ -0,0 +1,385 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright (c) 2022 MediaTek Inc.
> > + */

[...]

> > +
> > +struct lvts_data;
>
> This forward declaration should not be needed
>

This is needed because "struct platform_ops" uses "struct lvts_data" as args.
removing it causes a compilation error.
and "struct lvts_data" also uses "struct platform_ops" as a member.
Do you have any suggestions on removing this forward declaration, if
it must be removed?

[...]

> > +struct lvts_tc_settings {
> > + unsigned int dev_id;
> > + unsigned int addr_offset;
> > + unsigned int num_sensor;
> > + unsigned int ts_offset;
> > + unsigned int sensor_map[ALL_SENSING_POINTS]; /* In sensor ID */
>
> Can you explain what is for the sensor_map?
>

A controller contains up to 4 sensors.
Example :
...
.sensor_map = {MT8195_TS1_0, MT8195_TS1_1},
...
.sensor_map = {MT8195_TS3_0, MT8195_TS3_1, MT8195_TS3_2, MT8195_TS3_3},

> > + struct lvts_speed_settings *tc_speed;
> > + /*
> > + * HW filter setting
> > + * 000: Get one sample
> > + * 001: Get 2 samples and average them
> > + * 010: Get 4 samples, drop max and min, then average the rest of 2 samples
> > + * 011: Get 6 samples, drop max and min, then average the rest of 4 samples
> > + * 100: Get 10 samples, drop max and min, then average the rest of 8 samples
> > + * 101: Get 18 samples, drop max and min, then average the rest of 16 samples
> > + */
> > + unsigned int hw_filter;
>
> Isn't possible to specify a latency constraint instead of a hw_filter
> and deduce this one?
>

These are hardware pre-defined settings.

> > + /*
> > + * Dominator_sensing point is used to select a sensing point
> > + * and reference its temperature to trigger Thermal HW Reboot
> > + * When it is ALL_SENSING_POINTS, it will select all sensing points
> > + */
> > + int dominator_sensing_point;
>
> replace int by the enum, hopefully 'clang' can detect enum mismatches
>
> > + int hw_reboot_trip_point; /* -274000: Disable HW reboot */
>
> It is not necessary to specify the hw_reboot_trip_point, it is all the
> same in the declaration.
>

Because new chips in the future may have different settings for
different modules,
in order to maintain flexibility, it is recommended to keep
hw_reboot_trip_point in the struct

> > + unsigned int irq_bit;
> > +};

[...]

> > +
> > +struct lvts_data {
>
> IMO it would make sense to rename the lvts_data to lvts_thermal_domain
>
> > + struct device *dev;
> > + struct clk *clk;
> > + void __iomem *base; /* LVTS base addresses */
> > + unsigned int irq_num; /* LVTS interrupt numbers */
>
> The probe function can be reworked to get the irq and the pass it the
> register function. No need to store it in this structure.
>

Still working on it...
But, it will be good to give me a little help.

> > + struct reset_control *reset;
> > + int num_tc; /* Number of LVTS thermal controllers */
> > + const struct lvts_tc_settings *tc;
> > + int counting_window_us; /* LVTS device counting window */
> > + int num_sensor; /* Number of sensors in this platform */
> > + void __iomem **reg;
> > + struct platform_ops ops;
> > + int feature_bitmap; /* Show what features are enabled */
> > + unsigned int num_efuse_addr;
> > + unsigned int *efuse;
> > + unsigned int num_efuse_block; /* Number of contiguous efuse indexes */
> > + struct lvts_sensor_cal_data cal_data;
>
> Everything related to the initialization must be somehow removed from
> this structure.
>

Still working on it...
But, it will be good to give me a little help.

> > + struct lvts_formula_coeff coeff;
> > +};

[...]

Best regards,
Balsam