Re: [PATCH V10 4/4] thermal: qcom: add support for PMIC5 Gen3 ADC thermal monitoring

From: Daniel Lezcano

Date: Thu Apr 16 2026 - 17:12:25 EST

On 4/16/26 10:05, Jishnu Prakash wrote:

Hi Daniel,

On 4/9/2026 11:42 AM, Daniel Lezcano wrote:

On Fri, Jan 30, 2026 at 05:24:21PM +0530, Jishnu Prakash wrote:

Add support for ADC_TM part of PMIC5 Gen3.

This is an auxiliary driver under the Gen3 ADC driver, which implements the
threshold setting and interrupt generating functionalities of QCOM ADC_TM
drivers, used to support thermal trip points.

Signed-off-by: Jishnu Prakash <jishnu.prakash@xxxxxxxxxxxxxxxx>

...

+
+static irqreturn_t adctm5_gen3_isr(int irq, void *dev_id)
+{
+ struct adc_tm5_gen3_chip *adc_tm5 = dev_id;
+ int ret, sdam_num;
+ u8 tm_status[2];
+ u8 status, val;
+
+ sdam_num = get_sdam_from_irq(adc_tm5, irq);
+ if (sdam_num < 0) {
+ dev_err(adc_tm5->dev, "adc irq %d not associated with an sdam\n",
+ irq);
+ return IRQ_HANDLED;
+ }
+
+ ret = adc5_gen3_read(adc_tm5->dev_data, sdam_num, ADC5_GEN3_STATUS1,
+ &status, sizeof(status));
+ if (ret) {
+ dev_err(adc_tm5->dev, "adc read status1 failed with %d\n", ret);
+ return IRQ_HANDLED;
+ }
+
+ if (status & ADC5_GEN3_STATUS1_CONV_FAULT) {
+ dev_err_ratelimited(adc_tm5->dev,
+ "Unexpected conversion fault, status:%#x\n",
+ status);
+ val = ADC5_GEN3_CONV_ERR_CLR_REQ;
+ adc5_gen3_status_clear(adc_tm5->dev_data, sdam_num,
+ ADC5_GEN3_CONV_ERR_CLR, &val, 1);
+ return IRQ_HANDLED;
+ }
+
+ ret = adc5_gen3_read(adc_tm5->dev_data, sdam_num, ADC5_GEN3_TM_HIGH_STS,
+ tm_status, sizeof(tm_status));
+ if (ret) {
+ dev_err(adc_tm5->dev, "adc read TM status failed with %d\n", ret);
+ return IRQ_HANDLED;
+ }
+
+ if (tm_status[0] || tm_status[1])
+ schedule_work(&adc_tm5->tm_handler_work);
+
+ dev_dbg(adc_tm5->dev, "Interrupt status:%#x, high:%#x, low:%#x\n",
+ status, tm_status[0], tm_status[1]);
+
+ return IRQ_HANDLED;

This ISR routine should be revisited:

- no error message inside

I'll drop all the error messages, but does that also include the debug print at the end?
In addition, the print for conversion fault is ratelimited and may be useful as it
indicates a possible HW issue, can I keep that?

It is not good practice to print error messages in an ISR. If the conversion fails, the thread blocked on the read will time out and then show a message.

- use a shared interrupt to split what is handled by the ADC and the
TM drivers

I'll make the required updates in the main ADC driver and this driver to share the first
SDAM's interrupt.

- do not return IRQ_HANDLED in case of error (cf. irqreturn.h doc)

I'll replace IRQ_HANDLED with IRQ_NONE at places where errors are returned.
But in the case of conversion fault, I think returning IRQ_HANDLED may be
more appropriate because we do handle it by clearing the status, to
allow subsequent conversion requests to be sent.

What do you think, is this fine?

It is a good point.

Actually, if get_sdam_from_irq() or adc5_gen3_read() fails, the handler returns without clearing the interrupt flag, so we could potentially end up in an infinite loop.

So the status should be cleared at the end and IRQ_HANDLED returned. IRQ_NONE should be returned if the interrupt is for another subsystem.

If you think there can be a significant number of errors in the handler, maybe you should add statistics, but later, in an additional series, if it makes sense.
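
To illustrate the return-value policy described above, here is a minimal userspace model. It is only a sketch: the types, field names and MODEL_* values are stand-ins invented for this example and are not the kernel's irqreturn_t or the driver's real structures.

```c
/* Userspace model only; nothing here is real kernel or driver API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the kernel's IRQ_NONE / IRQ_HANDLED. */
enum model_irqreturn { MODEL_IRQ_NONE = 0, MODEL_IRQ_HANDLED = 1 };

/* Hypothetical device state; field names are illustrative. */
struct model_adc_tm {
	bool irq_pending;	/* latched interrupt status in "hardware" */
	bool read_fails;	/* simulate a failing status read */
	int violations;		/* violations picked up for processing */
};

enum model_irqreturn model_isr(struct model_adc_tm *tm)
{
	assert(tm != NULL);

	/* Not our interrupt: let other handlers on a shared line try. */
	if (!tm->irq_pending)
		return MODEL_IRQ_NONE;

	/* Only act on the status if it could be read; no error print. */
	if (!tm->read_fails)
		tm->violations++;

	/*
	 * Clear the latched status on every path, including the error
	 * path, so the interrupt cannot fire again indefinitely.
	 */
	tm->irq_pending = false;
	return MODEL_IRQ_HANDLED;
}
```

The point of the model is that the error path and the success path leave through the same clear-and-return step, while IRQ_NONE is reserved for the case where nothing was pending for this device.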

[ ... ]

+ adc_tm5 = prop->chip;
+
+ if (prop->last_temp_set) {
+ pr_debug("last_temp: %d\n", prop->last_temp);
+ prop->last_temp_set = false;
+ *temp = prop->last_temp;
+ return 0;
+ }

Why do you need to do that?

The temperature should reflect the current situation even if the
reading was triggered by a thermal trip violation.

This logic is needed to handle a corner case issue we have seen earlier.
In this case, the ADC_TM threshold violation interrupt gets triggered,
but when get_temp() is subsequently called by the thermal framework, the
temperature has fluctuated and the value read now lies within the thresholds,
so the thresholds do not get updated by the thermal framework and the violation
interrupts get repeated several times, until there is a get_temp() call
which returns a temperature outside the threshold range.

Oh, that's clearly an issue with the thermal framework, not the driver.

In order to avoid this issue, when the interrupt handler runs, we find the actual
temperature read in ADC_TM that led to threshold violation by reading the ADC_TM
data registers and we cache it and return it when get_temp() is called in the flow
of thermal_zone_device_update(). Any subsequent calls to get_temp() would
return the actual channel temperature at the time.

This is only done to avoid delaying thermal mitigation due to temperature
fluctuations. Do you think this needs to be changed?

I think it is an interesting problem, and it certainly impacts all thermal sensors. It should be fixed in the thermal framework itself if possible. Just drop this portion of code and let's handle that correctly in the thermal framework.

[ ... ]

+ dev_dbg(adc_tm5->dev, "channel:%s, low_temp(mdegC):%d, high_temp(mdegC):%d\n",
+ prop->common_props.label, low_temp, high_temp);
+
+ guard(adc5_gen3)(adc_tm5);
+ if (high_temp == INT_MAX && low_temp == -INT_MAX)
+ return adc_tm5_gen3_disable_channel(prop);

Why disable the channel instead of returning an errno ?

This is the convention we follow in our existing ADC_TM driver at
drivers/thermal/qcom/qcom-spmi-adc-tm5.c. If both upper and lower
thresholds are meant to be disabled, we disable the channel fully
in HW to save some power and it can be enabled later if this API
is called for it with valid thresholds.

Is it considered invalid in the thermal framework to try to disable
both thresholds? Should I both disable the channel and return some
error from here?

Well, if the channel is disabled, then the temperature sensor of the thermal zone is disabled. Consequently, the thermal zone is disabled from a HW POV but still enabled from the kernel POV.

Why not add the 'change_mode' op and then disable the thermal zone (+ pm_runtime)?
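
A rough userspace model of that idea, to show how a mode callback keeps the kernel view and the HW view in sync. All names below (model_zone, model_change_mode, model_zone_set_mode) are hypothetical stand-ins for this sketch, not the thermal framework's real types, and runtime PM is left out.

```c
/* Userspace model only; the types are stand-ins, not thermal core API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the framework's enabled/disabled zone mode. */
enum model_mode { MODEL_DISABLED = 0, MODEL_ENABLED = 1 };

/* Hypothetical ADC_TM channel: only tracks whether HW is enabled. */
struct model_channel {
	bool hw_enabled;
};

struct model_zone {
	enum model_mode mode;		/* what the "kernel" believes */
	struct model_channel *chan;	/* what the "hardware" does */
	int (*change_mode)(struct model_zone *tz, enum model_mode mode);
};

/* Driver side: enable or disable the channel to match the zone mode. */
int model_change_mode(struct model_zone *tz, enum model_mode mode)
{
	assert(tz != NULL && tz->chan != NULL);
	tz->chan->hw_enabled = (mode == MODEL_ENABLED);
	return 0;
}

/* Framework side: only record the new mode if the driver accepted it. */
int model_zone_set_mode(struct model_zone *tz, enum model_mode mode)
{
	int ret = tz->change_mode ? tz->change_mode(tz, mode) : 0;

	if (ret)
		return ret;
	tz->mode = mode;
	return 0;
}
```

With this shape the channel is never disabled behind the framework's back: the zone mode and the channel state change together, or not at all.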

[ ... ]

+ /*
+ * Skipping first SDAM IRQ as it is requested in parent driver.
+ * If there is a TM violation on that IRQ, the parent driver calls
+ * the notifier (adctm_event_handler) exposed from this driver to handle it.
+ */
+ for (i = 1; i < adc_tm5->dev_data->num_sdams; i++) {
+ ret = devm_request_threaded_irq(dev,
+ adc_tm5->dev_data->base[i].irq,
+ NULL, adctm5_gen3_isr, IRQF_ONESHOT,
+ adc_tm5->dev_data->base[i].irq_name,
+ adc_tm5);

The threaded interrupts set the isr in a thread and from the thread
handling the event, there is a work queue scheduled. Why not use the
top and bottom halves of the threaded interrupt ? Hopefully you should
be able to remove the lock.

Yes, I can use the top and bottom halves of the threaded interrupt as you
suggested. But what exactly do you mean by removing the lock?

If you meant the mutex lock used in this driver, we cannot remove that.
This is because the ADC_TM driver needs to write into several registers
shared with the main ADC driver for setting new thresholds, so we
have to share a mutex between the drivers to prevent concurrency issues.

When a workqueue tampers with registers while an interrupt handler is doing the same, the lock is needed.

But if the workqueue is replaced by a threaded interrupt, the lock *may* not be needed because the design may prevent race conditions.

That may not be true in this case; I did not investigate deeper in the code to figure it out. Let's see the next version.
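
For reference, the top half / bottom half split can be modelled in userspace as below. This is only a sketch under assumptions: the split_* names are invented, split_dispatch() stands in for what the IRQ core does, and the top half here peeks at a status field, which the real driver may not be able to do if register access has to sleep.

```c
/* Userspace model only; split_* names are invented for this sketch. */
#include <assert.h>
#include <stddef.h>

/* Stand-ins for IRQ_NONE / IRQ_HANDLED / IRQ_WAKE_THREAD. */
enum split_ret { SPLIT_NONE, SPLIT_HANDLED, SPLIT_WAKE_THREAD };

/* Hypothetical device: latched high/low violation bits per channel. */
struct split_dev {
	unsigned char high_sts;
	unsigned char low_sts;
	int trips_handled;
};

/* Top half: decide quickly whether the threaded part has work to do. */
enum split_ret split_hardirq(struct split_dev *d)
{
	assert(d != NULL);
	if (!d->high_sts && !d->low_sts)
		return SPLIT_NONE;
	return SPLIT_WAKE_THREAD;
}

/* Bottom half: handle each violated channel, then clear the status. */
enum split_ret split_thread(struct split_dev *d)
{
	unsigned int bits = d->high_sts | (d->low_sts << 8);

	for (; bits; bits >>= 1)
		d->trips_handled += bits & 1;

	d->high_sts = 0;
	d->low_sts = 0;
	return SPLIT_HANDLED;
}

/* Model of the IRQ core: run the thread only when the top half asks. */
enum split_ret split_dispatch(struct split_dev *d)
{
	enum split_ret ret = split_hardirq(d);

	if (ret == SPLIT_WAKE_THREAD)
		ret = split_thread(d);
	return ret;
}
```

In this model there is no separate work item touching the device, so the handling for one interrupt runs in a single ordered path. Whether that is enough to drop the mutex shared with the main ADC driver is a separate question that the model does not answer.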

I'll address all your other comments too in the next version of this patch.

Thanks

-- Daniel