Re: [PATCH v4 5/7] thermal: intel: hfi: Enable notification interrupt
From: Ricardo Neri
Date: Thu Jan 13 2022 - 08:47:05 EST
On Wed, Jan 12, 2022 at 03:26:41PM -0800, Srinivas Pandruvada wrote:
> On Wed, 2022-01-12 at 20:47 +0100, Rafael J. Wysocki wrote:
> > On Sat, Jan 8, 2022 at 4:46 AM Ricardo Neri
> > <ricardo.neri-calderon@xxxxxxxxxxxxxxx> wrote:
> > > When hardware wants to inform the operating system about updates in
> > > the HFI
> > > table, it issues a package-level thermal event interrupt. For this,
> > > hardware has new interrupt and status bits in the
> > > IA32_PACKAGE_THERM_
> > > INTERRUPT and IA32_PACKAGE_THERM_STATUS registers. The existing
> > > thermal
> > > throttle driver already handles thermal event interrupts: it
> > > initializes
> > > the thermal vector of the local APIC as well as per-CPU and
> > > package-level
> > > interrupt reporting. It also provides routines to service such
> > > interrupts.
> > > Extend its functionality to also handle HFI interrupts.
> > >
> > > The frequency of the thermal HFI interrupt is specific to each
> > > processor
> > > model. On some processors, a single interrupt happens as soon as
> > > the HFI is
> > > enabled and hardware will never update HFI capabilities afterwards.
> > > On
> > > other processors, thermal and power constraints may cause thermal
> > > HFI
> > > interrupts every tens of milliseconds.
> > >
> > > To not overwhelm consumers of the HFI data, use delayed work to
> > > throttle
> > > the rate at which HFI updates are processed.
> > >
> > >
>
> [...]
>
> > > +void intel_hfi_process_event(__u64 pkg_therm_status_msr_val)
> > > +{
> > > + struct hfi_instance *hfi_instance;
> > > + int cpu = smp_processor_id();
> > > + struct hfi_cpu_info *info;
> > > + u64 new_timestamp;
> > > +
> > > + if (!pkg_therm_status_msr_val)
> > > + return;
> > > +
> > > + info = &per_cpu(hfi_cpu_info, cpu);
> > > + if (!info)
> > > + return;
> > > +
> > > + /*
> > > + * It is possible that we get an HFI thermal interrupt on
> > > this CPU
> > > + * before its HFI instance is initialized.
> Although this code can handle this situation, you can avoid this.
>
> You can call intel_hfi_online(cpu) before
> "
> l = apic_read(APIC_LVTTHMR);
> apic_write(APIC_LVTTHMR, l & ~APIC_LVT_MASKED);
> "
Indeed this could work.
intel_hfi_online() also enables the HFI. It is possible to get an HFI
interrupt as soon as we enable it. I was concerned about getting the
interrupt with APIC_LVTTHMR masked and miss it. This should not be
a problem since we will get it as soon as we unmask it.
> in thermal_throttle_online()
>
> In the same way call intel_hfi_offline(cpu)
> after
>
> /* Mask the thermal vector before draining evtl. pending work */
> l = apic_read(APIC_LVTTHMR);
> apic_write(APIC_LVTTHMR, l | APIC_LVT_MASKED);
>
> in thermal_throttle_offline()
Sure, I can do this.
>
>
> > > This is not a problem. The
> > > + * CPU that enabled the interrupt for this package will
> > > also get the
> > > + * interrupt and is fully initialized.
> > > + */
> > > + hfi_instance = info->hfi_instance;
> > > + if (!hfi_instance)
> > > + return;
> >
> > Generally, a CPU whose info has been initialized can be offline, so
> > this code may run on an offline CPU.
> >
> > I'm not actually sure if this is a concern, but just mentioning it in
> > case it is.
> >
> It will not matter as the handler of the message should be handle case
> as CPU can go offline later after the message even if the CPU was
> offline.
> But I think we can avoid this situation.
>
> > > +
> > > + /*
> > > + * On most systems, all CPUs in the package receive a
> > > package-level
> > > + * thermal interrupt when there is an HFI update. It is
> > > sufficient to
> > > + * let a single CPU to acknowledge the update and schedule
> > > work to
> > > + * process it. The remaining CPUs can resume their work.
> > > + */
> > > + if (!raw_spin_trylock(&hfi_instance->event_lock))
> > > + return;
> > > +
> > > + /* Skip duplicated updates. */
> > > + new_timestamp = *(u64 *)hfi_instance->hw_table;
> > > + if (*hfi_instance->timestamp == new_timestamp) {
> > > + raw_spin_unlock(&hfi_instance->event_lock);
> > > + return;
> > > + }
> > > +
> > > + raw_spin_lock(&hfi_instance->table_lock);
> > > +
> > > + /*
> > > + * Copy the updated table into our local copy. This
> > > includes the new
> > > + * timestamp.
> > > + */
> > > + memcpy(hfi_instance->local_table, hfi_instance->hw_table,
> > > + hfi_features.nr_table_pages << PAGE_SHIFT);
> > > +
> > > + raw_spin_unlock(&hfi_instance->table_lock);
> > > + raw_spin_unlock(&hfi_instance->event_lock);
> > > +
> > > + /*
> > > + * Let hardware know that we are done reading the HFI table
> > > and it is
> > > + * free to update it again.
> > > + */
> > > + pkg_therm_status_msr_val &= THERM_STATUS_CLEAR_PKG_MASK &
> > > + ~PACKAGE_THERM_STATUS_HFI_UPDAT
> > > ED;
> > > + wrmsrl(MSR_IA32_PACKAGE_THERM_STATUS,
> > > pkg_therm_status_msr_val);
> > > +
> > > + schedule_delayed_work(&hfi_instance->update_work,
> > > HFI_UPDATE_INTERVAL);
> >
> > AFAICS,
> For my understanding:
>
> > if update_work has been scheduled already,
> queue_delayed_work_on is called
>
> > but is not pending
> > yet, the delay will be set to the current time plus
> > HFI_UPDATE_INTERVAL, but shouldn't it actually run earlier in that
> > case?
>
>
> "
> if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
> work_data_bits(work))) {
> __queue_delayed_work(cpu, wq, dwork, delay);
> ret = true;
> }
> "
> Pending bit will be set only one time, so delay will be from
> the first call of queue_delayed_work_on() + HFI_UPDATE_INTERVAL.
>
> So on subsequent calls of schedule_delayed_work() the delay is always
> with reference to first call.
>
> >
> > Also it looks like the processing introduced in the next patch can
> > take quite a bit of time if there is a sufficiently large number of
> > CPUs in the package, so is it suitable for system_wq in all cases?
> >
> Good question. What is the threshold of not using system_wq?
> With current set of max cpus/package, I did experiments with busy
> systems with a test functions with several workqueues and check if they
> drift in expiry. But I think we can move away from system_wq. Can this
> be add on patch?
Sure, I can do this.
Thanks and BR,
Ricardo