Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

From: Daniel Lezcano
Date: Fri Feb 23 2018 - 06:29:03 EST


On 23/02/2018 08:34, Viresh Kumar wrote:
> On 21-02-18, 16:29, Daniel Lezcano wrote:
>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
>> index 5c219dc..9340216 100644
>> --- a/drivers/thermal/cpu_cooling.c
>> +++ b/drivers/thermal/cpu_cooling.c
>> @@ -10,18 +10,32 @@
>> * Viresh Kumar <viresh.kumar@xxxxxxxxxx>
>> *
>> */
>> +#undef DEBUG
>
> Why is this required ?

It is usually added, so if you set the -DDEBUG flag when compiling, you
don't get all the pr_debug traces for all files, but the just the ones
where you commented the #undef above. pr_debug is a no-op otherwise.

>> +#define pr_fmt(fmt) "CPU cooling: " fmt
>
> I think you can use the dev_***() routines instead, as you can
> directly the CPU device from anywhere.

Can we postpone this change for later ? All the file is using pr_*
(cpufreq_cooling included). There is only one place where dev_err is
used but it is removed by the patch 3/7.

[ ... ]

>> +#ifdef CONFIG_CPU_IDLE_THERMAL
>> +/*
>> + * The idle duration injection. As we don't have yet a way to specify
>> + * from the DT configuration, let's default to a tick duration.
>> + */
>> +#define DEFAULT_IDLE_TIME_US TICK_USEC
>
> I think this macro is a bit unnecessary here. Its used only at
> initialization and so the same comment can be present there and you
> can use TICK_USEC there.
>
> Else, Keep it as it is and remove the "idle_cycle" field from the
> below structure, as it holds a constant forever.

Yes, makes sense.

>> +/**
>> + * struct cpuidle_cooling_device - data for the idle cooling device
>> + * @cdev: a pointer to a struct thermal_cooling_device
>> + * @cpumask: a cpumask containing the CPU managed by the cooling device
>
> Missed @node here.

Ok.

>> + * @timer: a hrtimer giving the tempo for the idle injection cycles
>> + * @kref: a kernel refcount on this structure
>> + * @count: an atomic to keep track of the last task exiting the idle cycle
>> + * @idle_cycle: an integer defining the duration of the idle injection
>> + * @state: an normalized integer giving the state of the cooling device
>> + */
>> +struct cpuidle_cooling_device {
>> + struct thermal_cooling_device *cdev;
>> + struct cpumask *cpumask;
>> + struct list_head node;
>> + struct hrtimer timer;
>> + struct kref kref;
>> + atomic_t count;
>> + unsigned int idle_cycle;
>> + unsigned int state;
>> +};
>> +
>> +/**
>> + * @tsk: an array of pointer to the idle injection tasks
>> + * @waitq: the waiq for the idle injection tasks
>> + */
>> +struct cpuidle_cooling_tsk {
>> + struct task_struct *tsk;
>> + wait_queue_head_t waitq;
>> +};
>> +
>> +DEFINE_PER_CPU(struct cpuidle_cooling_tsk, cpuidle_cooling_tsk);
>
> static ?

Ok.

[ ... ]

>> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
>> +{
>> + s64 next_wakeup;
>> + int state = idle_cdev->state;
>> +
>> + /*
>> + * The function must never be called when there is no
>> + * mitigation because:
>> + * - that does not make sense
>> + * - we end up with a division by zero
>> + */
>> + BUG_ON(!state);
>
> As there is no locking in place, we can surely hit this case. What if
> the state changed to 0 right before this routine was called ?
>
> I would suggest we should just return 0 in that case and get away with
> the BUG_ON().

Ok.

>> +
>> + next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
>> + idle_cdev->idle_cycle;
>> +
>> + return next_wakeup * NSEC_PER_USEC;
>> +}
>> +
>> +/**
>> + * cpuidle_cooling_injection_thread - Idle injection mainloop thread function
>> + * @arg: a void pointer containing the idle cooling device address
>> + *
>> + * This main function does basically two operations:
>> + *
>> + * - Goes idle for a specific amount of time
>> + *
>> + * - Sets a timer to wake up all the idle injection threads after a
>> + * running period
>> + *
>> + * That happens only when the mitigation is enabled, otherwise the
>> + * task is scheduled out.
>> + *
>> + * In order to keep the tasks synchronized together, it is the last
>> + * task exiting the idle period which is in charge of setting the
>> + * timer.
>> + *
>> + * This function never returns.
>> + */
>> +static int cpuidle_cooling_injection_thread(void *arg)
>> +{
>> + struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };
>> + struct cpuidle_cooling_device *idle_cdev = arg;
>> + struct cpuidle_cooling_tsk *cct = per_cpu_ptr(&cpuidle_cooling_tsk,
>> + smp_processor_id());
>
> this_cpu_ptr ?

yeah, even better.

>> + DEFINE_WAIT(wait);
>> +
>> + set_freezable();
>> +
>> + sched_setscheduler(current, SCHED_FIFO, &param);
>> +
>> + while (1) {
>> + s64 next_wakeup;
>> +
>> + prepare_to_wait(&cct->waitq, &wait, TASK_INTERRUPTIBLE);
>> +
>> + schedule();
>> +
>> + atomic_inc(&idle_cdev->count);
>> +
>> + play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);
>> +
>> + /*
>> + * The last CPU waking up is in charge of setting the
>> + * timer. If the CPU is hotplugged, the timer will
>> + * move to another CPU (which may not belong to the
>> + * same cluster) but that is not a problem as the
>> + * timer will be set again by another CPU belonging to
>> + * the cluster, so this mechanism is self adaptive and
>> + * does not require any hotplugging dance.
>> + */
>
> Well this depends on how CPU hotplug really happens. What happens to
> the per-cpu-tasks which are in the middle of something when hotplug
> happens? Does hotplug wait for those per-cpu-tasks to finish ?
>
>> + if (!atomic_dec_and_test(&idle_cdev->count))
>> + continue;
>> +
>> + if (!idle_cdev->state)
>> + continue;
>> +
>> + next_wakeup = cpuidle_cooling_runtime(idle_cdev);
>> +
>> + hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),
>> + HRTIMER_MODE_REL_PINNED);
>> + }
>> +
>> + finish_wait(&cct->waitq, &wait);
>> +
>> + return 0;
>> +}
>> +
>> +/**
>> + * cpuidle_cooling_release - Kref based release helper
>> + * @kref: a pointer to the kref structure
>> + *
>> + * This function is automatically called by the kref_put function when
>> + * the idle cooling device refcount reaches zero. At this point, we
>> + * have the guarantee the structure is no longer in use and we can
>> + * safely release all the ressources.
>> + */
>> +static void __init cpuidle_cooling_release(struct kref *kref)
>> +{
>> + struct cpuidle_cooling_device *idle_cdev =
>> + container_of(kref, struct cpuidle_cooling_device, kref);
>> +
>> + thermal_cooling_device_unregister(idle_cdev->cdev);
>> + kfree(idle_cdev);
>> +}
>
> Maybe just move it closer to the only function that calls it?

Ok.

[ ... ]

>> +/**
>> + * cpuidle_cooling_get_cur_state - Get the current cooling state
>> + * @cdev: the thermal cooling device
>> + * @state: a pointer to the state
>> + *
>> + * The function just copy the state value from the private thermal
>> + * cooling device structure, the mapping is 1 <-> 1.
>> + *
>> + * The function can not fail, it returns always zero.
>> + */
>> +static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,
>> + unsigned long *state)
>> +{
>> + struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
>
> This line isn't aligned properly. Spaces are present instead of a tab.

Ok.

[ ... ]

> Same at other places as well.

Ok, I will double check it.

[ ... ]

>> +static void cpuidle_cooling_unregister(void)
>> +{
>> + struct cpuidle_cooling_device *tmp, *idle_cdev = NULL;
>> + struct cpuidle_cooling_tsk *cct;
>> + int cpu;
>> +
>> + list_for_each_entry_safe(idle_cdev, tmp, &cpuidle_cdev_list, node) {
>> + for_each_cpu(cpu, idle_cdev->cpumask) {
>> + cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
>> + if (cct->tsk)
>> + kthread_stop(cct->tsk);
>
> What about hrtimer ? Shouldn't that be stopped as well ?

IMO, practically it is not possible to have a timer running at this
point, but rigorously it must be stopped in the release routine, so I
will add it.

>> + kref_put(&idle_cdev->kref, cpuidle_cooling_release);
>> + }
>> + }
>> +}
>> +
>> +/**
>> + * cpuidle_cooling_register - Idle cooling device initialization function
>> + *
>> + * This function is in charge of creating a cooling device per cluster
>> + * and register it to thermal framework. For this we rely on the
>> + * topology as there is nothing yet describing better the idle state
>> + * power domains.
>> + *
>> + * For each first CPU of the cluster's cpumask, we allocate the idle
>> + * cooling device, initialize the general fields and then we initialze
>> + * the rest in a per cpu basis.
>> + *
>> + * Returns zero on success, < 0 otherwise.
>
> This wouldn't get shown in the doc properly, it should be written as:
>
> * Return: zero on success, < 0 otherwise.

Ok.

>> + */
>> +int cpuidle_cooling_register(void)
>> +{
>> + struct cpuidle_cooling_device *idle_cdev = NULL;
>> + struct thermal_cooling_device *cdev;
>> + struct cpuidle_cooling_tsk *cct;
>> + struct task_struct *tsk;
>> + struct device_node *np;
>> + cpumask_t *cpumask;
>> + char dev_name[THERMAL_NAME_LENGTH];
>> + int ret = -ENOMEM, cpu;
>> + int index = 0;
>> +
>> + for_each_possible_cpu(cpu) {
>> + cpumask = topology_core_cpumask(cpu);
>> +
>> + cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
>> +
>> + /*
>> + * This condition makes the first cpu belonging to the
>> + * cluster to create a cooling device and allocates
>> + * the structure. Others CPUs belonging to the same
>> + * cluster will just increment the refcount on the
>> + * cooling device structure and initialize it.
>> + */
>> + if (cpu == cpumask_first(cpumask)) {
>
> Your function still have few assumptions of cpu numbering and it will
> break in few cases. What if the CPUs on a big Little system (4x4) are
> present in this order: B L L L L B B B ??
>
> This configuration can happen if CPUs in DT are marked as: 0-3 LITTLE,
> 4-7 big and a big CPU is used by the boot loader to bring up Linux.

Ok, how can I sort it out ?

>> + np = of_cpu_device_node_get(cpu);
>> +
>> + idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
>> + if (!idle_cdev)
>> + goto out_fail;
>> +
>> + idle_cdev->idle_cycle = DEFAULT_IDLE_TIME_US;
>> +
>> + atomic_set(&idle_cdev->count, 0);
>
> This should already be 0, isn't it ?

Yes.

>> +
>> + kref_init(&idle_cdev->kref);
>> +
>> + /*
>> + * Initialize the timer to wakeup all the idle
>> + * injection tasks
>> + */
>> + hrtimer_init(&idle_cdev->timer,
>> + CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> +
>> + /*
>> + * The wakeup function callback which is in
>> + * charge of waking up all CPUs belonging to
>> + * the same cluster
>> + */
>> + idle_cdev->timer.function = cpuidle_cooling_wakeup_fn;
>> +
>> + /*
>> + * The thermal cooling device name
>> + */
>> + snprintf(dev_name, sizeof(dev_name), "thermal-idle-%d", index++);
>> + cdev = thermal_of_cooling_device_register(np, dev_name,
>> + idle_cdev,
>> + &cpuidle_cooling_ops);
>
> You registered the cooling device, while the idle_cdev is still
> getting initialized. Ideally, we should register with the thermal core
> (or any other framework) when we are fully ready. For example, any of
> the callbacks present in cpuidle_cooling_ops() can get called by the
> core after this point and you should be ready to handle them. It will
> result in kernel crash in your case as idle_cdev isn't fully
> initialized yet. For example, with the set-state callback you will end
> up calling wake_up_task(NULL).
>
>> + if (IS_ERR(cdev)) {
>> + ret = PTR_ERR(cdev);
>> + goto out_fail;
>> + }
>> +
>> + idle_cdev->cdev = cdev;
>> +
>> + idle_cdev->cpumask = cpumask;
>> +
>> + list_add(&idle_cdev->node, &cpuidle_cdev_list);
>
> I would have removed the above two blank lines as these can all go
> together.

Ok.

>> +
>> + pr_info("Created idle cooling device for cpus '%*pbl'\n",
>> + cpumask_pr_args(cpumask));
>
> I am not sure if it makes sense to print the message right here. We
> can still fail while creating the cooling devices. Maybe a single
> print at the very end of the function is more than enough?

Ok.

>> + }
>> +
>> + kref_get(&idle_cdev->kref);
>
> Did you check if the kref thing is really working here or not? I think
> your structure will never get freed on errors because kref_init()
> initializes the count by 1 and then you do an additional kref_get()
> here for the first CPU of the cluster.

Yes, you are right. I will fix that.

>> +
>> + init_waitqueue_head(&cct->waitq);
>> +
>> + tsk = kthread_create_on_cpu(cpuidle_cooling_injection_thread,
>> + idle_cdev, cpu, "kidle_inject/%u");
>> + if (IS_ERR(tsk)) {
>> + ret = PTR_ERR(tsk);
>> + goto out_fail;
>> + }
>> +
>> + cct->tsk = tsk;
>> +
>> + wake_up_process(tsk);
>> + }
>> +
>> + return 0;
>> +
>> +out_fail:
>> + cpuidle_cooling_unregister();
>> + pr_err("Failed to create idle cooling device (%d)\n", ret);
>
> So you are already printing the error message here, just make the
> return type void as the caller of this can't do anything anyway.

Ok.

>> +
>> + return ret;
>> +}
>> +#endif /* CONFIG_CPU_IDLE_THERMAL */


Thanks for the review.

-- Daniel


--
<http://www.linaro.org/> Linaro.org â Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog