Re: [RFC][PATCH 1/3] PM /devfreq: add user frequency limits into devfreq struct
From: Chanwoo Choi
Date: Wed Feb 24 2021 - 02:49:42 EST
On 2/16/21 7:41 PM, Lukasz Luba wrote:
> Hi Chanwoo,
>
> On 2/15/21 3:00 PM, Chanwoo Choi wrote:
>> Hi Lukasz,
>>
>> On Fri, Feb 12, 2021 at 7:28 AM Lukasz Luba <lukasz.luba@xxxxxxx> wrote:
>>>
>>>
>>>
>>> On 2/11/21 11:07 AM, Lukasz Luba wrote:
>>>> Hi Chanwoo,
>>>>
>>>> On 2/3/21 10:21 AM, Lukasz Luba wrote:
>>>>> Hi Chanwoo,
>>>>>
>>>>> Thank you for looking at this.
>>>>>
>>>>> On 2/3/21 10:11 AM, Chanwoo Choi wrote:
>>>>>> Hi Lukasz,
>>>>>>
>>>>>> When accessing the max_freq and min_freq at devfreq-cooling.c,
>>>>>> even if can access 'user_max_freq' and 'lock' by using the 'devfreq'
>>>>>> instance,
>>>>>> I think that the direct access of variables
>>>>>> (lock/user_max_freq/user_min_freq)
>>>>>> of struct devfreq are not good.
>>>>>>
>>>>>> Instead, how about using the 'DEVFREQ_TRANSITION_NOTIFIER'
>>>>>> notification with following changes of 'struct devfreq_freq'?
>>>>>
>>>>> I like the idea with devfreq notification. I will have to go through the
>>>>> code to check that possibility.
>>>>>
>>>>>> Also, need to add codes into devfreq_set_target() for initializing
>>>>>> 'new_max_freq' and 'new_min_freq' before sending the DEVFREQ_POSTCHANGE
>>>>>> notification.
>>>>>>
>>>>>> diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
>>>>>> index 147a229056d2..d5726592d362 100644
>>>>>> --- a/include/linux/devfreq.h
>>>>>> +++ b/include/linux/devfreq.h
>>>>>> @@ -207,6 +207,8 @@ struct devfreq {
>>>>>> struct devfreq_freqs {
>>>>>> unsigned long old;
>>>>>> unsigned long new;
>>>>>> + unsigned long new_max_freq;
>>>>>> + unsigned long new_min_freq;
>>>>>> };
>>>>>>
>>>>>>
>>>>>> And I think that new 'user_min_freq'/'user_max_freq' are not necessary.
>>>>>> You can get the current max_freq/min_freq by using the following steps:
>>>>>>
>>>>>> get_freq_range(devfreq, &min_freq, &max_freq);
>>>>>> dev_pm_opp_find_freq_floor(pdev, &min_freq);
>>>>>> dev_pm_opp_find_freq_floor(pdev, &max_freq);
>>>>>>
>>>>>> So that you can get the 'max_freq/min_freq' and then
>>>>>> initialize the 'freqs.new_max_freq and freqs.new_min_freq'
>>>>>> with them as following:
>>>>>>
>>>>>> in devfreq_set_target()
>>>>>> get_freq_range(devfreq, &min_freq, &max_freq);
>>>>>> dev_pm_opp_find_freq_floor(pdev, &min_freq);
>>>>>> dev_pm_opp_find_freq_floor(pdev, &max_freq);
>>>>>> freqs.new_max_freq = min_freq;
>>>>>> freqs.new_max_freq = max_freq;
>>>>>> devfreq_notify_transition(devfreq, &freqs, DEVFREQ_POSTCHANGE);
>>>>>
>>>>> I will plumb it in and check that option. My concern is that function
>>>>> get_freq_range() would give me the max_freq value from PM QoS, which
>>>>> might be my thermal limit - lower that user_max_freq. Then I still
>>>>> need
>>>>>
>>>>> I've been playing with PM QoS notifications because I thought it would
>>>>> be possible to be notified in thermal for all new set values - even from
>>>>> devfreq sysfs user max_freq write, which has value higher that the
>>>>> current limit set by thermal governor. Unfortunately PM QoS doesn't
>>>>> send that information by design. PM QoS also by desing won't allow
>>>>> me to check first two limits in the plist - which would be thermal
>>>>> and user sysfs max_freq.
>>>>>
>>>>> I will experiment with this notifications and share the results.
>>>>> That you for your comments.
>>>>
>>>> I have experimented with your proposal. Unfortunately, the value stored
>>>> in the pm_qos which is read by get_freq_range() is not the user max
>>>> freq. It's the value from thermal devfreq cooling when that one is
>>>> lower. Which is OK in the overall design, but not for my IPA use case.
>>>>
>>>> What comes to my mind is two options:
>>>> 1) this patch proposal, with simple solution of two new variables
>>>> protected by mutex, which would maintain user stored values
>>>> 2) add a new notification chain in devfreq to notify about new
>>>> user written value, to which devfreq cooling would register; that
>>>> would allow devfreq cooling to get that value instantly and store
>>>> locally
>>>
>>> 3) How about new define for existing notification chain:
>>> #define DEVFREQ_USER_CHANGE (2)
>>
>> I think that if we add the notification with specific actor like user change
>> or OPP change or others, it is not proper. But, we can add the notification
>> for min or max frequency change timing. Because the devfreq already has
>> the notification for current frequency like DEVFREQ_PRECHANGE,
>> DEVFREQ_POSTCHANGE.
>>
>> Maybe, we can add the following notification for min/max_freq.
>> The following min_freq and max_freq values will be calculated by
>> get_freq_range().
>> DEVFREQ_MIN_FREQ_PRECHANGE
>> DEVFREQ_MIN_FREQ_POSTCHANGE
>> DEVFREQ_MAX_FREQ_PRECHANGE
>> DEVFREQ_MAX_FREQ_POSTCHANGE
>
> Would it be then possible to pass the user max freq value written via
> sysfs? Something like in the example below, when writing into max sysfs:
>
> 1) starting in max_freq_store()
> freqs.new_max_freq = max_freq;
> devfreq_notify_transition(devfreq, &freqs, DEVFREQ_MAX_FREQ_PRECHANGE);
> dev_pm_qos_update_request()
When we use the PRECHANGE and POSTCHANGE notification,
we should keep the consistent value.
When PRECHANGE, notify the previous min/max frequency
containing the user input/cooling policy/OPP.
When POSTCHANGE, notify the new min/max frequency
containing the user input/cooling policy/OPP.
But, in case of your suggestion, DEVFREQ_MAX_FREQ_PRECHANGE considers
only user input without cooling policy/opp.
>
> 2)then after a while in devfreq_set_target()
> get_freq_range(devfreq, &min_freq, &max_freq);
> dev_pm_opp_find_freq_floor(pdev, &min_freq);
> dev_pm_opp_find_freq_floor(pdev, &max_freq);
> freqs.new_min_freq = min_freq;
> freqs.new_max_freq = max_freq;
> devfreq_notify_transition(devfreq, &freqs, DEVFREQ_MAX_FREQ_POSTCHANGE);
>
> This 2nd part is called after the PM QoS has changed that limit,
> so might be missing (in case value was higher that current),
> but thermal would know about that, so no worries.
It doesn't focus on only thermal. We need to consider
all potential user of max_freq notification.
In the devfreq subsystem like devfreq governor,
we might use the user min/max_freq without any restrictions.
But, in this case, devfreq provides the min/max_freq
to outside subsystem/drivers like devfreq-cooling.c of thermal.
IMHO, it is difficult to agree this approach.
If devfreq provides the various min/max_freq value to outside
of devfreq, it makes the confusion to understand the meaning
of min/max_freq. Actually, the other user doesn't need to
know the user input for min/max_freq.
>
>>
>>
>>>
>>> Then a modified devfreq_notify_transition() would get:
>>> @@ -339,6 +339,10 @@ static int devfreq_notify_transition(struct devfreq
>>> *devfreq,
>>>
>>> srcu_notifier_call_chain(&devfreq->transition_notifier_list,
>>> DEVFREQ_POSTCHANGE, freqs);
>>> break;
>>> + case DEVFREQ_USER_CHANGE:
>>> + srcu_notifier_call_chain(&devfreq->transition_notifier_list,
>>> + DEVFREQ_USER_CHANGE, freqs);
>>> + break;
>>> default:
>>> return -EINVAL;
>>> }
>>>
>>> If that is present, I can plumb your suggestion with:
>>> struct devfreq_freq {
>>> + unsigned long new_max_freq;
>>> + unsigned long new_min_freq;
>>>
>>> and populate them with values in the max_freq_store() by adding at the
>>> end:
>>>
>>> freqs.new_max_freq = max_freq;
>>> mutex_lock();
>>> devfreq_notify_transition(devfreq, &freqs, DEVFREQ_USER_CHANGE);
>>> mutex_unlock();
>>>
>>> I would handle this notification in devfreq cooling and keep the
>>> value there, for future IPA checks.
>>>
>>> If you agree, I can send next version of the patch set.
>>>
>>>>
>>>> What do you think Chanwoo?
>>
>> I thought that your suggestion to expose the user input for min/max_freq.
>> But, these values are not valid for the public user. Actually, the devfreq core
>> handles these values only internally without any explicit access from outside.
>>
>> I'm not sure that it is right or not to expose the internal value of
>> devfreq struct.
>> Until now, I think that it is not proper to show the interval value outside.
>>
>> Because the devfreq subsystem only provides the min_freq and max_freq
>> which reflect the all requirement of user input/cooling policy/OPP
>> instead of user_min_freq, user_max_freq.
>>
>> If we provide the user_min_freq, user_max_freq via DEVFREQ notification,
>> we have to make the new sysfs attributes for user_min_freq and user_max_freq
>> to show the value to the user. But, it seems that it is not nice.
>
> I would say we don't have to expose it. Let's take a closer look into
> an example. The main problem is with GPUs. The middleware is aware of
> the OPPs in the GPU. If the middleware wants to switch into different
> power-performance mode e.g. power-saving, it writes into this sysfs
> the max allowed freq. IPA does not know about it and makes wrong
> decisions. As you said, the sysfs read operation combines all:
> user input/cooling policy/OPP, but that's not a problem for this aware
> middleware. So it can stay as is.
> The only addition would be this 'notification about user attempt of
> reducing max device speed' internally inside the kernel, for those
> subsystems which are interested in it.
As I commented on above, I'm not sure to provide the multiple
min/max_freq to outside of devfreq subsytem. Instead, it is ok
to use user min/max_freq inside of devfreq subsystem.
Unfortunately, I didn't suggests the good solution.
It is very important changes. So that I want to consider
the all users of devfreq.
>
>>
>> Actually, I have no other idea how to support your feature.
>> We try to find the more proper method.
>>
>
> Thank you for coming back with your comments. I know it's not
> an easy feature but I hope we can find a solution.
>
> Regards,
> Lukasz
>
>
--
Best Regards,
Chanwoo Choi
Samsung Electronics