Re: [PATCH] Documentation: admin-guide: PM: Add cpuidle document

From: Viresh Kumar
Date: Wed Nov 28 2018 - 00:48:43 EST


On 26-11-18, 14:11, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
>
> Important information is missing from user/admin cpuidle documentation
> available today, so add a new user/admin document for cpuidle containing
> current and comprehensive information to admin-guide and drop the old
> .txt documents it is replacing.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> ---
> Documentation/admin-guide/pm/cpuidle.rst | 603 +++++++++++++++++++++++++
> Documentation/admin-guide/pm/working-state.rst | 1
> Documentation/cpuidle/core.txt | 23
> Documentation/cpuidle/sysfs.txt | 98 ----
> 4 files changed, 604 insertions(+), 121 deletions(-)

Nice work, Rafael. Minor nits below.

> Index: linux-pm/Documentation/admin-guide/pm/cpuidle.rst

> +The ``menu`` Governor
> +=====================
> +
> +The ``menu`` governor is the default ``CPUIdle`` governor for tickless systems.
> +It is quite complex, but the basic principle of its design is straightforward.
> +Namely, when invoked to select an idle state for a CPU (i.e. an idle state that
> +the CPU will ask the processor hardware to enter), it attempts to predict the
> +idle duration and uses the predicted value for idle state selection.
> +
> +It first obtains the time until the closest timer event with the assumption
> +that the scheduler tick will be stopped. That time, referred to as the *sleep
> +length* in what follows, is the upper bound on the time before the next CPU
> +wakeup. It is used to determine the sleep length range, which in turn is needed
> +to get the sleep length correction factor.
> +
> +The ``menu`` governor maintains two arrays of sleep length correction factors.
> +One of them is used when tasks previously running on the given CPU are waiting
> +for some I/O operations to complete and the other one is used when that is not
> +the case. Each array contains several correction factor values that correspond
> +to different sleep length ranges organized so that each range represented in the
> +array is approximately 10 times wider than the previous one.
> +
> +The correction factor for the given sleep length range (determined before
> +selecting the idle state for the CPU) is updated after the CPU has been woken
> +up and the closer the sleep length is to the observed idle duration, the closer
> +to 1 the correction factor becomes (it must fall between 0 and 1 inclusive).
> +The sleep length is multiplied by the correction factor for the range that it
> +falls into to obtain the first approximation of the predicted idle duration.
> +
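
For readers following along, the correction-factor step can be modelled roughly as below. This is only a sketch of the description above, not the kernel code: the function names, the 10-microsecond bound of the first range and the default number of ranges are all assumptions made for illustration.

```python
# Illustrative model only; range boundaries and names are assumptions,
# not the kernel's actual data layout.

def sleep_length_bucket(sleep_length_us, num_buckets=6):
    """Map a sleep length to a range index; each range is 10x wider than the previous one."""
    bucket = 0
    upper = 10  # hypothetical upper bound (microseconds) of the first range
    while sleep_length_us >= upper and bucket < num_buckets - 1:
        upper *= 10
        bucket += 1
    return bucket


def first_approximation(sleep_length_us, factors_io, factors_no_io, io_waiters):
    """Scale the sleep length by the correction factor for the range it falls into."""
    # One array is used when tasks are waiting for I/O, the other one otherwise.
    factors = factors_io if io_waiters > 0 else factors_no_io
    factor = factors[sleep_length_bucket(sleep_length_us, len(factors))]
    if not 0.0 <= factor <= 1.0:
        raise ValueError("correction factors must fall between 0 and 1 inclusive")
    return sleep_length_us * factor
```

For example, with all I/O factors set to 0.5, a 1000 us sleep length and one I/O waiter give a 500 us first approximation.
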
> +Next, the governor uses a simple pattern recognition algorithm to refine its
> +idle duration prediction. Namely, it saves the last 8 observed idle duration
> +values and, when predicting the idle duration next time, it computes the average
> +and variance of them. If the variance is small (smaller than 400 square
> +milliseconds) or it is small relative to the average (the average is greater
> +that 6 times the standard deviation), the average is regarded as the "typical
> +interval" value. Otherwise, the longest of the saved observed idle duration
> +values is discarded and the computation is repeated for the remaining ones.
> +Again, if the variance of them is small (in the above sense), the average is
> +taken as the "typical interval" value and so on, until either the "typical
> +interval" is determined or too many data points are disregarded, in which case
> +the "typical interval" is assumed to equal "infinity" (the maximum unsigned
> +integer value). The "typical interval" computed this way is compared with the
> +sleep length multiplied by the correction factor and the minumum of the two is

minimum

> +taken as the predicted idle duration.
> +
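
A rough Python model of the pattern recognition step described above may help here. It is a sketch under stated assumptions: the threshold of 400 and the factor of 6 come from the quoted text and are applied in whatever (squared) units the inputs use, while `min_points` is a guess at what "too many data points are disregarded" means, and `UINT_MAX` assumes a 32-bit unsigned integer.

```python
import math

UINT_MAX = 2**32 - 1  # stands in for "infinity" (assumes a 32-bit unsigned int)


def typical_interval(intervals, var_limit=400, min_points=6):
    """Return the "typical interval" of recent idle durations, or UINT_MAX."""
    samples = sorted(intervals)
    # min_points is an assumption; the text only says "too many" points dropped.
    while len(samples) >= min_points:
        avg = sum(samples) / len(samples)
        variance = sum((x - avg) ** 2 for x in samples) / len(samples)
        # Small variance, or variance small relative to the average.
        if variance < var_limit or avg > 6 * math.sqrt(variance):
            return avg
        samples.pop()  # discard the longest remaining value and repeat
    return UINT_MAX


def predicted_idle_duration(intervals, sleep_length, correction_factor):
    """Minimum of the typical interval and the corrected sleep length."""
    return min(typical_interval(intervals), sleep_length * correction_factor)
```

With seven samples of 100 and one outlier of 100000, the outlier is discarded on the second pass and the typical interval comes out as 100.
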
> +Then, the governor computes an extra latency limit to help "interactive"
> +workloads. It uses the obsevation that if the exit latency of the selected idle

observation

> +state is comparable with the predicted idle duration, the total time spent in
> +that state probably will be very short and the amount of energy to save by
> +entering it will be relatively small, so likely it is better to avoid the
> +overhead related to entering that state and exiting it. Thus selecting a
> +shallower state is likely to be a better option then. The first approximation
> +of the extra latency limit is the predicted idle duration itself which
> +additionally is divided by a value depending on the number of tasks that
> +previously ran on the given CPU and now they are waiting for I/O operations to
> +complete. The result of that division is compared with the latency limit coming
> +from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_,
> +framework and the minimum of the two is taken as the limit for the idle states'
> +exit latency.
> +
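
The latency limit computation above could be sketched as follows. The text does not say how the divisor depends on the number of I/O waiters, so `iowait_divisor` below is a purely hypothetical placeholder; only the "divide, then take the minimum with the PM QoS limit" structure is taken from the description.

```python
def iowait_divisor(io_waiters):
    """Hypothetical divisor: grows with the number of tasks waiting for I/O."""
    return 1 + io_waiters


def exit_latency_limit(predicted_us, io_waiters, pm_qos_limit_us, divisor=iowait_divisor):
    """Limit for the idle states' exit latency, in microseconds."""
    interactivity_limit = predicted_us / divisor(io_waiters)
    # The stricter (smaller) of the two limits applies.
    return min(interactivity_limit, pm_qos_limit_us)
```
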
> +Now, the governor is ready to walk the list of idle states and choose one of
> +them. For this purpose, it compares the target residency of each state with
> +the predicted idle duration and the exit latecy of it with the computed latency

latency

> +limit. It selects the state with the target residency closest to the predicted
> +idle duration, but still below it, and exit latency that does not exceed the
> +limit.
> +
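
As an illustration of the selection rule above, here is a minimal sketch. The `IdleState` type and its field names are made up for the example, and treating a target residency equal to the predicted duration as acceptable, as well as returning `None` when nothing qualifies, are assumptions; the text does not cover those corner cases.

```python
from typing import NamedTuple, Optional, Sequence


class IdleState(NamedTuple):
    name: str
    target_residency_us: int
    exit_latency_us: int


def select_state(states: Sequence[IdleState], predicted_us: float,
                 latency_limit_us: float) -> Optional[int]:
    """Index of the eligible state whose target residency is closest to the prediction."""
    best = None
    for index, state in enumerate(states):
        if state.target_residency_us > predicted_us:
            continue  # would not pay off for the predicted idle duration
        if state.exit_latency_us > latency_limit_us:
            continue  # would violate the computed latency limit
        if best is None or state.target_residency_us > states[best].target_residency_us:
            best = index
    return best
```

The sketch deliberately does not rely on the order of the array, since it picks the eligible state with the largest target residency.
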
> +In the final step the governor may still need to refine the idle state selection
> +if it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_. That
> +happens if the idle duration predicted by it is less than the tick period and
> +the tick has not been stopped already (in a previous iteration of the idle
> +loop). Then, the sleep length used in the previous computations may not reflect
> +the real time until the closest timer event and if it really is geater than that

greater

> +time, the governor may need to select a shallower state with a suitable target
> +residency.
> +
> +

What about a short section for the ladder governor as well?

> +.. _idle-states-representation:
> +
> +Representation of Idle States
> +=============================
> +
> +For the CPU idle time management purposes all of the physical idle states
> +supported by the processor have to be represented as a one-dimensional array of
> +|struct cpuidle_state| objects each allowing an individual (logical) CPU to ask
> +the processor hardware to enter an idle state of certain properties. If there
> +is a hierarchy of units in the processor, one |struct cpuidle_state| object can
> +cover a combination of idle states supported by the units at different levels of
> +the hierarchy. In that case, the `target residency and exit latency parameters
> +of it <idle-loop_>`_, must reflect the properties of the idle state at the
> +deepest level (i.e. the idle state of the unit containing all of the other
> +units).
> +
> +For example, take a processor with two cores in a larger unit referred to as
> +a "module" and suppose that asking the hardware to enter a specific idle state
> +(say "X") at the "core" level by one core will trigger the module to try to
> +enter a specific idle state of its own (say "MX") if the other core is in idle
> +state "X" already. In other words, asking for idle state "X" at the "core"
> +level gives the hardware a license to go as deep as to idle state "MX" at the
> +"module" level, but there is no guarantee that this is going to happen (the core
> +asking for idle state "X" may just end up in that state by itself instead).
> +Then, the target residency of the |struct cpuidle_state| object representing
> +idle state "X" must reflect the minimum time to spend in idle state "MX" of
> +the module (including the time needed to enter it), because that is the minimum
> +time the CPU needs to be idle to save any energy in case the hardware enters
> +that state. Analogously, the exit latency parameter of that object must cover
> +the exit time of idle state "MX" of the module (and usually its entry time too),
> +because that is the maximum delay between a wakeup signal and the time the CPU
> +will start to execute the first new instruction (assuming that both cores in the
> +module will always be ready to execute instructions as soon as the module
> +becomes operational as a whole).
> +
> +In addition to the target residency and exit latency idle state parameters
> +discussed above, the objects representing idle states each contain a few other
> +parameters describing the idle state and a pointer to the function to run in
> +order to ask the hardware to enter that state. Also, for each
> +|struct cpuidle_state| object, there is a corresponding
> +:c:type:`struct cpuidle_state_usage <cpuidle_state_usage>` one containig usage

containing

> +statistics of the given idle state. That information is exposed by the kernel
> +via ``sysfs``.
> +
> +For each CPU in the system, there is a :file:`/sys/devices/system/cpu<N>/cpuidle/`
> +directory in ``sysfs``, where the number ``<N>`` is assigned to the given
> +CPU at the initialization time. That directory contains a set of subdirectories
> +called :file:`state0`, :file:`state1` and so on, up to the number of idle state
> +objects defined for the given CPU minus one. Each of these directories contains
> +a number of files (attributes) representing the properties of the idle state
> +object corresponding to it, as follows:
> +
> +
> +``desc``
> + Description of the idle state.
> +
> +``disable``
> + Whether or not this idle state is disabled.
> +
> +``latency``
> + Exit latency of the idle state in microseconds.
> +
> +``name``
> + Name of the idle state.
> +
> +``power``
> + Power drawn by hardware in this idle state in milliwatts (if specified,
> + 0 otherwise).
> +
> +``residency``
> + Target residency of the idle state in microseconds.
> +
> +``time``
> + Total time spent in this idle state by the given CPU (as measured by the
> + kernel) in microseconds.
> +
> +``usage``
> + Total number of times the hardware has been asked by the given CPU to
> + enter this idle state.
> +
> +The :file:`desc` and :file:`name` files both contain strings. The difference
> +between them is that the name is expected to be more concise, while the
> +description may be longer and it may contain white space or special characters.
> +The other files listed above contain integer numbers.
> +
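
A short user space example of reading these attributes might be useful to readers. The sketch below assumes the per-CPU directories live under `/sys/devices/system/cpu/cpu<N>/cpuidle/` (the layout used for the PM QoS file later in this document); the function name and the returned dictionary layout are made up for illustration, and attributes that are missing on a given system are simply skipped.

```python
from pathlib import Path

STR_ATTRS = ("desc", "name")
INT_ATTRS = ("disable", "latency", "power", "residency", "time", "usage")


def read_idle_states(cpu=0, sysfs_cpu_root="/sys/devices/system/cpu"):
    """Return one dict per stateN directory of the given CPU, ordered by N."""
    cpuidle_dir = Path(sysfs_cpu_root) / f"cpu{cpu}" / "cpuidle"
    state_dirs = sorted(cpuidle_dir.glob("state[0-9]*"),
                        key=lambda p: int(p.name[len("state"):]))
    states = []
    for state_dir in state_dirs:
        info = {"dir": state_dir.name}
        for attr in STR_ATTRS + INT_ATTRS:
            path = state_dir / attr
            if not path.exists():
                continue  # attribute not provided on this system
            text = path.read_text().strip()
            info[attr] = int(text) if attr in INT_ATTRS else text
        states.append(info)
    return states
```

Calling `read_idle_states(0)` on a system with ``CPUIdle`` enabled should return the attributes of `state0`, `state1` and so on for CPU 0.
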
> +The :file:`disable` attribute is the only writeable one. If it contains 1, the
> +given idle state is disabled for this particular CPU, which means that the
> +governor will never select it for this particular CPU and the ``CPUIdle``
> +driver will never ask the hardware to enter it for that CPU as a result.
> +However, disabling an idle state for one CPU does not prevent it from being
> +asked for by the other CPUs, so it must be disabled for all of them in order to
> +never be asked for by any of them. [Note that, due to the way the ``ladder``
> +governor is implemented, disabling an idle state prevents that governor from
> +selecting any idle states deeper than the disabled one too.]
> +
> +If the :file:`disable` attribute contains 0, the given idle state is enabled for
> +this particular CPU, but it still may be disabled for some or all of the other
> +CPUs in the system at the same time. Writing 1 to it causes the idle state to
> +be disabled for this particular CPU and writing 0 to it allows the governor to
> +take it into consideration for the given CPU and the driver to ask for it,
> +unless that state was disabled globally in the driver (in which case it cannot
> +be used at all).
> +
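
Since an idle state has to be disabled for every CPU to never be asked for at all, a small helper along these lines may be worth showing. It is a sketch only: the function name is hypothetical, the `/sys/devices/system/cpu/cpu<N>/cpuidle/` layout is assumed, and on a real system the writes need sufficient privileges.

```python
from pathlib import Path


def set_state_disabled(state_index, disabled, sysfs_cpu_root="/sys/devices/system/cpu"):
    """Write 1 (disable) or 0 (enable) to stateN/disable for every CPU found."""
    value = "1" if disabled else "0"
    pattern = f"cpu[0-9]*/cpuidle/state{state_index}/disable"
    written = []
    for attr in sorted(Path(sysfs_cpu_root).glob(pattern)):
        attr.write_text(value)
        written.append(str(attr))
    return written  # paths that were updated, for logging or verification
```
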
> +The :file:`power` attribute is not defined very well, especially for idle state
> +objects representing combinations of idle states at different levels of the
> +hierarchy of units in the processor, and it generally is hard to obtain idle
> +state power numbers for complex hardware, so :file:`power` often contains 0 (not
> +available) and if it contains a nonzero number, that number may not be very
> +accurate and it should not be relied on for anything meaningful.
> +
> +The number in the :file:`time` file generally may be greater than the total time
> +really spent by the given CPU in the given idle state, because it is measured by
> +the kernel and it may not cover the cases in which the hardware refused to enter
> +this idle state and entered a shallower one instead of it (or even it did not
> +enter any idle state at all). The kernel can only measure the time span between
> +asking the hardware to enter an idle state and the subsequent wakeup of the CPU
> +and it cannot say what really happened in the meantime at the hardware level.
> +Moreover, if the idle state object in question represents a combination of idle
> +states at different levels of the hierarchy of units in the processor,
> +the kernel can never say how deep the hardware went down the hierarchy in any
> +particular case. For these reasons, the only reliable way to find out how
> +much time has been spent by the hardware in different idle states supported by
> +it is to use idle state residency counters in the hardware, if available.
> +
> +

Maybe I missed it, but I couldn't find any text that says what state 0, 1, ... N
mean, i.e. which one is the deepest idle state and which one is the shallowest.

> +.. _cpu-pm-qos:
> +
> +Power Management Quality of Service for CPUs
> +============================================
> +
> +The power management quality of service (PM QoS) framework in the Linux kernel
> +allows kernel code and user space processes to set constraints on various
> +energy-efficiency features of the kernel to prevent performance from dropping
> +below a required level. The PM QoS constraints can be set globally, in
> +predefined categories referred to as PM QoS classes, or against individual
> +devices.
> +
> +CPU idle time management can be affected by PM QoS in two ways, through the
> +global constraint in the ``PM_QOS_CPU_DMA_LATENCY`` class and through the
> +resume latency constraints for individual CPUs. Kernel code (e.g. device
> +drivers) can set both of them with the help of special internal interfaces
> +provided by the PM QoS framework. User space can modify the former by opeining

opening

> +the :file:`cpu_dma_latency` special device file under :file:`/dev/` and writing
> +a binary value (interpreted as a signed 32-bit integer) to it. In turn, the
> +resume latency constraint for a CPU can be modified by user space by writing a
> +string (representing a signed 32-bit integer) to the
> +:file:`power/pm_qos_resume_latency_us` file under
> +:file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs``, where the CPU number
> +``<N>`` is allocated at the system initialization time. Negative values
> +will be rejected in both cases and, also in both cases, the written integer
> +number will be interpreted as a requested PM QoS constraint in microseconds.
> +
> +The requested value is not automatically applied as a new constraint, however,
> +as it may be less restrictive (greater in this particular case) than another
> +constraint previously requested by someone else. For this reason, the PM QoS
> +framework maintains a list of requests that have been made so far in each
> +global class and for each device, aggregates them and applies the effective
> +(minimum in this particular case) value as the new constraint.
> +
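
The aggregation described above can be modelled with a few lines of Python. This is a toy model, not the kernel's priority list: the class and method names are invented, and the value reported when no requests exist (`default_us`) is an assumption, since the text does not say what applies in that case.

```python
class LatencyRequestList:
    """Toy model: the effective constraint is the minimum of all requests."""

    def __init__(self, default_us):
        self._default_us = default_us  # assumed value when the list is empty
        self._requests = {}
        self._next_id = 0

    def add(self, value_us):
        if value_us < 0:
            raise ValueError("negative values are rejected")
        request_id = self._next_id
        self._next_id += 1
        self._requests[request_id] = value_us
        return request_id

    def update(self, request_id, value_us):
        if value_us < 0:
            raise ValueError("negative values are rejected")
        self._requests[request_id] = value_us

    def remove(self, request_id):
        del self._requests[request_id]

    def effective(self):
        return min(self._requests.values(), default=self._default_us)
```

A new request only changes the effective value if it is smaller than all of the others, which matches the "less restrictive" remark above.
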
> +In fact, opening the :file:`cpu_dma_latency` special device file causes a new
> +PM QoS request to be created and added to the priority list of requests in the
> +``PM_QOS_CPU_DMA_LATENCY`` class and the file descriptor coming from the
> +"open" operation represents that request. If that file descriptor is then
> +used for writing, the number written to it will be associated with the PM QoS
> +request represented by it as a new requested constraint value. Next, the
> +priority list mechanism will be used to determine the new effective value of
> +the entire list of requests and that effective value will be set as a new
> +constraint. Thus setting a new requested constraint value will only change the
> +real constraint if the effective "list" value is affected by it. In particular,
> +for the ``PM_QOS_CPU_DMA_LATENCY`` class it only affects the real constraint if
> +it is the minimum of the requested contraints in the list. The process holding

constraints

> +a file descriptor obtained by opening the :file:`cpu_dma_latency` special device
> +file controls the PM QoS request associated with that file descriptor, but it
> +controls this particular PM QoS request only.
> +
> +Closing the :file:`cpu_dma_latency` special device file or, more precisely, the
> +file descriptor obtained while opening it, causes the PM QoS request associated
> +with that file descriptor to be removed from the ``PM_QOS_CPU_DMA_LATENCY``
> +class priority list and destroyed. If that happens, the priority list mechanism
> +will be used, again, to determine the new effective value for the whole list
> +and that value will become the new real constraint.
> +
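
An example of the open/write/close life cycle could look like the sketch below. The context manager name is made up; the only things taken from the text are the device file name, the binary signed 32-bit value in microseconds, and the fact that the request lives as long as the file descriptor stays open. Native byte order is assumed for the written value.

```python
import os
import struct
from contextlib import contextmanager


@contextmanager
def cpu_dma_latency_request(limit_us, path="/dev/cpu_dma_latency"):
    """Hold a PM QoS latency request for as long as the 'with' block runs."""
    if limit_us < 0:
        raise ValueError("negative values are rejected")
    fd = os.open(path, os.O_WRONLY)  # opening creates the request
    try:
        # Binary signed 32-bit integer, native byte order (assumption).
        os.write(fd, struct.pack("=i", limit_us))
        yield fd
    finally:
        os.close(fd)  # closing removes the request
```

Usage would be `with cpu_dma_latency_request(20): run_latency_sensitive_work()`, where `run_latency_sensitive_work` is a placeholder for the caller's own code; opening the device file normally requires sufficient privileges.
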
> +In turn, for each CPU there is only one resume latency PM QoS request
> +associated with the :file:`power/pm_qos_resume_latency_us` file under
> +:file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs`` and writing to it causes
> +this single PM QoS request to be updated regardless of which user space
> +process does that. In other words, this PM QoS request is shared by the entire
> +user space, so access to the file associated with it needs to be arbitrated
> +to avoid confusion. [Arguably, the only legitimate use of this mechanism in
> +practice is to pin a process to the CPU in question and let it use the
> +``sysfs`` interface to control the resume latency constraint for it.] It
> +still only is a request, however. It is a member of a priority list used to
> +determine the effective value to be set as the resume latency constraint for the
> +CPU in question every time the list of requests is updated this way or another
> +(there may be other requests coming from kernel code in that list).
> +
> +CPU idle time governors are expected to regard the minimum of the global
> +effective ``PM_QOS_CPU_DMA_LATENCY`` class constraint and the effective
> +resume latency constraint for the given CPU as the upper limit for the exit
> +latency of the idle states they can select for that CPU. They should never
> +select any idle states with exit latency beyond that limit.
> +

--
viresh