Re: [PATCH v4] power: introduce library for device-specific OPPs

From: Paul E. McKenney
Date: Fri Sep 24 2010 - 17:40:14 EST


On Fri, Sep 24, 2010 at 04:26:21PM -0500, Nishanth Menon wrote:
> Paul E. McKenney had written, on 09/24/2010 02:37 PM, the following:
> [...]
> >
> >Looks like a good start!!! Some questions and suggestions about RCU
> Thanks for the review.. few comments below..

Back at you! ;-)

> >usage interspersed below.
> >
> > Thanx, Paul
> [...]
> >>+
> >>+/**
> >>+ * find_device_opp() - find device_opp struct using device pointer
> >>+ * @dev: device pointer used to lookup device OPPs
> >>+ *
> >>+ * Search list of device OPPs for one containing matching device. Does a RCU
> >>+ * reader operation to grab the pointer needed.
> >>+ *
> >>+ * Returns pointer to 'struct device_opp' if found, otherwise -ENODEV or
> >>+ * -EINVAL based on type of error.
> >>+ */
> >>+static struct device_opp *find_device_opp(struct device *dev)
> >>+{
> >>+ struct device_opp *tmp_dev_opp, *dev_opp = ERR_PTR(-ENODEV);
> >>+
> >>+ if (unlikely(!dev || IS_ERR(dev))) {
> >>+ pr_err("%s: Invalid parameters being passed\n", __func__);
> >>+ return ERR_PTR(-EINVAL);
> >>+ }
> >>+
> >>+ rcu_read_lock();
> >>+ list_for_each_entry_rcu(tmp_dev_opp, &dev_opp_list, node) {
> >>+ if (tmp_dev_opp->dev == dev) {
> >>+ dev_opp = tmp_dev_opp;
> >>+ break;
> >>+ }
> >>+ }
> >>+ rcu_read_unlock();
> >
> >What prevents the structure pointed to by dev_opp from being freed
> >at this point? We are no longer in an RCU read-side critical section,
> >so RCU grace periods starting during the above RCU read-side critical
> >section can now end.
>
> dev_opp is never freed in the implementation -> it represents
> domains, only adds with list_add_rcu() is done -> wont the usage be
> safe then? or I being blind?
>
> the reason why we dont free is coz of the following: dev_opp
> represents voltage domains in opp library. SoC frameworks are
> required to register only those voltage domain opp that are
> required. by allowing a free logic, I knew it'd have complicated the
> implementation way beyond what we needed it to be.

Perhaps I was confusing two different data structures, if so, apologies.

So you are freeing the opp level, but never the dev_opp level, then?

But yes, if you are only adding and never deleting, then it is safe to
pass the pointers out of an RCU read-side critical section. But please
add a comment saying why you are doing this. Otherwise, Coccinelle will
cause me to continue complaining about this to you. ;-)

And the later uses still look buggy to me, please see below.

> >Here is an example sequence of events that I am worried about:
> >
> >o CPU 1 enters find_device_opp(), and pick up a pointer to
> > a given device opp.
> >
> >o CPU 2 executes opp_set_availability(), replacing that same
> > device opp with a new one. It then calls synchronize_rcu()
> > which blocks waiting for CPU 1 to exit its RCU read-side
> > critical section.
> >
> >o CPU 1 exits its RCU read-side critical section, arriving at
> > this point in the code.
> >
> >o CPU 2's synchronize_rcu() is now permitted to return, executing
> > the kfree(), which frees up the memory that CPU 1's dev_opp
> > pointer references.
> >
> >o This newly freed memory is allocated for some other structure
> > by CPU 3. CPU 1 and CPU 3 are now trying to use the same
> > memory for two different structures, and nothing good can
> > possibly come of this. The kernel dies a brutal and nasty
> > death.
> >
> >One way to fix this is to have the caller do rcu_read_lock() before
> >calling find_device_opp(), and to do rcu_read_unlock() only after the
> >caller has finished using the pointer that find_device_opp() returns.
> >This works well unless the caller needs to do some blocking operation
> >before it gets done using the pointer.
> >
> >Another approach is for find_device_opp() to use a reference count on
> >the structure, and for opp_set_availability() to avoid freeing the
> >structure unless/until the reference counter drops to zero.
> >
> >There are other approaches as well, please feel free to take a look
> >at Documentation/RCU/rcuref.txt for more info on using reference
> >counting and RCU.
> thx. I probably should read yet again if I got my understanding of
> usage right..
>
> >
> [...]
> >>+
> >>+/**
> >>+ * opp_find_freq_exact() - search for an exact frequency
> >>+ * @dev: device for which we do this operation
> >>+ * @freq: frequency to search for
> >>+ * @is_available: true/false - match for available opp
> >>+ *
> >>+ * Searches for exact match in the opp list and returns pointer to the matching
> >>+ * opp if found, else returns ERR_PTR in case of error and should be handled
> >>+ * using IS_ERR.
> >>+ *
> >>+ * Note: available is a modifier for the search. if available=true, then the
> >>+ * match is for exact matching frequency and is available in the stored OPP
> >>+ * table. if false, the match is for exact frequency which is not available.
> >>+ *
> >>+ * This provides a mechanism to enable an opp which is not available currently
> >>+ * or the opposite as well.
> >>+ *
> >>+ * Locking: RCU reader.
> >>+ */
> >>+struct opp *opp_find_freq_exact(struct device *dev,
> >>+ unsigned long freq, bool available)
> >>+{
> >>+ struct device_opp *dev_opp;
> >>+ struct opp *temp_opp, *opp = ERR_PTR(-ENODEV);
> >>+
> >>+ dev_opp = find_device_opp(dev);
> >>+ if (IS_ERR(dev_opp))
> >>+ return opp;
> >>+
> >>+ rcu_read_lock();
> >>+ list_for_each_entry_rcu(temp_opp, &dev_opp->opp_list, node) {
> >>+ if (temp_opp->available == available &&
> >>+ temp_opp->rate == freq) {
> >>+ opp = temp_opp;
> >>+ break;
> >>+ }
> >>+ }
> >>+ rcu_read_unlock();
> >
> >But this one sadly has the same problem that find_device_opp() does.
> is the concern about opp OR about dev_opp here? I am guessing opp..
> >
> >>+ return opp;
> >>+}
> >>+
> >>+/**
> >>+ * opp_find_freq_ceil() - Search for an rounded ceil freq
> >>+ * @dev: device for which we do this operation
> >>+ * @freq: Start frequency
> >>+ *
> >>+ * Search for the matching ceil *available* OPP from a starting freq
> >>+ * for a device.
> >>+ *
> >>+ * Returns matching *opp and refreshes *freq accordingly, else returns
> >>+ * ERR_PTR in case of error and should be handled using IS_ERR.
> >>+ *
> >>+ * Locking: RCU reader.
> >>+ */
> >>+struct opp *opp_find_freq_ceil(struct device *dev, unsigned long *freq)
> >>+{
> >>+ struct device_opp *dev_opp;
> >>+ struct opp *temp_opp, *opp = ERR_PTR(-ENODEV);
> >>+
> >>+ if (!dev || !freq) {
> >>+ pr_err("%s: invalid param dev=%p freq=%p\n", __func__,
> >>+ dev, freq);
> >>+ return ERR_PTR(-EINVAL);
> >>+ }
> >>+
> >>+ dev_opp = find_device_opp(dev);
> >>+ if (IS_ERR(dev_opp))
> >>+ return opp;
> >>+
> >>+ rcu_read_lock();
> >>+ list_for_each_entry_rcu(temp_opp, &dev_opp->opp_list, node) {
> >>+ if (temp_opp->available && temp_opp->rate >= *freq) {
> >>+ opp = temp_opp;
> >>+ *freq = opp->rate;
> >>+ break;
> >>+ }
> >>+ }
> >>+ rcu_read_unlock();
> >
> >And this one also has the same problem that find_device_opp() does.
>
> guessing opp ptr here.. am I right? if it is about device_opp, it is
> not going to be freed as I mentioned above - at least we dont give
> an function to update(hence free) it.

It really does look to me that you are passing a pointer to the thing
being freed out of an RCU read-side critical section. If you are really
doing this, you do need to do something to fix it. I outlined some of
the options earlier.

> >>+ return opp;
> >>+}
> >>+
> >>+/**
> >>+ * opp_find_freq_floor() - Search for a rounded floor freq
> >>+ * @dev: device for which we do this operation
> >>+ * @freq: Start frequency
> >>+ *
> >>+ * Search for the matching floor *available* OPP from a starting freq
> >>+ * for a device.
> >>+ *
> >>+ * Returns matching *opp and refreshes *freq accordingly, else returns
> >>+ * ERR_PTR in case of error and should be handled using IS_ERR.
> >>+ *
> >>+ * Locking: RCU reader.
> >>+ */
> >>+struct opp *opp_find_freq_floor(struct device *dev, unsigned long *freq)
> >>+{
> >>+ struct device_opp *dev_opp;
> >>+ struct opp *temp_opp, *opp = ERR_PTR(-ENODEV);
> >>+
> >>+ if (!dev || !freq) {
> >>+ pr_err("%s: invalid param dev=%p freq=%p\n", __func__,
> >>+ dev, freq);
> >>+ return ERR_PTR(-EINVAL);
> >>+ }
> >>+
> >>+ dev_opp = find_device_opp(dev);
> >>+ if (IS_ERR(dev_opp))
> >>+ return opp;
> >>+
> >>+ rcu_read_lock();
> >>+ list_for_each_entry_rcu(temp_opp, &dev_opp->opp_list, node) {
> >>+ if (temp_opp->available) {
> >>+ /* go to the next node, before choosing prev */
> >>+ if (temp_opp->rate > *freq)
> >>+ break;
> >>+ else
> >>+ opp = temp_opp;
> >>+ }
> >>+ }
> >>+ if (!IS_ERR(opp))
> >>+ *freq = opp->rate;
> >>+ rcu_read_unlock();
> >
> >As does this one.
>
> guessing opp ptr here.. am I right?

Again, here it looks to me like you are passing a pointer out of an RCU
read-side critical section that could be freed out from under you. If
so, again, this must be fixed.

> >>+ return opp;
> >>+}
> >>+
> >>+/**
> >>+ * opp_add() - Add an OPP table from a table definitions
> >>+ * @dev: device for which we do this operation
> >>+ * @freq: Frequency in Hz for this OPP
> >>+ * @u_volt: Voltage in uVolts for this OPP
> >>+ *
> >>+ * This function adds an opp definition to the opp list and returns status.
> >>+ * The opp is made available by default and it can be controlled using
> >>+ * opp_enable/disable functions.
> >>+ *
> >>+ * Locking: RCU, mutex
> >>+ */
> >>+int opp_add(struct device *dev, unsigned long freq, unsigned long u_volt)
> >>+{
> >>+ struct device_opp *tmp_dev_opp, *dev_opp = NULL;
> >>+ struct opp *opp, *new_opp;
> >>+ struct list_head *head;
> >>+
> >>+ /* Check for existing list for 'dev' */
> >>+ rcu_read_lock();
> >>+ list_for_each_entry_rcu(tmp_dev_opp, &dev_opp_list, node) {
> >>+ if (dev == tmp_dev_opp->dev) {
> >>+ dev_opp = tmp_dev_opp;
> >>+ break;
> >>+ }
> >>+ }
> >>+ rcu_read_unlock();
> >>+
> >>+ /* allocate new OPP node */
> >>+ new_opp = kzalloc(sizeof(struct opp), GFP_KERNEL);
> >>+ if (!new_opp) {
> >>+ pr_warning("%s: unable to allocate new opp node\n",
> >>+ __func__);
> >>+ return -ENOMEM;
> >>+ }
> >>+
> >>+ if (!dev_opp) {
> >>+ /* Allocate a new device OPP table */
> >>+ dev_opp = kzalloc(sizeof(struct device_opp), GFP_KERNEL);
> >>+ if (!dev_opp) {
> >>+ kfree(new_opp);
> >>+ pr_warning("%s: unable to allocate device structure\n",
> >>+ __func__);
> >>+ return -ENOMEM;
> >>+ }
> >>+
> >>+ dev_opp->dev = dev;
> >>+ INIT_LIST_HEAD(&dev_opp->opp_list);
> >>+ mutex_init(&dev_opp->lock);
> >>+
> >>+ /* Secure the device list modification */
> >>+ mutex_lock(&dev_opp_list_lock);
> >>+ list_add_rcu(&dev_opp->node, &dev_opp_list);
> >>+ mutex_unlock(&dev_opp_list_lock);
> >>+ synchronize_rcu();
> >
> >You do not need to wait for an RCU grace period when adding objects, only
> >between removing them and freeing them.
>
> ouch.. my bad.. thx.. will fix
>
> >
> >>+ }
> >>+
> >>+ /* populate the opp table */
> >>+ new_opp->dev_opp = dev_opp;
> >>+ new_opp->rate = freq;
> >>+ new_opp->u_volt = u_volt;
> >>+ new_opp->available = true;
> >>+
> >>+ /* make the dev_opp modification safe */
> >>+ mutex_lock(&dev_opp->lock);
> >>+
> >>+ rcu_read_lock();
> >>+ /* Insert new OPP in order of increasing frequency */
> >>+ head = &dev_opp->opp_list;
> >>+ list_for_each_entry_rcu(opp, &dev_opp->opp_list, node) {
> >>+ if (new_opp->rate < opp->rate)
> >>+ break;
> >>+ else
> >>+ head = &opp->node;
> >>+ }
> >>+ rcu_read_unlock();
> >>+
> >>+ list_add_rcu(&new_opp->node, head);
> >>+ mutex_unlock(&dev_opp->lock);
> >>+ synchronize_rcu();
> >
> >Ditto.
> thx.. will fix.
>
> >
> >>+ return 0;
> >>+}
> >>+
> >>+/**
> >>+ * opp_set_availability() - helper to set the availability of an opp
> >>+ * @opp: Pointer to opp
> >>+ * @availability_req: availability status requested for this opp
> >>+ *
> >>+ * Set the availability of an OPP with an RCU operation, opp_{enable,disable}
> >>+ * share a common logic which is isolated here.
> >>+ *
> >>+ * Returns -EINVAL for bad pointers, -ENOMEM if no memory available for the
> >>+ * copy operation, returns 0 if no modifcation was done OR modification was
> >>+ * successful.
> >>+ */
> >>+static int opp_set_availability(struct opp *opp, bool availability_req)
> >>+{
> >>+ struct opp *new_opp, *tmp_opp;
> >>+ bool is_available;
> >>+
> >>+ if (unlikely(!opp || IS_ERR(opp))) {
> >>+ pr_err("%s: Invalid parameters being passed\n", __func__);
> >>+ return -EINVAL;
> >>+ }
> >>+
> >>+ new_opp = kmalloc(sizeof(struct opp), GFP_KERNEL);
> >>+ if (!new_opp) {
> >>+ pr_warning("%s: unable to allocate opp\n", __func__);
> >>+ return -ENOMEM;
> >>+ }
> >>+
> >>+ mutex_lock(&opp->dev_opp->lock);
> >>+
> >>+ rcu_read_lock();
> >>+ tmp_opp = rcu_dereference(opp);
> >>+ is_available = tmp_opp->available;
> >>+ rcu_read_unlock();
> >>+
> >>+ /* Is update really needed? */
> >>+ if (is_available == availability_req) {
> >>+ mutex_unlock(&opp->dev_opp->lock);
> >>+ kfree(tmp_opp);
> >>+ return 0;
> >>+ }
> >>+
> >>+ *new_opp = *opp;
> >>+ new_opp->available = availability_req;
> >>+ list_replace_rcu(&opp->node, &new_opp->node);
> >>+
> >>+ mutex_unlock(&opp->dev_opp->lock);
> >>+ synchronize_rcu();
> >
> >If you decide to rely on reference counts to fix the problem in
> >find_device_opp(), you will need to check the reference counts here.
> >Again, please see Documentation/RCU/rcuref.txt.
>
> Does the original point about not needing to free dev_opp resolve this?

For the dev_opp case, yes. However, I believe that my point is still
valid for the opp case.

> >>+ kfree(opp);
> >>+
> >>+ return 0;
> >>+}
> >>+
> >>+/**
> >>+ * opp_enable() - Enable a specific OPP
> >>+ * @opp: Pointer to opp
> >>+ *
> >>+ * Enables a provided opp. If the operation is valid, this returns 0, else the
> >>+ * corresponding error value. It is meant to be used for users an OPP available
> >>+ * after being temporarily made unavailable with opp_disable.
> >>+ *
> >>+ * Locking: RCU, mutex
> >
> >By "Locking: RCU", you presumably don't mean that the caller must do
> >an rcu_read_lock() -- this would result in a synchronize_rcu() being
> >invoked in an RCU read-side critical section, which is illegal.
> aye..thx. I will make it more verbose. Does the following sound right?
>
> Locking used internally: RCU copy-update and read_lock used, mutex
>
> and for the readers:
>
> Locking used internally: RCU read_lock used
>
> or do we need to go all verbatim about the implementation here?
>
> I intended the user to know the context in which they can call it,
> for example, since mutex is used, dont think of using this in
> interrupt context. since read_locks are already used, dont need to
> double lock it.. opp library takes care of it's own exclusivity.

I would stick to the constraints on the caller, and describe the internals
elsewhere, for example, near the data-structure definitions. But tastes
do vary on this.

> >>+ */
> >>+int opp_enable(struct opp *opp)
> >>+{
> >>+ return opp_set_availability(opp, true);
> >>+}
> >>+
> >>+/**
> >>+ * opp_disable() - Disable a specific OPP
> >>+ * @opp: Pointer to opp
> >>+ *
> >>+ * Disables a provided opp. If the operation is valid, this returns
> >>+ * 0, else the corresponding error value. It is meant to be a temporary
> >>+ * control by users to make this OPP not available until the circumstances are
> >>+ * right to make it available again (with a call to opp_enable).
> >>+ *
> >>+ * Locking: RCU, mutex
> >
> >Ditto. (And similar feedback applies elsewhere.)
> >
> >>+ */
> >>+int opp_disable(struct opp *opp)
> >>+{
> >>+ return opp_set_availability(opp, false);
> >>+}
> >>+
> >>+#ifdef CONFIG_CPU_FREQ
> >>+/**
> >>+ * opp_init_cpufreq_table() - create a cpufreq table for a device
> >>+ * @dev: device for which we do this operation
> >>+ * @table: Cpufreq table returned back to caller
> >>+ *
> >>+ * Generate a cpufreq table for a provided device- this assumes that the
> >>+ * opp list is already initialized and ready for usage.
> >>+ *
> >>+ * This function allocates required memory for the cpufreq table. It is
> >>+ * expected that the caller does the required maintenance such as freeing
> >>+ * the table as required.
> >>+ *
> >>+ * WARNING: It is important for the callers to ensure refreshing their copy of
> >>+ * the table if any of the mentioned functions have been invoked in the interim.
> >>+ *
> >>+ * Locking: RCU reader
> >>+ */
> >>+void opp_init_cpufreq_table(struct device *dev,
> >>+ struct cpufreq_frequency_table **table)
> >>+{
> >>+ struct device_opp *dev_opp;
> >>+ struct opp *opp;
> >>+ struct cpufreq_frequency_table *freq_table;
> >>+ int i = 0;
> >>+
> >>+ dev_opp = find_device_opp(dev);
> >>+ if (IS_ERR(dev_opp)) {
> >>+ pr_warning("%s: unable to find device\n", __func__);
> >>+ return;
> >>+ }
> >>+
> >>+ freq_table = kzalloc(sizeof(struct cpufreq_frequency_table) *
> >>+ (opp_get_opp_count(dev) + 1), GFP_ATOMIC);
> >>+ if (!freq_table) {
> >>+ pr_warning("%s: failed to allocate frequency table\n",
> >>+ __func__);
> >>+ return;
> >
> >How does the caller tell that the allocation failed? Should the caller
> >set the pointer passed in through the "table" argument to NULL before
> >calling this function? Or should this function set *table to NULL
> >before returning in this case?
>
> Good catch. Thanks. I would rather change the return to int and pass
> proper errors to caller so that they can handle it appropriately.

Works for me!

Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/