So back to the original problem, which in short, revolves around these two
functions (with comments included by me):
/* called with kernfs rwsem for this kobj held */
static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf)
{
struct cpufreq_policy *policy = to_policy(kobj);
struct freq_attr *fattr = to_attr(attr);
ssize_t ret;
if (!down_read_trylock(&cpufreq_rwsem))
return -EINVAL;
down_read(&policy->rwsem);
and
static ssize_t store(struct kobject *kobj, struct attribute *attr,
const char *buf, size_t count)
{
struct cpufreq_policy *policy = to_policy(kobj);
struct freq_attr *fattr = to_attr(attr);
ssize_t ret = -EINVAL;
get_online_cpus();
if (!cpu_online(policy->cpu))
goto unlock;
if (!down_read_trylock(&cpufreq_rwsem))
goto unlock;
down_write(&policy->rwsem);
/* if governor switch, calls sysfs_remove_group
* and acquires kernfs rwsem for this kobj */
There's another bug here which I haven't had a chance to discuss. There's the
possibility that we have multiple threads waiting at the store's
down_write(&policy->rwsem) when another thread does a governor switch. When
the policy->rwsem is released by the governor switch thread, all the other
threads will enter this code path and race through while looking at stale data.
We hit some NULL pointers (very similar to the ones originally reported in this
thread) and, of course, the system dies.
I wonder if the show() down_read(&policy->rwsem) needs to be a
down_read_trylock(), and similarily in the store the down_write(&policy->rwsem)
needs to be a down_write_trylock() ?
We would also have to do a check on policy->governor_enabled to verify that
the data was still valid after taking the lock.
*If* we were to make this change, does that also happen to fix the risk of a
deadlock in this code?
That might be too hacky ... gotta be a better way :/ ...
Anyway, just a thought.