Re: [linux-pm] [PATCHv3 3/5] cpuidle: add support for states that affect multiple cpus
From: Rafael J. Wysocki
Date: Fri May 04 2012 - 18:23:07 EST
On Friday, May 04, 2012, Colin Cross wrote:
> On Fri, May 4, 2012 at 4:51 AM, Rafael J. Wysocki <rjw@xxxxxxx> wrote:
> > On Friday, May 04, 2012, Colin Cross wrote:
> >> On Thu, May 3, 2012 at 3:14 PM, Rafael J. Wysocki <rjw@xxxxxxx> wrote:
> > [...]
> >>
> >> >> +/**
> >> >> + * cpuidle_coupled_cpus_waiting - check if all cpus in a coupled set are waiting
> >> >> + * @coupled: the struct coupled that contains the current cpu
> >> >> + *
> >> >> + * Returns true if all cpus coupled to this target state are in the wait loop
> >> >> + */
> >> >> +static inline bool cpuidle_coupled_cpus_waiting(struct cpuidle_coupled *coupled)
> >> >> +{
> >> >> + int alive;
> >> >> + int waiting;
> >> >> +
> >> >> + /*
> >> >> + * Read alive before reading waiting so a booting cpu is not treated as
> >> >> + * idle
> >> >> + */
> >> >
> >> > Well, the comment doesn't really explain much. In particular, why the boot CPU
> >> > could be treated as idle if the reads were in a different order.
> >>
> >> Hm, I think the race condition is on a cpu going down. What about:
> >> Read alive before reading waiting. If waiting is read before alive,
> >> this cpu could see another cpu as waiting just before it goes offline,
> >> between when it the other cpu decrements waiting and when it
> >> decrements alive, which could cause alive == waiting when one cpu is
> >> not waiting.
> >
> > Reading them in this particular order doesn't stop the race, though. I mean,
> > if the hotplug happens just right after you've read alive_count, you still have
> > a wrong value. waiting_count is set independently, it seems, so there's no
> > ordering between the two on the "store" side and the "load" side ordering
> > doesn't matter.
>
> As commented in the hotplug path, hotplug relies on the fact that one
> of the cpus in the cluster is involved in the hotplug of the cpu that
> is changing (this may not be true for multiple clusters, but it is
> easy to fix by IPI-ing to a cpu that is in the same cluster when that
> happens).
That's very fragile and potentially sets a trap for people trying to make
the kernel work on systems with multiple clusters.
> That means that waiting count is always guaranteed to be at
> least 1 less than alive count when alive count changes. All this read
> ordering needs to do is make sure that this cpu doesn't see
> waiting_count == alive_count by reading them in the wrong order.
So, the concern seems to be that if the local CPU reorders the reads
from waiting_count and alive_count and enough time elapses between one
read and the other, the decrementation of waiting_count may happen
between them and then the CPU may use the outdated value for comparison,
right?
Still, though, even if the barrier is there, the modification of
alive_count in the hotplug notifier routine may not happen before
the read from alive_count in cpuidle_coupled_cpus_waiting() is completed.
Isn't that a problem?
> > I would just make the CPU hotplug notifier routine block until
> > cpuidle_enter_state_coupled() is done and the latter return immediately
> > if the CPU hotplug notifier routine is in progress, perhaps falling back
> > to the safe state. Or I would make the CPU hotplug notifier routine
> > disable the "coupled cpuidle" entirely on DOWN_PREPARE and UP_PREPARE
> > and only re-enable it after the hotplug has been completed.
>
> I'll take a look at disabling coupled idle completely during hotplug.
Great, thanks!
> >> >> + alive = atomic_read(&coupled->alive_count);
> >> >> + smp_rmb();
> >> >> + waiting = atomic_read(&coupled->waiting_count);
> >> >
> >> > Have you considered using one atomic variable to accommodate both counters
> >> > such that the upper half contains one counter and the lower half contains
> >> > the other?
> >>
> >> There are 3 counters (alive, waiting, and ready). Do you want me to
> >> squish all of them into a single atomic_t, which would limit to 1023
> >> cpus?
> >
> > No. I'd make sure that cpuidle_enter_state_coupled() did't race with CPU
> > hotplug, so as to make alive_count stable from its standpoint, and I'd
> > put the two remaining counters into one atomic_t variable.
>
> I'll take a look at using a single atomic_t. My initial worry was
> that the increased contention on the shared variable would cause more
> cmpxchg retries, but since waiting_count and ready_count are designed
> to be modified in sequential phases that shouldn't be an issue.
>
[...]
> >> >> + while (!need_resched() && !cpuidle_coupled_cpus_waiting(coupled)) {
> >> >> + entered_state = cpuidle_enter_state(dev, drv,
> >> >> + dev->safe_state_index);
> >> >> +
> >> >> + local_irq_enable();
> >> >> + while (cpumask_test_cpu(dev->cpu, &cpuidle_coupled_poked_mask))
> >> >> + cpu_relax();
> >> >
> >> > Hmm. What exactly is this loop supposed to achieve?
> >> This is to ensure that the outstanding wakeups have been processed so
> >> we don't go to idle with an interrupt pending an immediately wake up.
> >
> > I see. Is it actually safe to reenable interrupts at this point, though?
>
> I think so. The normal idle loop will enable interrupts in a similar
> fashion to what happens here. There are two things to worry about: a
> processed interrupt causing work to be scheduled that should bring
> this cpu out of idle, or changing the next timer which would
> invalidate the current requested state. The first is handled by
> checking need_resched() after interrupts are disabled again, the
> second is currently unhandled but does not affect correct operation,
> it just races into a less-than-optimal idle state.
I see.
> >> >> + local_irq_disable();
> >> >
> >> > Anyway, you seem to be calling it twice along with this enabling/disabling of
> >> > interrupts. I'd put that into a separate function and explain its role in a
> >> > kerneldoc comment.
> >>
> >> I left it here to be obvious that I was enabling interrupts in the
> >> idle path, but I can refactor it out if you prefer.
> >
> > Well, you can call the function to make it obvious. :-)
> >
> > Anyway, I think that code duplication is a worse thing than a reasonable
> > amount of non-obviousness, so to speak.
> >
> >> >> + }
> >> >> +
> >> >> + /* give a chance to process any remaining pokes */
> >> >> + local_irq_enable();
> >> >> + while (cpumask_test_cpu(dev->cpu, &cpuidle_coupled_poked_mask))
> >> >> + cpu_relax();
> >> >> + local_irq_disable();
> >> >> +
> >> >> + if (need_resched()) {
> >> >> + cpuidle_coupled_set_not_waiting(dev, coupled);
> >> >> + goto out;
> >> >> + }
> >> >> +
> >> >> + /*
> >> >> + * All coupled cpus are probably idle. There is a small chance that
> >> >> + * one of the other cpus just became active. Increment a counter when
> >> >> + * ready, and spin until all coupled cpus have incremented the counter.
> >> >> + * Once a cpu has incremented the counter, it cannot abort idle and must
> >> >> + * spin until either the count has hit alive_count, or another cpu
> >> >> + * leaves idle.
> >> >> + */
> >> >> +
> >> >> + smp_mb__before_atomic_inc();
> >> >> + atomic_inc(&coupled->ready_count);
> >> >> + smp_mb__after_atomic_inc();
> >> >
> >> > It seems that at least one of these barriers is unnecessary ...
> >> The first is to ensure ordering between ready_count and waiting count,
> >
> > Are you afraid that the test against waiting_count from
> > cpuidle_coupled_cpus_waiting() may get reordered after the incrementation
> > of ready_count or is it something else?
>
> Yes, ready_count must not be incremented before waiting_count == alive_count.
Well, control doesn't reach the atomic_inc() statement if this condition
is not satisfied, so I don't see how it can be possibly reordered before
the while () loop without breaking the control flow guarantees.
> >> the second is for ready_count vs. alive_count and requested_state.
> >
> > This one I can understand, but ...
> >
> >> >> + /* alive_count can't change while ready_count > 0 */
> >> >> + alive = atomic_read(&coupled->alive_count);
> >
> > What happens if CPU hotplug happens right here?
>
> According to the comment above that line that can't happen -
> alive_count can't change while ready_count > 0, because that implies
> that all cpus are waiting and none can be in the hotplug path where
> alive_count is changed. Looking at it again that is not entirely
> true, alive_count could change on systems with >2 cpus, but I think it
> can't cause an issue because alive_count would be 2 greater than
> waiting_count before alive_count was changed. Either way, it will be
> fixed by disabling coupled idle during hotplug.
Yup.
> >> >> + while (atomic_read(&coupled->ready_count) != alive) {
> >> >> + /* Check if any other cpus bailed out of idle. */
> >> >> + if (!cpuidle_coupled_cpus_waiting(coupled)) {
> >> >> + atomic_dec(&coupled->ready_count);
> >> >> + smp_mb__after_atomic_dec();
> >
> > And the barrier here? Even if the old value of ready_count leaks into
> > the while () loop after retry, that doesn't seem to matter.
>
> All of these will be academic if ready_count and waiting_count share
> an atomic_t.
> waiting_count must not be decremented by exiting the while loop after
> the retry label until ready_count is decremented here, but that is
> also protected by the barrier in set_not_waiting. One of them could
> be dropped.
>
> >> >> + goto retry;
> >> >> + }
> >> >> +
> >> >> + cpu_relax();
> >> >> + }
> >> >> +
> >> >> + /* all cpus have acked the coupled state */
> >> >> + smp_rmb();
> >> >
> >> > What is the barrier here for?
> >> This protects ready_count vs. requested_state. It is already
> >> implicitly protected by the atomic_inc_return in set_waiting, but I
> >> thought it would be better to protect it explicitly here. I think I
> >> added the smp_mb__after_atomic_inc above later, which makes this one
> >> superflous, so I'll drop it.
> >
> > OK
> >
> >> >> +
> >> >> + next_state = cpuidle_coupled_get_state(dev, coupled);
> >> >> +
> >> >> + entered_state = cpuidle_enter_state(dev, drv, next_state);
> >> >> +
> >> >> + cpuidle_coupled_set_not_waiting(dev, coupled);
> >> >> + atomic_dec(&coupled->ready_count);
> >> >> + smp_mb__after_atomic_dec();
> >> >> +
> >> >> +out:
> >> >> + /*
> >> >> + * Normal cpuidle states are expected to return with irqs enabled.
> >> >> + * That leads to an inefficiency where a cpu receiving an interrupt
> >> >> + * that brings it out of idle will process that interrupt before
> >> >> + * exiting the idle enter function and decrementing ready_count. All
> >> >> + * other cpus will need to spin waiting for the cpu that is processing
> >> >> + * the interrupt. If the driver returns with interrupts disabled,
> >> >> + * all other cpus will loop back into the safe idle state instead of
> >> >> + * spinning, saving power.
> >> >> + *
> >> >> + * Calling local_irq_enable here allows coupled states to return with
> >> >> + * interrupts disabled, but won't cause problems for drivers that
> >> >> + * exit with interrupts enabled.
> >> >> + */
> >> >> + local_irq_enable();
> >> >> +
> >> >> + /*
> >> >> + * Wait until all coupled cpus have exited idle. There is no risk that
> >> >> + * a cpu exits and re-enters the ready state because this cpu has
> >> >> + * already decremented its waiting_count.
> >> >> + */
> >> >> + while (atomic_read(&coupled->ready_count) != 0)
> >> >> + cpu_relax();
> >> >> +
> >> >> + smp_rmb();
> >> >
> >> > And here?
> >>
> >> This was to protect ready_count vs. looping back in and reading
> >> alive_count.
> >
> > Well, I'm lost. :-)
> >
> > You've not modified anything after the previous smp_mb__after_atomic_dec(),
> > so what exactly is the reordering this is supposed to work against?
> >
> > And while we're at it, I'm not quite sure what the things that the previous
> > smp_mb__after_atomic_dec() separates from each other are.
>
> Instead of justifying all of these, let me try the combined atomic_t
> trick and justify the (many fewer) remaining barriers.
OK, cool! :-)
Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/