Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline

From: Gautham R Shenoy
Date: Wed Sep 18 2019 - 08:31:03 EST


Hello Nathan, Michael,

On Tue, Sep 17, 2019 at 12:36:35PM -0500, Nathan Lynch wrote:
> Gautham R Shenoy <ego@xxxxxxxxxxxxxxxxxx> writes:
> > On Thu, Sep 12, 2019 at 10:39:45AM -0500, Nathan Lynch wrote:
> >> "Gautham R. Shenoy" <ego@xxxxxxxxxxxxxxxxxx> writes:
> >> > The patchset also defines a new sysfs attribute
> >> > "/sys/devices/system/cpu/cede_offline_enabled" on PSeries Linux guests
> >> > to allow userspace programs to change, at runtime, the state into
> >> > which offlined CPUs are put.
> >>
> >> A boolean sysfs interface will become awkward if we need to add another
> >> mode in the future.
> >>
> >> What do you think about naming the attribute something like
> >> 'offline_mode', with the possible values 'extended-cede' and
> >> 'rtas-stopped'?
> >
> > We can do that. However, IMHO in the longer term, on PSeries guests,
> > we should have only one offline state - rtas-stopped. The reason for
> > this being, that on Linux, SMT switch is brought into effect through
> > the CPU Hotplug interface. The only state in which the SMT switch will
> > be recognized by hypervisors such as PHYP is rtas-stopped.
>
> OK. Why "longer term" though, instead of doing it now?

Because adding extended-cede as a cpuidle state is non-trivial, since
a CPU in that state does not respond to external interrupts. We will
need additional changes in the IPI, timer and interrupt code to ensure
that these get translated to an H_PROD in order to wake up the target
CPU in extended CEDE.

Timer: This is relatively easy, since the cpuidle infrastructure has
the timer-offload framework (used for fastsleep in POWER8) where we
can offload the timers of an idling CPU to another CPU, which
can wake up the idling CPU via an IPI when the timer expires.

IPIs: We need to ensure that icp_hv_set_qirr() correctly sends H_IPI
or H_PROD depending on whether or not the target CPU is in
extended CEDE.
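
As a rough illustration of the IPI case, here is a minimal sketch in
plain userspace C (not kernel code). The per-CPU flag array and the
function name are hypothetical and exist only for this example; a real
change would also have to close the race between the target CPU
entering extended CEDE and the sender reading its state.

```c
#include <stdbool.h>

#define NR_CPUS 8

/* Hypothetical per-CPU state; not an existing kernel variable. */
bool in_extended_cede[NR_CPUS];

enum wake_method { WAKE_H_IPI, WAKE_H_PROD };

/*
 * Pick the hcall used to reach the target CPU: a CPU in extended CEDE
 * does not see external interrupts, so it has to be prodded instead.
 */
enum wake_method pick_wake_method(int cpu)
{
	return in_extended_cede[cpu] ? WAKE_H_PROD : WAKE_H_IPI;
}
```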

Interrupts: Either we migrate the interrupts away from the CPU that is
entering extended CEDE or we prevent a CPU that is the
sole target for an interrupt from entering extended CEDE.
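
The second option could look roughly like the check below. This is
only a sketch in userspace C under simplifying assumptions: each
interrupt's affinity is modelled as a 64-bit mask (so cpu < 64), and
the function name and the affinity array are made up for the example
rather than taken from the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return false if any interrupt is targeted at this CPU alone, i.e.
 * its affinity mask contains only this CPU's bit. Such a CPU must not
 * enter extended CEDE, or the interrupt would never be serviced.
 */
bool may_enter_extended_cede(unsigned int cpu,
			     const uint64_t *irq_affinity, size_t nr_irqs)
{
	uint64_t self = UINT64_C(1) << cpu;
	size_t i;

	for (i = 0; i < nr_irqs; i++)
		if (irq_affinity[i] == self)
			return false;
	return true;
}
```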

The accounting problem in tools such as lparstat with
"cede_offline=on" is affecting customers who are using these tools for
capacity-planning. That problem needs a fix in the short-term, for
which Patch 1 changes the default behaviour of cede_offline from "on"
to "off". Since this patch would break the existing userspace tools
that use the CPU-Offline infrastructure to fold CPUs for saving power,
the sysfs interface allowing a runtime change of cede_offline_enabled
was provided to enable these userspace tools to cope with minimal
change.

>
>
> > All other states (such as extended-cede) should in the long-term be
> > exposed via the cpuidle interface.
> >
> > With this in mind, I made the sysfs interface boolean to mirror the
> > current "cede_offline" commandline parameter. Eventually when we have
> > only one offline-state, we can deprecate the commandline parameter as
> > well as the sysfs interface.
>
> I don't care for adding a sysfs interface that is intended from the
> beginning to become vestigial...

Fair point. Come to think of it, in case the cpuidle menu governor
behaviour doesn't match the expectations provided by the current
userspace solutions for folding idle CPUs for power-savings, it would
be useful to have this option around so that existing users who prefer
the userspace solution can still have that option.

>
> This strikes me as unnecessarily incremental if you're changing the
> default offline state. Any user space programs depending on the current
> behavior will have to change anyway (and why is it OK to break them?)
>

Yes, the current userspace programs will need to be modified to check
for the sysfs interface and set cede_offline_enabled to 1.
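
For what it is worth, the change on the userspace side should be
small. A minimal sketch, assuming the attribute proposed in this
patchset lives under the usual CPU sysfs directory and accepts "0" or
"1" (the helper name is made up for this example):

```c
#include <stdio.h>
#include <unistd.h>

/* Attribute proposed by this patchset; absent on kernels without it. */
#define CEDE_OFFLINE_ATTR "/sys/devices/system/cpu/cede_offline_enabled"

/*
 * Write "1" or "0" to the given attribute file.
 * Returns 0 on success, -1 if the file is missing, not writable,
 * or the write fails.
 */
int set_cede_offline(const char *path, int enable)
{
	FILE *f;
	int ok;

	if (access(path, W_OK) != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs(enable ? "1\n" : "0\n", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A tool would call set_cede_offline(CEDE_OFFLINE_ATTR, 1) before
offlining CPUs and carry on as before if it returns -1, since that
means the kernel does not provide the attribute.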

> Why isn't the plan:
>
> 1. Add extended cede support to the pseries cpuidle driver
> 2. Make stop-self the only cpu offline state for pseries (no sysfs
> interface necessary)

This is the plan, except that step 1 requires some additional work, and
this patchset is proposed as a short-term mitigation until we get
step 1 right.

>
> ?

--
Thanks and Regards
gautham.