Re: [RFC PATCH 2/4] x86, mwaitt: introduce mwaitx idle with a configurable timer

From: Huang Rui
Date: Thu May 21 2015 - 09:32:58 EST


On Tue, May 19, 2015 at 01:31:21PM +0200, Borislav Petkov wrote:
> On Tue, May 19, 2015 at 04:01:10PM +0800, Huang Rui wrote:
> > MWAITX/MWAIT does not let the cpu core go into C1 state on AMD processors.
> > The cpu core still consumes less power while waiting, and has faster exit
> > from waiting than "Halt". This patch implements an interface using the
> > kernel parameter "idle=" to configure mwaitx type and timer value.
> >
> > If "idle=mwaitx", the timeout will be set as the maximum value
> > ((2^64 - 1) * TSC cycle).
> > If "idle=mwaitx,100", the timeout will be set as 100ns.
> > If the processor doesn't support MWAITX, then halt is used.
>
> Ok, I see what you're trying here and I think this is not the optimal
> approach.
>
> So let me explain how I see it, you correct me if I'm wrong:
>
> So we want to do MWAITX so that we can save us idle entry/exit overhead
> with HLT. Because MWAITX is faster, reportedly.
>
> Now, if we want to do that, we want to do it dynamically and adjust the
> MWAITX sleep interval depending on the system, usage pattern, system
> load and so on.

Yes, that's right. Maybe, even we can also find any other use case on
kernel other components.

>
> And for that we would need an adaptive scheme which approximates each
> idle interval. Simply taking TSC before we enter idle and after we come
> out would give us each idle residency duration and we can do some simple
> math to approximate it.
>
> Now, what would that bring us: faster wakeup times.
>
> And here comes the 10^6 $ question: why are we doing all the fun?
>

Do you mention the below codes:

/*
* TSC loops (EBX input) = Timer(nsec) *
* TSC freq(khz) / 1000000
*/
timeout = timeout * tsc_freq;
do_div(timeout, 1000000);

That's because the unit of tsc_freq is KHz and the unit of timeout is
nanosecond. Then I do div 10^6 to figure out the corresponding loops.

> I'm thinking we want to find a cutoff duration where for smaller
> durations it is worth to do MWAITX and have faster entry/exit times and
> for bigger durations we want to do HLT because it'll get into C1E and
> give us higher power savings.
>
> We don't want to do MWAITX too long because that'll burn more power
> relatively to HLT but we don't want to do HLT for shorter periods
> because then entry/exit costs.
>
> Am I on the right track at least?

Yes, that's totally right.

>
> > Signed-off-by: Huang Rui <ray.huang@xxxxxxx>
> > ---
> > arch/x86/include/asm/mwait.h | 2 +
> > arch/x86/include/asm/processor.h | 2 +-
> > arch/x86/kernel/process.c | 79 ++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 82 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
> > index b91136f..c4e51e7 100644
> > --- a/arch/x86/include/asm/mwait.h
> > +++ b/arch/x86/include/asm/mwait.h
> > @@ -14,6 +14,8 @@
> > #define CPUID5_ECX_INTERRUPT_BREAK 0x2
> >
> > #define MWAIT_ECX_INTERRUPT_BREAK 0x1
> > +#define MWAITX_ECX_TIMER_ENABLE 0x2
>
> Use BIT(1) here.

OK.

>
> > +#define MWAITX_EBX_WAIT_TIMEOUT 0xffffffff
> >
> > static inline void __monitor(const void *eax, unsigned long ecx,
> > unsigned long edx)
> > diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> > index 23ba676..0f60e94 100644
> > --- a/arch/x86/include/asm/processor.h
> > +++ b/arch/x86/include/asm/processor.h
> > @@ -733,7 +733,7 @@ extern unsigned long boot_option_idle_override;
> > extern bool amd_e400_c1e_detected;
> >
> > enum idle_boot_override {IDLE_NO_OVERRIDE=0, IDLE_HALT, IDLE_NOMWAIT,
> > - IDLE_POLL};
> > + IDLE_POLL, IDLE_MWAITX};
> >
> > extern void enable_sep_cpu(void);
> > extern int sysenter_setup(void);
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index 6e338e3..9d68193 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -30,6 +30,7 @@
> > #include <asm/debugreg.h>
> > #include <asm/nmi.h>
> > #include <asm/tlbflush.h>
> > +#include <asm/x86_init.h>
> >
> > /*
> > * per-CPU TSS segments. Thre ads are completely 'soft' on Linux,
> > @@ -276,6 +277,7 @@ unsigned long boot_option_idle_override = IDLE_NO_OVERRIDE;
> > EXPORT_SYMBOL(boot_option_idle_override);
> >
> > static void (*x86_idle)(void);
> > +static unsigned long idle_param;
> >
> > #ifndef CONFIG_SMP
> > static inline void play_dead(void)
> > @@ -444,6 +446,17 @@ static int prefer_mwait_c1_over_halt(const struct cpuinfo_x86 *c)
> > return 1;
> > }
> >
> > +static int not_support_mwaitx(const struct cpuinfo_x86 *c)
> > +{
> > + if (c->x86_vendor != X86_VENDOR_AMD)
> > + return 1;
> > +
> > + if (!cpu_has(c, X86_FEATURE_MWAITT))
> > + return 1;
> > +
> > + return 0;
> > +}
> > +
> > /*
> > * MONITOR/MWAIT with no hints, used for default default C1 state.
> > * This invokes MWAIT with interrutps enabled and no flags,
> > @@ -470,12 +483,45 @@ static void mwait_idle(void)
> > __current_clr_polling();
> > }
> >
> > +/*
> > + * AMD Excavator processors support the new MONITORX/MWAITX instructions.
>
> No need for that especially when newer than XV processors start
> supporting those too.
>

OK, got it.

Thanks,
Rui
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/