Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path

From: Thomas Gleixner
Date: Fri Nov 17 2017 - 06:36:54 EST


On Fri, 17 Nov 2017, Quan Xu wrote:
> On 2017-11-16 17:53, Thomas Gleixner wrote:
> > That's just plain wrong. We don't want to see any of this PARAVIRT crap in
> > anything outside the architecture/hypervisor interfacing code which really
> > needs it.
> >
> > The problem can and must be solved at the generic level in the first place
> > to gather the data which can be used to make such decisions.
> >
> > How that information is used might be either completely generic or requires
> > system specific variants. But as long as we don't have any information at
> > all we cannot discuss that.
> >
> > Please sit down and write up which data needs to be considered to make
> > decisions about probabilistic polling. Then we need to compare and contrast
> > that with the data which is necessary to make power/idle state decisions.
> >
> > I would be very surprised if this data would not overlap by at least 90%.
> >
> 1. which data needs to considerd to make decisions about probabilistic polling
>
> I really need to write up which data needs to considerd to make
> decisions about probabilistic polling. At last several months,
> I always focused on the data _from idle to reschedule_, then to bypass
> the idle loops. unfortunately, this makes me touch scheduler/idle/nohz
> code inevitably.
>
> with tglx's suggestion, the data which is necessary to make power/idle
> state decisions, is the last idle state's residency time. IIUC this data
> is duration from idle to wakeup, which maybe by reschedule irq or other irq.

That's part of the picture, but not complete.

> I also test that the reschedule irq overlap by more than 90% (trace the
> need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for
> one minute.
>
> as the overlap, I think I can input the last idle state's residency time
> to make decisions about probabilistic polling, as @dev->last_residency does.
> it is much easier to get data.

That's only true for your particular use case.

>
> 2. do a HV specific idle driver (function)
>
> so far, power management is not exposed to guest.. idle is simple for KVM
> guest,
> calling "sti" / "hlt"(cpuidle_idle_call() --> default_idle_call())..
> thanks Xen guys, who has implemented the paravirt framework. I can implement
> it
> as easy as following:
>
>              --- a/arch/x86/kernel/kvm.c

Your email client is using a very strange formatting.

>              +++ b/arch/x86/kernel/kvm.c
>              @@ -465,6 +465,12 @@ static void __init kvm_apf_trap_init(void)
>                      update_intr_gate(X86_TRAP_PF, async_page_fault);
>               }
>
>              +static __cpuidle void kvm_safe_halt(void)
>              +{
>          +        /* 1. POLL, if need_resched() --> return */
>          +
>              +        asm volatile("sti; hlt": : :"memory"); /* 2. halt */
>              +
>          +        /* 3. get the last idle state's residency time */
>              +
>          +        /* 4. update poll duration based on last idle state's
> residency time */
>              +}
>              +
>               void __init kvm_guest_init(void)
>               {
>                      int i;
>              @@ -490,6 +496,8 @@ void __init kvm_guest_init(void)
>                      if (kvmclock_vsyscall)
>                              kvm_setup_vsyscall_timeinfo();
>
>              +       pv_irq_ops.safe_halt = kvm_safe_halt;
>              +
>               #ifdef CONFIG_SMP
>
>
> then, I am no need to introduce a new pvops, and never modify
> schedule/idle/nohz code again.
> also I can narrow all of the code down in arch/x86/kernel/kvm.c.
>
> If this is in the right direction, I will send a new patch set next week..

This is definitely better than what you proposed so far and implementing it
as a prove of concept seems to be worthwhile.

But I doubt that this is the final solution. It's not generic and not
necessarily suitable for all use case scenarios.

Thanks,

tglx