Re: [LKP] Re: [x86/mce] 1de08dccd3: will-it-scale.per_process_ops -14.1% regression

From: Feng Tang
Date: Sun Aug 30 2020 - 22:16:50 EST


On Fri, Aug 28, 2020 at 07:48:39PM +0200, Borislav Petkov wrote:
> On Tue, Aug 25, 2020 at 02:23:05PM +0800, Feng Tang wrote:
> > Also one good news is, we seem to identify the 2 key percpu variables
> > out of the list mentioned in previous email:
> > 'arch_freq_scale'
> > 'tsc_adjust'
> >
> > These 2 variables are accessed in 2 hot call stacks (for this 288 CPU
> > Xeon Phi platform):
> >
> > - arch_freq_scale is accessed in scheduler tick
> > arch_scale_freq_tick+0xaf/0xc0
> > scheduler_tick+0x39/0x100
> > update_process_times+0x3c/0x50
> > tick_sched_handle+0x22/0x60
> > tick_sched_timer+0x37/0x70
> > __hrtimer_run_queues+0xfc/0x2a0
> > hrtimer_interrupt+0x122/0x270
> > smp_apic_timer_interrupt+0x6a/0x150
> > apic_timer_interrupt+0xf/0x20
> >
> > - tsc_adjust is accessed in idle entrance
> > tsc_verify_tsc_adjust+0xeb/0xf0
> > arch_cpu_idle_enter+0xc/0x20
> > do_idle+0x91/0x280
> > cpu_startup_entry+0x19/0x20
> > start_kernel+0x4f4/0x516
> > secondary_startup_64+0xb6/0xc0
> >
> > From systemmap file, for bad kernel these 2 sit in one cache line, while
> > for good kernel they sit in 2 separate cache lines.
> >
> > It also explains why it turns from a regression to an improvement with
> > updated gcc/kconfig, as the cache line sharing situation is reversed.
> >
> > The direct patch I can think of is to make 'tsc_adjust' cache aligned
> > to separate these 2 'hot' variables. How do you think?
> >
> > --- a/arch/x86/kernel/tsc_sync.c
> > +++ b/arch/x86/kernel/tsc_sync.c
> > @@ -29,7 +29,7 @@ struct tsc_adjust {
> > bool warned;
> > };
> >
> > -static DEFINE_PER_CPU(struct tsc_adjust, tsc_adjust);
> > +static DEFINE_PER_CPU_ALIGNED(struct tsc_adjust, tsc_adjust);
>
> So why don't you define both variables with DEFINE_PER_CPU_ALIGNED and
> check if all your bad measurements go away this way?

For 'arch_freq_scale', there are other percpu variables in the same
smpboot.c: 'arch_prev_aperf' and 'arch_prev_mperf', and in hot path
arch_scale_freq_tick(), these 3 variables are all accessed, so I didn't
touch it. Or maybe we can align the first of these 3 variables, so
that they sit in one cacheline.

> You'd also need to check whether there's no detrimental effect from
> this change on other, i.e., !KNL platforms, and I think there won't
> be because both variables will be in separate cachelines then and all
> should be good.

Yes, these kind of changes should be verified on other platforms.

One thing still puzzles me, that the 2 variables are per-cpu things, and
there is no case of many CPU contending, why the cacheline layout matters?
I doubt it is due to the contention of the same cache set, and am trying
to find some way to test it.

Thanks,
Feng

> Hmm?
>
> --
> Regards/Gruss,
> Boris.
>
> SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg