Re: [Problem] Cache line starvation

From: Kurt Kanzenbach
Date: Thu Sep 27 2018 - 10:25:57 EST


Hi Will,

On Wed, Sep 26, 2018 at 01:53:02PM +0100, Will Deacon wrote:
> Hi all,
>
> On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> > We reproducibly observe cache line starvation on a Core2Duo E6850 (2
> > cores), an i5-6400 SKL (4 cores) and on an NXP LS2044A ARM Cortex-A72 (4
> > cores).
> >
> > Instrumentation always shows the same picture:
> >
> > CPU0                                  CPU1
> > => do_syscall_64                      => do_syscall_64
> > => SyS_ptrace                         => syscall_slow_exit_work
> > => ptrace_check_attach                => ptrace_do_notify / rt_read_unlock
> > => wait_task_inactive                 rt_spin_lock_slowunlock()
> > -> while task_running()               __rt_mutex_unlock_common()
> > /  check_task_state()                 mark_wakeup_next_waiter()
> > |  raw_spin_lock_irq(&p->pi_lock);    raw_spin_lock(&current->pi_lock);
> > |  .                                  .
> > |  raw_spin_unlock_irq(&p->pi_lock);  .
> > \  cpu_relax()                        .
> > -                                     .
> > *IRQ*                                 <lock acquired>
> >
> > In the error case we observe that the while() loop is repeated more than
> > 5000 times which indicates that the pi_lock can be acquired. CPU1 on the
> > other side does not make progress waiting for the same lock with interrupts
> > disabled.
> >
> > This continues until an IRQ hits CPU0. Once CPU0 starts processing the IRQ
> > the other CPU is able to acquire pi_lock and the situation relaxes.
> >
> > Peter suggested to do a clwb(&p->pi_lock); before the cpu_relax() in
> > wait_task_inactive() which on both the Core2Duo and the SKL gets runtime
> > patched to clflush(). That hides it as well.
>
> Given the broadcast nature of cache-flushing, I'd be pretty nervous about
> adding it on anything other than a case-by-case basis. That doesn't sound
> like something we'd want to maintain... It would also be interesting to know
> whether the problem is actually before the cache (i.e. if the lock actually
> sits in the store buffer on CPU0). Does MFENCE/DSB after the unlock() help at
> all?
>
> We've previously seen something similar to this on arm64 in big/little
> systems where the big cores can loop around and re-take a spinlock before
> the little guys can get in the queue or take a ticket. I bodged that in
> cpu_relax(), but there's a magic heuristic which I couldn't figure out how
> to specify:
>
> https://lkml.org/lkml/2017/7/28/172
>
> For A72 (which is the core I think you're using) it would be interesting to
> try both:
>
> (1) Removing the prfm instruction from spin_lock(), and
> (2) Setting bit 42 of CPUACTLR_EL1 on each CPU (probably needs a
> firmware change)

Correct, we use the Cortex-A72.

I followed your suggestions. I've removed the prefetch instructions from
the spin lock implementation in the v4.9 kernel. In addition, I've
modified armv8/start.S in U-Boot to set bit 42 in CPUACTLR_EL1
(S3_1_c15_c2_0). We've also made sure that this bit is actually written
for each CPU by reading the register value back in the kernel.
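
For illustration, the read-back check can be sketched roughly as below.
This is a minimal sketch and not the actual change: the helper names are
hypothetical, and the mrs access is only compiled for an AArch64 kernel
build, since the register is not accessible from user space.

```c
#include <stdint.h>

/* Bit 42 of CPUACTLR_EL1, the bit suggested in the quoted mail. */
#define CPUACTLR_BIT42 (UINT64_C(1) << 42)

/* Hypothetical helper: returns 1 if bit 42 is set in a raw register value. */
static inline int cpuactlr_bit42_is_set(uint64_t val)
{
	return (val & CPUACTLR_BIT42) != 0;
}

/* Hypothetical helper: the value a boot loader would write back. */
static inline uint64_t cpuactlr_set_bit42(uint64_t val)
{
	return val | CPUACTLR_BIT42;
}

#if defined(__aarch64__) && defined(__KERNEL__)
/* Privileged read of CPUACTLR_EL1 (S3_1_c15_c2_0); traps at EL0. */
static inline uint64_t read_cpuactlr_el1(void)
{
	uint64_t val;

	asm volatile("mrs %0, S3_1_c15_c2_0" : "=r" (val));
	return val;
}
#endif
```

In a kernel build, the check would be run once on every CPU and would
pass the result of read_cpuactlr_el1() to cpuactlr_bit42_is_set().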

However, the issue still triggers. With stress-ng we're able to
generate latencies in the millisecond range. The only workaround we've
found so far is to add a "delay" in cpu_relax().
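
To show the shape of that workaround, here is a rough user-space sketch.
It is not the kernel change itself: demo_lock, cpu_relax_delayed(),
poll_until_clear() and RELAX_DELAY_SPINS are all hypothetical stand-ins,
and the delay length that is actually needed is an assumption.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical delay length; a real value would be platform specific. */
#define RELAX_DELAY_SPINS 64

/* Stand-in for cpu_relax() with an added busy-wait delay. */
static inline void cpu_relax_delayed(void)
{
	for (volatile unsigned int i = 0; i < RELAX_DELAY_SPINS; i++)
		;
}

/* Simple test-and-set lock standing in for pi_lock. */
static atomic_flag demo_lock = ATOMIC_FLAG_INIT;

static inline void demo_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&demo_lock,
						 memory_order_acquire))
		;
}

static inline void demo_lock_release(void)
{
	atomic_flag_clear_explicit(&demo_lock, memory_order_release);
}

/*
 * Polling loop shaped like the one in wait_task_inactive(): take the
 * lock, sample the state, drop the lock, then back off before retrying.
 * The back-off gives another CPU a window to acquire the lock.
 * Returns the number of back-offs performed, capped at max_polls.
 */
static unsigned int poll_until_clear(atomic_bool *running,
				     unsigned int max_polls)
{
	unsigned int polls = 0;

	while (polls < max_polls) {
		bool still_running;

		demo_lock_acquire();
		still_running = atomic_load(running);
		demo_lock_release();

		if (!still_running)
			break;

		polls++;
		cpu_relax_delayed();
	}
	return polls;
}
```

Without the delay, the polling CPU can drop and re-take the lock so
quickly that the waiter never gets the cache line.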

Any ideas what else we could test?

Thanks,
Kurt

>
> That should prevent the lock() operation from speculatively pulling in the
> cacheline in a unique state.
>
> More recent Arm CPUs have atomic instructions which, apart from CAS,
> *should* avoid this starvation issue entirely.
>
> Will
>