Re: [PATCH 1/3] clocksource/mips-gic-timer: Fix rcu_sched timeouts from multithreading

From: Thomas Gleixner
Date: Thu Oct 19 2017 - 04:22:27 EST


On Thu, 19 Oct 2017, Matt Redfearn wrote:
> On 18/10/17 21:34, Thomas Gleixner wrote:
> > On Wed, 11 Oct 2017, Matt Redfearn wrote:
> > > Secondly, the fixed min delta ignores the fact that with MIPS
> > > multithreading active, execution resource within a core is shared
> > > between the hardware threads within that core. An inconveniently timed
> > > switch of executing thread within gic_next_event, between the read and
> > > write of updated count, can result in the CPU writing an event in the
> > > past, and subsequently not receiving a tick interrupt until the counter
> > > wraps. This stalls the CPU from the RCU scheduler. Other CPUs detect
> > > this and print rcu_sched timeout messages in the kernel log. It can
> > > lead to other issues as well if the CPU is holding locks or other
> > > resources at the point at which it stalls. Fix this by scaling the min
> > > delta for the timer based on the number of threads in the core
> > > (smp_num_siblings). This accounts for the greater average runtime of
> > > CPUs within a multithreading core.
> >
> > I don't understand why this is not caught by the check at the end of the
> > next_event() function:
> >
> > res = ((int)(gic_read_count() - cnt) >= 0) ? -ETIME : 0;
> >
> > Btw, the local_irq_save() in this function is pointless as this function is
> > always called with interrupts disabled from the core code.
>
> This is an issue because in some cases (hrtimer_reprogram ->
> clockevents_program_event -> clockevents_program_min_delta, when
> CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=n) there is no retry performed in the
> case of -ETIME. There has been a patch pending for some time
> https://patchwork.kernel.org/patch/8909491/ which ought to address this and
> retry in the case of an event in the past on this call path. But in the
> meantime this patch vastly improves the situation.

I somehow missed that one. Care to repost it, so we get this solved at the
place where it should be solved?
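
For reference, a minimal user-space sketch of the check quoted above and of
a caller that retries on -ETIME. This is not the driver code and not the
pending patch: the sim_* names, the 32-bit simulated counter, the "stall"
knob standing in for a sibling thread running between the read and the
write, and the delta-doubling retry policy are all assumptions made for
illustration only.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the hardware: a free-running counter, the
 * programmed compare value, and the number of ticks that elapse between
 * reading the counter and writing the compare register. */
static uint32_t sim_count;
static uint32_t sim_compare;
static uint32_t sim_stall;

static uint32_t sim_read_count(void)
{
	return sim_count;
}

/* Same shape as the quoted check: program "now + delta", then return
 * -ETIME if the counter has already reached or passed that value.
 * The signed cast of the unsigned difference keeps the comparison
 * correct across counter wrap-around. */
static int sim_next_event(uint32_t delta)
{
	uint32_t cnt = sim_read_count() + delta;

	sim_count += sim_stall;		/* time lost before the write */
	sim_compare = cnt;

	return ((int32_t)(sim_read_count() - cnt) >= 0) ? -ETIME : 0;
}

/* Illustrative retry policy (an assumption): on -ETIME, double the
 * delta and try again, giving up after max_tries attempts. */
static int sim_program_with_retry(uint32_t delta, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		if (sim_next_event(delta) == 0)
			return 0;
		delta *= 2;
	}
	return -ETIME;
}
```

The point of the sketch is only that the check does detect the event in the
past; whether the tick is lost depends on whether the caller acts on -ETIME.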

Thanks,

tglx