Re: [PATCH] mailbox: cancel timer before starting it

From: Da Xue
Date: Tue Jul 19 2022 - 03:08:34 EST


On Fri, Oct 16, 2020 at 3:27 PM Jassi Brar <jassisinghbrar@xxxxxxxxx> wrote:
>
> On Fri, Oct 16, 2020 at 1:54 PM Jerome Brunet <jbrunet@xxxxxxxxxxxx> wrote:
> >
> >
> > On Fri 16 Oct 2020 at 19:33, Jassi Brar <jassisinghbrar@xxxxxxxxx> wrote:
> >
> > > On Fri, Oct 16, 2020 at 4:00 AM Jerome Brunet <jbrunet@xxxxxxxxxxxx> wrote:
> > >>
> > >>
> > >> On Fri 16 Oct 2020 at 10:52, Ionela Voinescu <ionela.voinescu@xxxxxxx> wrote:
> > >>
> > >> > On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
> > >> > [..]
> > >> >> > >> --- a/drivers/mailbox/mailbox.c
> > >> >> > >> +++ b/drivers/mailbox/mailbox.c
> > >> >> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> > >> >> > >> exit:
> > >> >> > >> spin_unlock_irqrestore(&chan->lock, flags);
> > >> >> > >>
> > >> >> > >> - if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> > >> >> > >> - /* kick start the timer immediately to avoid delays */
> > >> >> > >> + if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> > >> >> > >> + /* Disable the timer if already active ... */
> > >> >> > >> + hrtimer_cancel(&chan->mbox->poll_hrt);
> > >> >> > >> +
> > >> >> > >> + /* ... and kick start it immediately to avoid delays */
> > >> >> > >> hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> > >> >> > >> + }
> > >> >> > >> }
> > >> >> > >>
> > >> >> > >> static void tx_tick(struct mbox_chan *chan, int r)
> > >> >> > >
> > >> >> > > I've tracked a regression back to this commit. Details to reproduce:
> > >> >> >
> > >> >> > Hi Ionela,
> > >> >> >
> > >> >> > I don't have access to your platform and I don't get what is going on
> > >> >> > from the log below.
> > >> >> >
> > >> >> > Could you please give us a bit more details about what is going on ?
> > >> >> >
> > >> >> > All this patch does is add hrtimer_cancel().
> > >> >> > * It is needed if the timer had already been started, which is
> > >> >> > appropriate AFAIU
> > >> >> > * It is a NO-OP is the timer is not active.
> > >> >> >
> > >> >> Can you please try using hrtimer_try_to_cancel() instead ?
> > >> >>
> > >> >
> > >> > Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
> > >> > this limit how effective this change is? AFAIU, this will possibly only
> > >> > reduce the chances for the race condition, but not solve it.
> > >> >
> > >>
> > >> It is also my understanding, hrtimer_try_to_cancel() would remove a
> > >> timer which as not already started but would return withtout doing
> > >> anything if the callback is already running ... which is the original
> > >> problem
> > >>
> > > If we are running in the callback path, hrtimer_try_to_cancel will
> > > return -1, in which case we could skip hrtimer_start.
> > > Anyways, I think simply checking for hrtimer_active should effect the same.
> > > I have submitted a patch, of course not tested.
> >
> > Yes it sloves this race but ...
> >
> Thanks for confirmation.
>
> > If a race is possible between a timer callback rescheduling itself (which
> > is not that uncommon) and another thread trying to cancel it
> >
> In our case, we should not be cancelling+restarting the timer in the
> first place, because txdone_hrtimer will take care of it via
> hrtimer_forward_now.
>
> >, maybe
> > there is something worth fixing in hrtimer ? Also, mailbox calls
> > hrtimer_cancel() in unregister ... are we confident this would work ?
> >
> Yes. After unregister() every channel is supposed to die and so must
> its resources.
>
> -jassi

Hi,

I'm running into an issue where I get "rcu: INFO: rcu_preempt detected
stalls on CPUs/tasks" when booting Linux distributions once every 5 or
so reboots. The system sticks on systemd starting for ~20 seconds
before rcu_preempt prints a message. This only happens on Amlogic (I
tested S805X/S905X/S905D) and I do not get this problem on my
Allwinner or Rockchip boards with the same kernel and image. I have
tried 5.15, 5.18, and 5.19. I dumped a bunch of logs here from the
three kernel versions but the content is the same.
https://github.com/libre-computer-project/libretech-linux/issues/5

+ mount -n -o move /sys /root/sys
+ mount -n -o move /proc /root/proc
+ exec run-init /root /sbin/init
[ 26.485689] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 26.486217] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-3): P1
[ 26.492332] (detected by 3, t=5252 jiffies, g=-315, q=546)
[ 26.497846] task:run-init state:R running task stack:
0 pid: 1 ppid: 0 flags:0x00000004
[ 26.507719] Call trace:
[ 26.510063] dump_backtrace.part.6+0xdc/0xe8
[ 26.514306] show_stack+0x18/0x68
[ 26.517574] sched_show_task+0x154/0x198
[ 26.521461] rcu_sched_clock_irq+0xd10/0xdd8
[ 26.525690] update_process_times+0x9c/0xd0
[ 26.529829] tick_sched_handle.isra.19+0x34/0x58
[ 26.534405] tick_sched_timer+0x4c/0xa8
[ 26.538192] __hrtimer_run_queues+0x120/0x1b8
[ 26.542510] hrtimer_interrupt+0xd8/0x238
[ 26.546474] arch_timer_handler_phys+0x2c/0x50
[ 26.550877] handle_percpu_devid_irq+0x84/0x138
[ 26.555363] generic_handle_domain_irq+0x2c/0x48
[ 26.559935] gic_handle_irq+0xa4/0xc0
[ 26.563549] call_on_irq_stack+0x2c/0x58
[ 26.567432] do_interrupt_handler+0x80/0x88
[ 26.571574] el1_interrupt+0x38/0x70
[ 26.575105] el1h_64_irq_handler+0x18/0x28
[ 26.582437] el1h_64_irq+0x64/0x68
[ 26.585712] xas_descend+0x0/0x88
[ 26.588947] xas_find+0x190/0x1d8
[ 26.592182] find_lock_entries+0x7c/0x280
[ 26.596121] truncate_inode_pages_range+0xa8/0x490
[ 26.600843] truncate_inode_pages_final+0x58/0x78
[ 26.605453] evict+0x144/0x150
[ 26.608471] iput+0x128/0x190
[ 26.611442] do_unlinkat+0x17c/0x2a8
[ 26.614817] __arm64_sys_unlinkat+0x40/0x88
[ 26.618928] invoke_syscall+0x44/0x100
[ 26.622599] el0_svc_common.constprop.3+0x6c/0xf0
[ 26.627237] do_el0_svc+0x24/0x88
[ 26.630473] el0_svc+0x20/0x60
[ 26.633447] el0t_64_sync_handler+0x90/0xb8
[ 26.637562] el0t_64_sync+0x170/0x174
[ 33.757714] VCC_CARD: disabling

I have turned on systemd debug and it does not print anything. The
message is the same every time and I rebuilt the filesystem multiple
times with both ext4 and btrfs roots across the devices.

I'm not sure if this is an extension or corner case of the origin
hr_timer problem. Any ideas?

Best,
Da