Re: [GIT PULL] Thermal-SoC management changes for v5.2-rc1

From: Tomeu Vizoso
Date: Mon May 27 2019 - 07:29:59 EST


On Fri, 24 May 2019 at 15:54, Eduardo Valentin <edubezval@xxxxxxxxx> wrote:
>
> Hello,
>
> On Fri, May 24, 2019 at 10:23:09AM +0200, Tomeu Vizoso wrote:
> > On Fri, 24 May 2019 at 04:40, Eduardo Valentin <edubezval@xxxxxxxxx> wrote:
> > >
> > > On Thu, May 23, 2019 at 11:46:47AM +0200, Tomeu Vizoso wrote:
> > > > Hi Eduardo,
> > > >
> > > > I saw that for 5.1 [0] you included a kernelci boot report for your
> > > > tree, but not for 5.2. Have you found anything that should be improved
> > > > in KernelCI for it to be more useful to maintainers like you?
> > >
> > > Honestly, I take a couple of automated testing as input before sending
> > > my pulls to Linux: (a) my local test, (b) kernel-ci, and (c) 0-day.
> > >
> > > There was really no reason specifically for me to not add the report
> > > from kernelci, except..
> > > >
> > > > [0] https://lore.kernel.org/lkml/20190306161207.GA7365@xxxxxxxxxxxxxxxxxxxxx/
> > > >
> > > > I found about this when trying to understand why the boot on the
> > > > veyron-jaq board has been broken in 5.2-rc1.
> > > >
> > >
> > > I remember a report saying this failed, but from what I could tell from
> > > the boot log, the board booted and hit terminal. But apparently, after
> > > all reports from developers, the veyron-jaq boards were in a hang state.
> > >
> > > That was hard for me to tell from your logs, as they looked like
> > > a regular boot that hits terminal.
> > >
> > > Maybe I should have looked for a specific output of a command you guys
> > > run, saying "successful boot" somewhere?
> >
> > I think what is easiest and clearest is to consider the bisection
> > reports as a very strong indication that something is quite wrong in
> > the branch.
>
> OK. I hear you.
>
> >
> > Because if a board stopped booting and the bisection found a
> > suspicious patch, and reverting it gets the board booting again, then
> > chances are very high that the patch in question broke that boot.
> >
>
>
> Yeah, for sure If I had understood the report properly I could have
> nacked the patch.
>
> > Do you think the wording could be improved to make it clearer? Or
> > maybe some other changes to make all this more useful to maintainers
> > like you?
> >
>
> Well, from my perspective, I need to judge if the failure on your report
> is really related to my changes. Many times, specially on build errors,
> we get failures that are unrelated. Build errors are more straight
> forward do judge. Similarly, we need to find out if a boot issue is
> caused by a change on the branch or something existing. On boot issues
> from kernelci reports, I think the false negatives I have been seeing
> is lab/boards failing to boot. Those can also be easy to spot as the
> in most cases the kernel wont even load.
>
> For this particular case, as I described before, the kernel would
> load and hit the shell command line, but in fact it was in hang state
> IIRC. That is probably why it has not straight forward to understand
> from the log. Maybe a successful boot message somewhere would have
> helped to spot the problem (or the opposite of it, something
> saying, I was expecting to execute a command and board was
> unresponsive).

Ah, I think I see now what you meant. In the boot log linked below,
near the end we have:

22:14:41.981401 ShellCommand command timed out.: Sending # in case of
corruption. Connection timeout 00:04:25, retry in 00:02:12
22:14:42.083594 #

The # character is sent by the LAVA machine that the DUT is connected
to, in the hope that a userspace shell would reply with something.

But the kernel failed to boot to userspace, so we have this at the end:

22:16:54.558087 depthcharge-retry failed: 1 of 1 attempts.
'auto-login-action timed out after 285 seconds'
22:16:54.560479 depthcharge-action failed: 1 of 1 attempts.
'auto-login-action timed out after 285 seconds'
22:16:54.855023 JobError: Your job cannot terminate cleanly.

https://storage.kernelci.org/evalenti/for-kernelci/v5.1-rc6-58-gbe827ffd38ea/arm/multi_v7_defconfig/gcc-8/lab-collabora/boot-rk3288-veyron-jaq.html

Do you think that's the source of the confusion?

Thanks,

Tomeu