Re: rmk/for-next bisection: baseline.login on bcm2836-rpi-2-b

From: Guillaume Tucker
Date: Mon Nov 16 2020 - 17:13:48 EST


On 16/11/2020 12:20, Ard Biesheuvel wrote:
> On Mon, 16 Nov 2020 at 12:20, Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
>>
>> On Sun, 15 Nov 2020 at 15:11, Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
>>>
>>> On Fri, 13 Nov 2020 at 17:25, Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
>>>>
>>>> On Fri, 13 Nov 2020 at 17:15, Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
>>>>>
>>>>> On Fri, 13 Nov 2020 at 16:58, Russell King - ARM Linux admin
>>>>> <linux@xxxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> On Fri, Nov 13, 2020 at 03:43:27PM +0000, Guillaume Tucker wrote:
>>>>>>> On 13/11/2020 10:35, Ard Biesheuvel wrote:
>>>>>>>> On Fri, 13 Nov 2020 at 11:31, Guillaume Tucker
>>>>>>>> <guillaume.tucker@xxxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hi Ard,
>>>>>>>>>
>>>>>>>>> Please see the bisection report below about a boot failure on
>>>>>>>>> RPi-2b.
>>>>>>>>>
>>>>>>>>> Reports aren't automatically sent to the public while we're
>>>>>>>>> trialing new bisection features on kernelci.org but this one
>>>>>>>>> looks valid.
>>>>>>>>>
>>>>>>>>> There's nothing in the serial console log, probably because it's
>>>>>>>>> crashing too early during boot. I'm not sure if other platforms
>>>>>>>>> on kernelci.org were hit by this in the same way, but there
>>>>>>>>> doesn't seem to be any.
>>>>>>>>>
>>>>>>>>> The same regression can be see on rmk's for-next branch as well
>>>>>>>>> as in linux-next. It happens with both bcm2835_defconfig and
>>>>>>>>> multi_v7_defconfig.
>>>>>>>>>
>>>>>>>>> Some more details can be found here:
>>>>>>>>>
>>>>>>>>> https://kernelci.org/test/case/id/5fae44823818ee918adb8864/
>>>>>>>>>
>>>>>>>>> If this looks like a real issue but you don't have a platform at
>>>>>>>>> hand to reproduce it, please let us know if you would like the
>>>>>>>>> KernelCI test to be re-run with earlyprintk or some debug config
>>>>>>>>> turned on, or if you have a fix to try.
>>>>>>>>>
>>>>>>>>> Best wishes,
>>>>>>>>> Guillaume
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hello Guillaume,
>>>>>>>>
>>>>>>>> That patch did have an issue, but it was already fixed by
>>>>>>>>
>>>>>>>> https://www.armlinux.org.uk/developer/patches/viewpatch.php?id=9020/1
>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=fc2933c133744305236793025b00c2f7d258b687
>>>>>>>>
>>>>>>>> Could you please double check whether cherry-picking that on top of
>>>>>>>> the first bad commit fixes the problem?
>>>>>>>
>>>>>>> Sadly this doesn't appear to be fixing the issue. I've
>>>>>>> cherry-picked your patch on top of the commit found by the
>>>>>>> bisection but it still didn't boot, here's the git log
>>>>>>>
>>>>>>> cbb9656e83ca ARM: 9020/1: mm: use correct section size macro to describe the FDT virtual address
>>>>>>> 7a1be318f579 ARM: 9012/1: move device tree mapping out of linear region
>>>>>>> e9a2f8b599d0 ARM: 9011/1: centralize phys-to-virt conversion of DT/ATAGS address
>>>>>>> 3650b228f83a Linux 5.10-rc1
>>>>>>>
>>>>>>> Test log: https://people.collabora.com/~gtucker/lava/boot/rpi-2-b/v5.10-rc1-3-gcbb9656e83ca/
>>>>>>>
>>>>>>> There's no output so it's hard to tell what is going on, but
>>>>>>> reverting the bad commmit does make the board to boot (that's
>>>>>>> what "revert: PASS" means in the bisect report). So it's
>>>>>>> unlikely that there is another issue causing the boot failure.
>>>>>>
>>>>>> These silent boot failures are precisely what the DEBUG_LL stuff (and
>>>>>> early_printk) is supposed to help with - getting the kernel messages
>>>>>> out when there is an oops before the serial console is initialised.
>>>>>>
>>>>>
>>>>> If this is indeed related to the FDT mapping, I would assume
>>>>> earlycon=... to be usable here.
>>>>>
>>>>> I will try to reproduce this on a RPi3 but I don't have a RPi2 at
>>>>> hand, unfortunately.
>>>>>
>>>>> Would you mind having a quick try whether you can reproduce this on
>>>>> QEMU, using the raspi2 machine model? If so, that would be a *lot*
>>>>> easier to diagnose.
>>>>
>>>> Also, please have a go with 'earlycon=pl011,0x3f201000' added to the
>>>> kernel command line.
>>>
>>> I cannot reproduce this - I don't have the exact same hardware, but
>>> for booting the kernel, I think RPi2 and RPi3 should be sufficiently
>>> similar, and I can boot on Rpi3 using a u-boot built for rpi2 using
>>> your provided dtb for RPi2.

There's a RPi 3b in BayLibre's lab and it's booting fine, here in
32-bit mode from the exact same kernel build as the earlier link
showing the problem on RPi 2b:

https://kernelci.org/test/plan/id/5fae3fa710b60de7d2db8859/

>>> What puzzles me is that u-boot reports itself as
>>>
>>> U-Boot 2016.03-rc1-00131-g39af3d8-dirty
>>>
>>> RPI Model B+ (0x10)
>>>
>>> which is the ARMv6 model not the ARMv7, but then the kernel reports
>>>
>>> CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c53c7d

That is rather puzzling indeed. Either the bootloader is wrong,
or we're booting a RPi 1B+ with a RPi 2B device tree... but I
wouldn't expect that to be compatible. I'm pretty sure it is
really a RPI 2B (BCM2836 ARMv7) and for some reason the
bootloader is printing the wrong message. I can try to take a
look at the version of u-boot that was flashed on that board.

>> Another thing I noticed is that the bootloader on these boards loads
>> the FDT at address 0x100, which is described by the FDT itself as
>> reserved memory, and which typically holds the spin tables used for
>> SMP boot.
>>
>> Could you try loading the DT elsewhere, and see if that changes anything?

OK, I think it's worth trying that in any case. I'll see if I
can do it tomorrow. I'll just have to stop kernelci.org tests on
that board while changing the bootloader configuration to avoid
inconsistent behaviour.

> I think I narrowed this down to the early DT mapping code, which
> considers any DT address that falls inside the first section as 'no
> DT', and then relies on the first section mapping of the decompressed
> kernel to cover it instead.
>
> Could you please try the following change?
>
>
> diff --git a/arch/arm/kernel/head.S b/arch/arm/kernel/head.S
> index 28687fd1240a..7f62c5eccdf3 100644
> --- a/arch/arm/kernel/head.S
> +++ b/arch/arm/kernel/head.S
> @@ -265,10 +265,10 @@ __create_page_tables:
> * We map 2 sections in case the ATAGs/DTB crosses a section boundary.
> */
> mov r0, r2, lsr #SECTION_SHIFT
> - movs r0, r0, lsl #SECTION_SHIFT
> + cmp r2, #0
> ldrne r3, =FDT_FIXED_BASE >> (SECTION_SHIFT - PMD_ORDER)
> addne r3, r3, r4
> - orrne r6, r7, r0
> + orrne r6, r7, r0, lsl #SECTION_SHIFT
> strne r6, [r3], #1 << PMD_ORDER
> addne r6, r6, #1 << SECTION_SHIFT
> strne r6, [r3]
>

The kernel is now starting to boot with this change (from the
patch you sent today), but then it fails to load the ramdisk.
Here's the git history I have now:

7d4093dffe16 ARM: head.S: explicitly map DT even if it lives in the first physical section
7a1be318f579 ARM: 9012/1: move device tree mapping out of linear region
e9a2f8b599d0 ARM: 9011/1: centralize phys-to-virt conversion of DT/ATAGS address
3650b228f83a Linux 5.10-rc1

Test log: https://people.collabora.com/~gtucker/lava/boot/rpi-2-b/v5.10-rc1-3-g7d4093dffe16/2830681.log
Plain log: https://people.collabora.com/~gtucker/lava/boot/rpi-2-b/v5.10-rc1-3-g7d4093dffe16/2830681-console.log

Thanks,
Guillaume