Re: qemu:metag image runtime failure in -next due to 'kthread: allow to cancel kthread work'

From: Kees Cook
Date: Mon Sep 19 2016 - 15:53:57 EST


On Mon, Sep 19, 2016 at 6:59 AM, James Hogan <james.hogan@xxxxxxxxxx> wrote:
> On Fri, Sep 16, 2016 at 07:58:12PM -0700, Kees Cook wrote:
>> On Fri, Sep 16, 2016 at 4:32 PM, James Hogan <james.hogan@xxxxxxxxxx> wrote:
>> > On Fri, Sep 16, 2016 at 02:37:18PM -0700, Guenter Roeck wrote:
>> >> On Fri, Sep 16, 2016 at 10:27:20PM +0100, James Hogan wrote:
>> >> > On Fri, Sep 16, 2016 at 01:38:19PM -0700, Guenter Roeck wrote:
>> >> > > Hi,
>> >> > >
>> >> > > I see the following runtime error in -next when running a metag qemu emulation.
>> >> > >
>> >> > > [ ... ]
>> >> > > workingset: timestamp_bits=30 max_order=16 bucket_order=0
>> >> > > io scheduler noop registered (default)
>> >> > > brd: module loaded
>> >> > > Warning: unable to open an initial console.
>> >> > > List of all partitions:
>> >> > > 0100 16384 ram0 (driver?)
>> >> > > No filesystem could mount root, tried:
>> >> > > Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(1,0)
>> >> > >
>> >> > > An example for a complete log is at:
>> >> > > http://kerneltests.org/builders/qemu-metag-next/builds/489/steps/qemubuildcommand/logs/stdio
>> >> > >
>> >> > > bisect points to commit ef98de028afde ("kthread: allow to cancel kthread work").
>> >> > > I don't know (yet) if other architectures are affected. bisect log is attached.
>> >> > >
>> >> > > The scripts to run this test are available at
>> >> > > https://github.com/groeck/linux-build-test/tree/master/rootfs/metag.
>> >> > >
>> >> > > Guenter
>> >> >
>> >> > Thanks Guenter,
>> >> >
>> >> > It appears to be related to the command line. After that commit the
>> >> > command line is shown as empty (rather than your "rdinit=/sbin/init
>> >> > doreboot"), but it can still be overridden in the config and then it
>> >> > continues to work.
>> >> >
>> >> Weird.
>> >
>> > Previously the Elf had a single load program header:
>> >
>> > Program Headers:
>> > Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
>> > LOAD 0x004000 0x40000000 0x40000000 0x34b304 0x376230 RWE 0x4000
>> > NOTE 0x1ecc08 0x401e8c08 0x401e8c08 0x00000 0x00000 R 0x4
>> >
>> > QEMU puts the args at 40376230, straight after the load region (unlike a
>> > real Meta Linux bootloader).
>> >
>> > After the above commit the ELF gets two load program headers, with a
>> > small alignment gap in the middle:
>> >
>> > Program Headers:
>> > Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
>> > LOAD 0x004000 0x40000000 0x40000000 0x18f118 0x18f118 R E 0x4000
>> > LOAD 0x194000 0x40190000 0x40190000 0x1bd304 0x1e8230 RWE 0x4000
>> > NOTE 0x1eec08 0x401eac08 0x401eac08 0x00000 0x00000 R 0x4
>> >
>> > Here this version of QEMU puts the args at where it thinks the end of
>> > the loaded image is, which is based on the number of bytes copied from
>> > the ELF, i.e. the total MemSiz's, not taking into account the alignment
>> > gap in between, so it puts them at 0x40377348.
>> >
>> > But of course:
>> > 40378230 B ___bss_stop
>> >
>> > so it wipes them out while clearing bss during early init.
>> >
>> > Previously:
>> > 4018ebd0 T __sdata
>> > 4018f000 R ___start_rodata
>> >
>> > now:
>> > 4018ed98 T __sdata
>> > 40190000 R ___start_rodata
>> >
>> > So I'm thinking this may have been triggered by c74ba8b3480d ("arch:
>> > Introduce post-init read-only memory").
>> >
>> > The hack below does indeed reduce it to a single load section and this
>> > version of QEMU then succeeds:
>> >
>> > Program Headers:
>> > Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
>> > LOAD 0x004000 0x40000000 0x40000000 0x34d304 0x378230 RWE 0x4000
>> > NOTE 0x1eec88 0x401eac88 0x401eac88 0x00000 0x00000 R 0x4
>> >
>> > diff --git a/arch/metag/include/asm/cache.h b/arch/metag/include/asm/cache.h
>> > index a43b650cfdc0..b5c7364a94da 100644
>> > --- a/arch/metag/include/asm/cache.h
>> > +++ b/arch/metag/include/asm/cache.h
>> > @@ -20,4 +20,6 @@
>> >
>> > #define __read_mostly __attribute__((__section__(".data..read_mostly")))
>> >
>> > +#define __ro_after_init __read_mostly
>> > +
>> > #endif
>> >
>> > Kees: Is it expected to get multiple load headers like this since your
>> > patch c74ba8b3480d ("arch: Introduce post-init read-only memory"),
>> > depending on alignment of the read only section?
>>
>> I'm not expecting that, and I'm especially not expecting any LOAD to
>> have "RWE" flags. It looks like metag's vmlinux.lds.S is using the
>> common RO_DATA_SECTION, so that doesn't jump out at me as a reason for
>> RO_AFTER_INIT_DATA to end up in the wrong place. Also the second LOAD
>> in the ELF is really huge, so it's not just the RO_AFTER_INIT_DATA
>> section (which is currently very small).
>>
>> Could the metag linker be failing to override the section flags when
>> putting RO_AFTER_INIT_DATA into RO_DATA_SECTION and this cuts the LOAD
>> in half? This smells like a linker glitch.
>
> Where is that section flags override supposed to be happening?
>
> The .rodata section is indeed becoming slightly larger to fit the
> ptmx_fops __ro_after_init, and in the process changing from flag A to
> flag WA (presumably means becoming writable).
>
> This then appears to be hitting the condition in the linker whereby a
> new segment is started if a writable section is found after a readonly
> section, and the sections are on separate pages:
>
> https://sourceware.org/git/?p=binutils.git;a=blob;f=bfd/elf.c;h=8df38ee37923c020994715f111a0ffc6bae83c8c;hb=HEAD#l3952

Hm, well, I think x86, arm, and arm64 don't do this. They just clobber
the flags from the section and force it to match the master section
(i.e. ro_after_init gets rodata's flags).

>> However, since metag doesn't support CONFIG_DEBUG_RODATA, your above
>> patch is likely correct. If metag marks memory read-only at kernel
>> load time, it's doing it early enough that __ro_after_init will fail
>> to work. If it never marks memory read-only, then __ro_after_init will
>> have no effect.
>
> Yes, metag doesn't make that memory read only yet, although it would be
> possible to do so (maybe delaying the switch to 4MB pages until
> afterwards).
>
> Either way it doesn't sound that unreasonable to have multiple load
> segments generated.

The bug here is that the ro_after_init portion of .rodata shouldn't be
writable according to the ELF headers. It should be writable only
because memory protection of .rodata hasn't happened yet. :)

>> (i.e. Both situations need the proposed patch).
>
> purely to avoid the curious linker segment behaviour (which only trips
> up that QEMU because its broken), or for other reasons too?

Well, it means the linker behavior will go away since there will be no
mixing of sections with different flags, and it'll match current
expectations: __ro_after_init won't be read-only ever (same as
.rodata).

Fixing memory permissions for metag would be nice, though! :)

-Kees

>
> Thanks
> James
>
>>
>> Probably the best solution for all architectures is to invent a new
>> CONFIG_ARCH_HAS... to identify the style of kernel memory protection a
>> given architecture has so that the default for __ro_after_init can be
>> more sensible (and the warnings about mark_rodata_ro() can better
>> reflect reality).
>>
>> -Kees
>>
>> --
>> Kees Cook
>> Nexus Security



--
Kees Cook
Nexus Security