Re: [RFC/RFT PATCH 2/5] memblock: introduce generic memblock_setup_resources()

From: Mike Rapoport
Date: Wed Jun 02 2021 - 14:43:50 EST


On Wed, Jun 02, 2021 at 04:51:41PM +0100, Russell King (Oracle) wrote:
> On Wed, Jun 02, 2021 at 04:54:17PM +0300, Mike Rapoport wrote:
> > On Wed, Jun 02, 2021 at 11:15:21AM +0100, Russell King (Oracle) wrote:
> > > On Wed, Jun 02, 2021 at 11:33:10AM +0300, Mike Rapoport wrote:
> > > > On Tue, Jun 01, 2021 at 02:54:15PM +0100, Russell King (Oracle) wrote:
> > > > > If I look at one of my kernels:
> > > > >
> > > > > c0008000 T _text
> > > > > c0b5b000 R __end_rodata
> > > > > ... exception and unwind tables live here ...
> > > > > c0c00000 T __init_begin
> > > > > c0e00000 D _sdata
> > > > > c0e68870 D _edata
> > > > > c0e68870 B __bss_start
> > > > > c0e995d4 B __bss_stop
> > > > > c0e995d4 B _end
> > > > >
> > > > > So the original covers _text..__init_begin-1 which includes the
> > > > > exception and unwind tables. Your version above omits these, which
> > > > > leaves them exposed.
> > > >
> > > > Right, this needs to be fixed. Is there any reason the exception and unwind
> > > > tables cannot be placed between _sdata and _edata?
> > > >
> > > > It seems to me that they were left outside for purely historical reasons.
> > > > Commit ee951c630c5c ("ARM: 7568/1: Sort exception table at compile time")
> > > > moved the exception tables out of .data section before _sdata existed.
> > > > Commit 14c4a533e099 ("ARM: 8583/1: mm: fix location of _etext") moved
> > > > _etext before the unwind tables and didn't bother to put them into data or
> > > > rodata areas.
> > >
> > > You can not assume that all sections will be between these symbols. This
> > > isn't specific to 32-bit ARM. If you look at x86's vmlinux.lds.in, you
> > > will see that BUG_TABLE and ORC_UNWIND_TABLE are after _edata, along
> > > with many other undiscarded sections before __bss_start.
> >
> > But if you look at x86's setup_arch() all these never make it to the
> > resource tree. So there are holes in /proc/iomem between the kernel
> > resources.
>
> Also true. However, my point was to counter your claim that these
> sections should be part of the .text/.data/.rodata etc sections in the
> output vmlinux.
>
> There is, however, a more important point. The __ex_table section
> must exist and be separate from the .text/.data/.rodata sections in
> the output ELF file, as sorttable (the exception table sorter) relies
> on this to be able to find the table and sort it.
>
> So, it isn't entirely "for historical reasons" as you said two messages
> ago.

Back when __ex_table was moved out of the .data section, _sdata and _edata
were part of the .data section. Today they are not. So something like the
patch below would ensure, for instance, that __ex_table is part of
"Kernel data" in /proc/iomem without moving it into the .data section:

diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
index f7f4620d59c3..2991feceab31 100644
--- a/arch/arm/kernel/vmlinux.lds.S
+++ b/arch/arm/kernel/vmlinux.lds.S
@@ -72,13 +72,6 @@ SECTIONS
 
 	RO_DATA(PAGE_SIZE)
 
-	. = ALIGN(4);
-	__ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
-		__start___ex_table = .;
-		ARM_MMU_KEEP(*(__ex_table))
-		__stop___ex_table = .;
-	}
-
 #ifdef CONFIG_ARM_UNWIND
 	ARM_UNWIND_SECTIONS
 #endif
@@ -143,6 +136,14 @@ SECTIONS
 	__init_end = .;
 
 	_sdata = .;
+
+	. = ALIGN(4);
+	__ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
+		__start___ex_table = .;
+		ARM_MMU_KEEP(*(__ex_table))
+		__stop___ex_table = .;
+	}
+
 	RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
 	_edata = .;
 

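One way to sanity-check the effect of such a change is to look at the symbol
addresses of the resulting build. The snippet below is only a rough sketch: it
parses System.map-style lines and tests whether the exception table lies
between _sdata and _edata. The _sdata and _edata values are the ones quoted
earlier in this thread; the __start___ex_table and __stop___ex_table addresses
(and their symbol types) are made up for illustration, and the helper names
are hypothetical.

```python
# System.map-style excerpt. _sdata/_edata are taken from the listing quoted
# in this thread; the __ex_table addresses and types are HYPOTHETICAL.
SYSTEM_MAP = """\
c0008000 T _text
c0c00000 T __init_begin
c0e00000 D _sdata
c0e00000 R __start___ex_table
c0e01000 R __stop___ex_table
c0e68870 D _edata
"""


def parse_system_map(text):
    """Return {symbol: address} from 'address type name' lines."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that is not a plain symbol line
        addr, _sym_type, name = parts
        symbols[name] = int(addr, 16)
    return symbols


def range_within(symbols, start, stop, outer_start, outer_end):
    """True if [start, stop) lies inside [outer_start, outer_end)."""
    return (symbols[outer_start] <= symbols[start]
            and symbols[stop] <= symbols[outer_end])


syms = parse_system_map(SYSTEM_MAP)
ex_table_in_data = range_within(
    syms, "__start___ex_table", "__stop___ex_table", "_sdata", "_edata")
print("__ex_table between _sdata and _edata:", ex_table_in_data)
```

Running the same check against the System.map of a kernel built with and
without the patch would show whether the table actually moved.
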
> Now, bear in mind that /proc/iomem is a user API, one which userspace
> depends on. If we start going around making /proc/iomem report stuff
> like kernel boot time reservations as "reserved" memory, we will end up
> breaking the kexec tooling on some platforms. For example, kexec
> tooling for 32-bit ARM parses /proc/iomem, looking for "System RAM",
> "System RAM (boot alias)" and "reserved" regions.
>
> So, I think changes to make this "more consistent" come with high
> risk.

I agree there is a risk, but I don't think it's high. It does not look like
minor changes in "reserved" reporting in /proc/iomem would break the kexec
tooling. In any case, the amount of reserved and free memory already depends
on the particular system, kernel version, configuration and command line.
I have no intention of reporting kernel boot time reservations
in /proc/iomem on architectures that do not report them there today,
although even that does not seem like a significant factor.
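
To make the kind of parsing under discussion concrete, here is a rough sketch
of how a userspace tool might pick regions out of /proc/iomem by name. This is
not the kexec-tools implementation; the sample contents, addresses and
function names are all hypothetical, and only the "start-end : name" line
format with two-space nesting reflects what the kernel prints.

```python
import re

# HYPOTHETICAL /proc/iomem contents; real output depends on the platform.
SAMPLE_IOMEM = """\
80000000-9fffffff : System RAM
  80008000-80bfffff : Kernel code
  80e00000-80e995d3 : Kernel data
a0000000-a00fffff : reserved
"""

# "<indent><start>-<end> : <name>", addresses in hex without a 0x prefix.
LINE_RE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def parse_iomem(text):
    """Return a list of (depth, start, end, name) tuples."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # ignore lines that do not look like a resource
        indent, start, end, name = match.groups()
        depth = len(indent) // 2  # children are indented by two spaces
        entries.append((depth, int(start, 16), int(end, 16), name.strip()))
    return entries


def find_regions(entries, name):
    """Return (start, end) for every entry with the given name, any depth."""
    return [(start, end) for _depth, start, end, n in entries if n == name]


entries = parse_iomem(SAMPLE_IOMEM)
for wanted in ("System RAM", "reserved", "Kernel data"):
    for start, end in find_regions(entries, wanted):
        print(f"{wanted}: {start:#x}-{end:#x}")
```

A tool written along these lines keys on region names rather than on the
exact set or size of the regions, so it keeps working as long as the names it
looks for stay the same.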

On the other hand, making /proc/iomem reporting consistent across
architectures would reduce the complexity of both the kernel and the kexec
tools in the long run.

--
Sincerely yours,
Mike.