Re: [PATCH v3 4/6] module: script to generate offset ranges for builtin modules

From: Masahiro Yamada
Date: Mon May 20 2024 - 12:54:18 EST


On Fri, May 17, 2024 at 1:31 PM Kris Van Hees <kris.van.hees@oracle.com> wrote:
>
> The offset range data for builtin modules is generated using:
> - modules.builtin.modinfo: associates object files with module names
> - vmlinux.map: provides load order of sections and offset of first member
> per section
> - vmlinux.o.map: provides offset of object file content per section
> - .*.cmd: build cmd file with KBUILD_MODFILE and KBUILD_MODNAME
>
> The generated data will look like:
>
> .text 00000000-00000000 = _text
> .text 0000baf0-0000cb10 amd_uncore
> .text 0009bd10-0009c8e0 iosf_mbi
> ...
> .text 008e6660-008e9630 snd_soc_wcd_mbhc
> .text 008e9630-008ea610 snd_soc_wcd9335 snd_soc_wcd934x snd_soc_wcd938x
> .text 008ea610-008ea780 snd_soc_wcd9335
> ...
> .data 00000000-00000000 = _sdata
> .data 0000f020-0000f680 amd_uncore
>
> For each ELF section, it lists the offset of the first symbol. This can
> be used to determine the base address of the section at runtime.
>
> Next, it lists (in strict ascending order) offset ranges in that section
> that cover the symbols of one or more builtin modules. Multiple ranges
> can apply to a single module, and ranges can be shared between modules.
>
> Signed-off-by: Kris Van Hees <kris.van.hees@xxxxxxxxxx>
> Reviewed-by: Nick Alcock <nick.alcock@xxxxxxxxxx>
> ---
> Changes since v2:
> - 1st arg to generate_builtin_ranges.awk is now modules.builtin.modinfo
> - Switched from using modules.builtin.objs to parsing .*.cmd files
> - Parse data from .*.cmd in generate_builtin_ranges.awk
> ---
> scripts/generate_builtin_ranges.awk | 232 ++++++++++++++++++++++++++++
> 1 file changed, 232 insertions(+)
> create mode 100755 scripts/generate_builtin_ranges.awk
>
> diff --git a/scripts/generate_builtin_ranges.awk b/scripts/generate_builtin_ranges.awk
> new file mode 100755
> index 0000000000000..6975a9c7266d9
> --- /dev/null
> +++ b/scripts/generate_builtin_ranges.awk
> @@ -0,0 +1,232 @@
> +#!/usr/bin/gawk -f
> +# SPDX-License-Identifier: GPL-2.0
> +# generate_builtin_ranges.awk: Generate address range data for builtin modules
> +# Written by Kris Van Hees <kris.van.hees@xxxxxxxxxx>
> +#
> +# Usage: generate_builtin_ranges.awk modules.builtin.modinfo vmlinux.map \
> +# vmlinux.o.map > modules.builtin.ranges
> +#
> +
> +BEGIN {
> + # modules.builtin.modinfo uses \0 as record separator
> + # All other files use \n.
> + RS = "[\n\0]";
> +}


Why parse modules.builtin.modinfo instead of modules.builtin?

modules.builtin uses '\n' as its separator, so RS would not need
to be overridden.

> +
> +# Return the module name(s) (if any) associated with the given object.
> +#
> +# If we have seen this object before, return information from the cache.
> +# Otherwise, retrieve it from the corresponding .cmd file.
> +#
> +function get_module_info(fn, mod, obj, mfn, s) {


get_module_info() is declared with 5 parameters, but its only caller
passes 1 argument ( get_module_info($4) ).
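
If the four extra parameters are meant as local variables, the usual
awk convention is to separate them from the real parameters with extra
whitespace so the intent is visible. A minimal sketch of that
convention follows; basename_of() and its arguments are hypothetical
names used only for illustration and are not part of the patch.

```sh
out=$(awk '
# Parameters after the extra spaces are locals; callers never pass them.
function basename_of(path,    n, parts) {
	n = split(path, parts, "/");
	return parts[n];
}
BEGIN {
	n = 42;				# global n
	print basename_of("drivers/edac/skx_common.o");
	print n;			# unchanged: the function has its own n
}
')
printf '%s\n' "$out"
```

Writing the signature as get_module_info(fn,    mod, obj, mfn, s) would
make the single-argument call read as intended.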

> + if (fn in omod)
> + return omod[fn];
> +
> + if (match(fn, /\/[^/]+$/) == 0)
> + return "";
> +
> + obj = fn;
> + mod = "";
> + mfn = "";
> + fn = substr(fn, 1, RSTART) "." substr(fn, RSTART + 1) ".cmd";
> + if (getline s <fn == 1) {
> + if (match(s, /DKBUILD_MODNAME=[^ ]+/) > 0) {
> + mod = substr(s, RSTART + 17, RLENGTH - 17);
> + gsub(/['"]/, "", mod);
> + gsub(/:/, " ", mod);
> + }
> +
> + if (match(s, /DKBUILD_MODFILE=[^ ]+/) > 0) {
> + mfn = substr(s, RSTART + 17, RLENGTH - 17);
> + gsub(/['"]/, "", mfn);
> + gsub(/:/, " ", mfn);
> + }
> + }
> + close(fn);
> +
> +# tmp = $0;
> +# $0 = s;
> +# print mod " " mfn " " obj " " $NF;
> +# $0 = tmp;
> +
> + # A single module (common case) also reflects objects that are not part
> + # of a module. Some of those objects have names that are also a module
> + # name (e.g. core). We check the associated module file name, and if
> + # they do not match, the object is not part of a module.


You do not need KBUILD_MODNAME.

KBUILD_MODFILE alone is enough. If the same path is listed in
modules.builtin, the object belongs to a built-in module, and
the basename of that path is the modname.
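
A minimal sketch of that lookup, under stated assumptions: the
modules.builtin sample lists one "kernel/<path>.ko" entry per line, and
the KBUILD_MODFILE values have already been extracted from the .cmd
files and arrive one per line on standard input. The sample entries and
the temporary file are illustrative only.

```sh
# Hypothetical sample of modules.builtin: one "kernel/<path>.ko" per line.
list=$(mktemp)
cat > "$list" <<'EOF'
kernel/drivers/edac/skx_edac.ko
kernel/arch/x86/events/amd/amd-uncore.ko
EOF

out=$(awk '
# First file: remember each built-in module path without "kernel/" and ".ko".
FNR == NR {
	sub(/^kernel\//, "");
	sub(/\.ko$/, "");
	builtin[$0] = 1;
	next;
}
# Standard input: one KBUILD_MODFILE value per line (assumed pre-extracted).
{
	if ($0 in builtin) {
		mod = $0;
		sub(/.*\//, "", mod);	# the basename is the modname
		gsub(/-/, "_", mod);	# same normalization as the old code path
		print $0 " -> " mod;
	} else
		print $0 " -> (not built-in)";
}
' "$list" - <<'EOF'
drivers/edac/skx_edac
arch/x86/events/amd/amd-uncore
kernel/sched/core
EOF
)
rm -f "$list"
printf '%s\n' "$out"
```

With this approach an object such as kernel/sched/core is rejected
because its path is not in the list, without comparing it against
KBUILD_MODNAME.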

One more question, about a corner case.

How does your code handle an object that is shared
by multiple modules?


For example, set
CONFIG_EDAC_SKX=y
CONFIG_EDAC_I10NM=y

How is the address range of drivers/edac/skx_common.o handled?

There are 4 possibilities.

- included in skx_edac
- included in i10nm_edac
- included in both of them
- not included in any of them

The correct behavior should be "included in both of them".

Which of these does your code produce?
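
To make the expected result concrete, here is a minimal sketch of the
"included in both of them" behavior. The input format (an object
followed by the module paths it is linked into), the hard-coded
built-in set, and lib/example.o are all assumptions for illustration.
This is not what the patch does.

```sh
out=$(awk '
BEGIN {
	# Hypothetical set of built-in module paths (e.g. from modules.builtin).
	builtin["drivers/edac/skx_edac"] = 1;
	builtin["drivers/edac/i10nm_edac"] = 1;
}
# Input: <object> followed by the module path(s) it is linked into.
{
	list = "";
	for (i = 2; i <= NF; i++) {
		if (!($i in builtin))
			continue;
		mod = $i;
		sub(/.*\//, "", mod);	# keep only the basename
		if (list != "")
			list = list " ";
		list = list mod;
	}
	if (list == "")
		list = "(none)";
	print $1 ": " list;
}
' <<'EOF'
drivers/edac/skx_common.o drivers/edac/skx_edac drivers/edac/i10nm_edac
drivers/edac/skx_base.o drivers/edac/skx_edac
lib/example.o lib/example_mod
EOF
)
printf '%s\n' "$out"
```

The shared object ends up with both module names, which matches the
multi-name range lines shown in the commit description.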

> + if (mod !~ / /) {
> + if (!(mod in mods))
> + return "";
> + if (mods[mod] != mfn)
> + return "";
> + }
> +
> + # At this point, mod is a single (valid) module name, or a list of
> + # module names (that do not need validation).
> + omod[obj] = mod;
> + close(fn);
> +
> + return mod;
> +}
> +
> +FNR == 1 {
> + FC++;
> +}
> +
> +# (1-old) Build a mapping to associate object files with built-in module names.
> +#
> +# The first file argument is used as input (modules.builtin.objs).
> +#
> +FC == 1 && old_behaviour {
> + sub(/:/, "");
> + mod = $1;
> + sub(/([^/]*\/)+/, "", mod);
> + sub(/\.o$/, "", mod);
> + gsub(/-/, "_", mod);
> +
> + if (NF > 1) {
> + for (i = 2; i <= NF; i++) {
> + if ($i in mods)
> + mods[$i] = mods[$i] " " mod;
> + else
> + mods[$i] = mod;
> + }
> + } else
> + mods[$1] = mod;
> +
> + next;
> +}


Please remove the old code (the 'FC == 1 && old_behaviour' block,
which still refers to modules.builtin.objs).

> +# (1) Build a lookup map of built-in module names.
> +#
> +# The first file argument is used as input (modules.builtin.modinfo).
> +#
> +# We are interested in lines that follow the format
> +# <modname>.file=<path>
> +# and use them to record <modname>
> +#
> +FC == 1 && /^[^\.]+.file=/ {
> + gsub(/[\.=]/, " ");
> +# print $1 " -> " $3;
> + mods[$1] = $3;
> + next;
> +}


I guess parsing modules.builtin will be simpler.

> +
> +# (2) Determine the load address for each section.
> +#
> +# The second file argument is used as input (vmlinux.map).
> +#
> +# Since some AWK implementations cannot handle large integers, we strip of the
> +# first 4 hex digits from the address. This is safe because the kernel space
> +# is not large enough for addresses to extend into those digits.
> +#
> +FC == 2 && /^\./ && NF > 2 {
> + if (type)
> + delete sect_addend[type];
> +
> + if ($1 ~ /percpu/)
> + next;
> +
> + raw_addr = $2;
> + addr_prefix = "^" substr($2, 1, 6);
> + sub(addr_prefix, "0x", $2);
> + base = strtonum($2);
> + type = $1;
> + anchor = 0;
> + sect_base[type] = base;
> +
> + next;
> +}
> +
> +!type {
> + next;
> +}
> +
> +# (3) We need to determine the base address of the section so that ranges can
> +# be expressed based on offsets from the base address. This accommodates the
> +# kernel sections getting loaded at different addresses than what is recorded
> +# in vmlinux.map.
> +#
> +# At runtime, we will need to determine the base address of each section we are
> +# interested in. We do that by recording the offset of the first symbol in the
> +# section. Once we know the address of this symbol in the running kernel, we
> +# can calculate the base address of the section.
> +#
> +# If possible, we use an explicit anchor symbol (sym = .) listed at the base
> +# address (offset 0).
> +#
> +# If there is no such symbol, we record the first symbol in the section along
> +# with its offset.
> +#
> +# We also determine the offset of the first member in the section in case the
> +# final linking inserts some content between the start of the section and the
> +# first member. I.e. in that case, vmlinux.map will list the first member at
> +# a non-zero offset whereas vmlinux.o.map will list it at offset 0. We record
> +# the addend so we can apply it when processing vmlinux.o.map (next).
> +#
> +FC == 2 && !anchor && raw_addr == $1 && $3 == "=" && $4 == "." {
> + anchor = sprintf("%s %08x-%08x = %s", type, 0, 0, $2);
> + sect_anchor[type] = anchor;
> +
> + next;
> +}
> +
> +FC == 2 && !anchor && $1 ~ /^0x/ && $2 !~ /^0x/ && NF <= 4 {
> + sub(addr_prefix, "0x", $1);
> + addr = strtonum($1) - base;
> + anchor = sprintf("%s %08x-%08x = %s", type, addr, addr, $2);
> + sect_anchor[type] = anchor;
> +
> + next;
> +}
> +
> +FC == 2 && base && /^ \./ && $1 == type && NF == 4 {
> + sub(addr_prefix, "0x", $2);
> + addr = strtonum($2);
> + sect_addend[type] = addr - base;
> +
> + if (anchor) {
> + base = 0;
> + type = 0;
> + }
> +
> + next;
> +}
> +
> +# (4) Collect offset ranges (relative to the section base address) for built-in
> +# modules.
> +#
> +FC == 3 && /^ \./ && NF == 4 && $3 != "0x0" {
> + type = $1;
> + if (!(type in sect_addend))
> + next;


This assumes a 1:1 mapping of sections
between vmlinux.o and vmlinux.

How far does this assumption hold?


At the very least, it will not work with
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION.
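
A minimal sketch of one way the lookup can miss, assuming that this
option builds objects with -ffunction-sections and -fdata-sections so
that vmlinux.o carries per-symbol section names. The name
.text.some_function is hypothetical, and the input is a simplified
stand-in for vmlinux.o.map rather than its real layout.

```sh
out=$(awk '
BEGIN {
	# Section names as they would be recorded from vmlinux.map.
	sect_addend[".text"] = 0;
	sect_addend[".data"] = 0;
}
# Input: section names as they might appear in vmlinux.o.map.
{
	if ($1 in sect_addend)
		print $1 ": matched";
	else
		print $1 ": skipped";
}
' <<'EOF'
.text
.text.some_function
.data
EOF
)
printf '%s\n' "$out"
```

Any section whose name is not an exact key in sect_addend is silently
dropped, so its contents would never be attributed to a module.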



As I said in the previous review,
gawk is not listed in Documentation/process/changes.rst.

Please add it if you go with gawk.

> +
> + sub(addr_prefix, "0x", $2);
> + addr = strtonum($2) + sect_addend[type];
> +
> + mod = get_module_info($4);
> +# printf "[%s, %08x] %s [%s] %08x\n", mod_name, mod_start, $4, mod, addr;
> + if (mod == mod_name)
> + next;
> +
> + if (mod_name) {
> + idx = mod_start + sect_base[type] + sect_addend[type];
> + entries[idx] = sprintf("%s %08x-%08x %s", type, mod_start, addr, mod_name);
> + count[type]++;
> + }
> +# if (mod == "")
> +# printf "ENTRY WITHOUT MOD - MODULE MAY END AT %08x\n", addr
> +
> + mod_name = mod;
> + mod_start = addr;
> +}
> +
> +END {
> + for (type in count) {
> + if (type in sect_anchor)
> + entries[sect_base[type]] = sect_anchor[type];
> + }
> +
> + n = asorti(entries, indices);
> + for (i = 1; i <= n; i++)
> + print entries[indices[i]];
> +}
> --
> 2.43.0
>


--
Best Regards

Masahiro Yamada