Re: [RFC v2 2/6] x86/init: use linker tables to simplify x86 init and annotate dependencies

From: Luis R. Rodriguez
Date: Fri Feb 19 2016 - 19:43:14 EST


On Fri, Feb 19, 2016 at 08:40:49AM -0800, Andy Lutomirski wrote:
> On Fri, Feb 19, 2016 at 6:15 AM, Luis R. Rodriguez <mcgrof@xxxxxxxxxx> wrote:
> > Any failure on the x86 init path can be catastrophic.
> > A simple shift of a call from one place to another can
> > easily break things. Likewise adding a new call to
> > one path without considering all x86 requirements
> > can make certain x86 run time environments crash.
> > We currently account for these requirements through
> > peer code review and run time testing. We could do
> > much better if we had a clean and simple way to annotate
> > strong semantics for run time requirements, init sequence
> > dependencies, and detection mechanisms for additions of
> > new x86 init sequences.
>
> Please document this in a way that will be useful for people trying to
> understand what the code does as opposed to just people who are trying
> to understand why you wrote it. More below.

Sure. I'm kind of glad people are getting tired of me trying to
explain *why* I wrote it though, I figured that might be the harder
thing to explain. Hopefully it sinks in.

> > +/**
> > + * struct x86_init_fn - x86 generic kernel init call
> > + *
> > + * Linux x86 features vary in complexity, features may require work done at
> > + * different levels of the full x86 init sequence. Today there are also two
> > + * different possible entry points for Linux on x86, one for bare metal, KVM
> > + * and Xen HVM, and another for Xen PV guests / dom0. Assuming a bootloader
> > + * has set up 64-bit mode, roughly the x86 init sequence follows this path:
> > + *
> > + * Bare metal, KVM, Xen HVM Xen PV / dom0
> > + * startup_64() startup_xen()
> > + * \ /
> > + * x86_64_start_kernel() xen_start_kernel()
> > + * \ /
> > + * x86_64_start_reservations()
> > + * |
> > + * start_kernel()
> > + * [ ... ]
> > + * [ setup_arch() ]
> > + * [ ... ]
> > + * init
> > + *
>
>
> I don't think this paragraph below is necessary. I also think it's
> very confusing. Keep in mind that the reader has no idea what a
> "level" is at this point and that the reader also doesn't need to
> think about terms like "paravirtualization yielding".
>
> > + * x86_64_start_kernel() and xen_start_kernel() are the respective first C code
> > + * entry starting points. The different entry points exist to enable Xen to
> > + * skip a lot of hardware setup already done and managed on behalf of the
> > + * hypervisor, we refer to this as "paravirtualization yielding". The different
> > + * levels of init calls on the x86 init sequence exist to account for these
> > + * slight differences and requirements. These different entry points also share
> > + * a common entry x86 specific path, x86_64_start_reservations().

OK sure, I'll nuke.

> > + *
>
> And here, I don't even know what a "feature" is.

Kasan is an example.

> > + * A generic x86 feature can have different initialization calls, one on each
> > + * of the different main x86 init sequences, but must also address both entry
> > + * points in order to work properly across the board on all supported x86
> > + * subarchitectures. Since x86 features can also have dependencies on other
> > + * setup code or features, x86 features can at times be subordinate to other
> > + * x86 features, or conditions. struct x86_init_fn enables feature developers
> > + * to annotate dependency relationships to ensure subsequent init calls only
> > + * run once a subordinate's dependencies have run. When needed custom
> > + * dependency requirements can also be spelled out through a custom dependency
> > + * checker. In order to account for the dual entry point nature of x86-64 Linux
> > + * for "paravirtualization yielding" and to make annotations for support for
> > + * these explicit each struct x86_init_fn must specify supported
> > + * subarchitectures. The earliest x86-64 code can read the subarchitecture
> > + * though is after load_idt(), as such the earliest we can currently rely on
> > + * subarchitecture for semantics and a common init sequences is on the shared
> > + * common x86_64_start_reservations(). Each struct x86_init_fn must also
> > + * declare a two-digit decimal number to impose an ordering relative to other
> > + * features when required.
>
> I'm totally lost in the paragraph above.

As an example Kasan could be defined as a struct x86_init_fn with a respective
callback on x86_64_start_kernel() (kasan_early_init()) and later setup_arch()
(kasan_init()). Since we can't support for struct x86_init_fn calls to start all
the way from x86_64_start_kernel() (given the issues I had noted before)
the earliest a struct x86_init_fn can be used for is for a callback on
x86_64_start_reservations().

It just means we can different callbacks used for different parts of the
x86 init process, it explains the limits to the callbacks we can have right now.

> > + *
> > + * x86_init_fn enables strong semantics and dependencies to be defined and
> > + * implemented on the full x86 initialization sequence.
>
> Please try explaining what this is, instead. For example:
>
> An x86_init_fn represents a function to be called at a certain point
> during initialization on a specific set of subarchitectures.

OK thanks.

> > + *
> > + * @order_level: must be set, linker order level, this corresponds to the table
> > + * section sub-table index, we record this only for semantic validation
> > + * purposes.
>
> I read this as "this is purely a debugging feature". Is it? I think it's not.

Nope, the order_level really is used for linker table sorting, but since we can
modify the table for these structs with a run time sort later, we want to ensure
that the order level is still respected after we sort at run time.

> > Order-level is always required however you typically would
> > + * only use X86_INIT_NORMAL*() and leave ordering to be done by placement
> > + * of code in a C file and the order of objects through a Makefile. Custom
> > + * order-levels can be used when order on C file and order of objects on
> > + * Makfiles does not suffice or much further refinements are needed.
>
>
> Assuming I understand this correctly, here's how I'd write it:
>
> @order_level: The temporal order of this x86_init_fn. Lower
> order_level numbers are called first. Ties are broken by order found
> in the Makefile and then by order in the C file.

Thanks... yes.

>
> Note, however, that my proposed explanation can't be right because it
> appears to conflict with "depend". Please adjust accordingly.
>
> > + * @supp_hardware_subarch: must be set, it represents the bitmask of supported
> > + * subarchitectures. We require each struct x86_init_fn to have this set
> > + * to require developer considerations for each supported x86
> > + * subarchitecture and to build strong annotations of different possible
> > + * run time states particularly in consideration for the two main
> > + * different entry points for x86 Linux, to account for paravirtualization
> > + * yielding.
> '
> Too much motivation, too little documentation.
>
> @supp_hardware_subarch: A bitmask of subarchitectures on which to call
> this init function.

OK thanks.

> --- start big deletion ---
>
> > + *
> > + * The subarchitecture is read by the kernel at early boot from the
> > + * struct boot_params hardware_subarch. Support for the subarchitecture
> > + * exists as of x86 boot protocol 2.07. The bootloader would have set up
> > + * the respective hardware_subarch on the boot sector as per
> > + * Documentation/x86/boot.txt.
> > + *
> > + * What x86 entry point is used is determined at run time by the
> > + * bootloader. Linux pv_ops was designed to help enable to build one Linux
> > + * binary to support bare metal and different hypervisors. pv_ops setup
> > + * code however is limited in that all pv_ops setup code is run late in
> > + * the x86 init sequence, during setup_arch(). In fact cpu_has_hypervisor
> > + * only works after early_cpu_init() during setup_arch(). If an x86
> > + * feature requires an earlier determination of what hypervisor was used,
> > + * or if it needs to annotate only support for certain hypervisors, the
> > + * x86 hardware_subarch should be set by the bootloader and
> > + * @supp_hardware_subarch set by the x86 feature. Using hardware_subarch
> > + * enables x86 features to fill the semantic gap between the Linux x86
> > + * entry point used and what pv_ops has to offer through a hypervisor
> > + * agnostic mechanism.
>
> --- end big deletion ---

OK!

> > + *
> > + * Each supported subarchitecture is set using the respective
> > + * X86_SUBARCH_* as a bit in the bitmask. For instance if a feature
> > + * is supported on PC and Xen subarchitectures only you would set this
> > + * bitmask to:
> > + *
> > + * BIT(X86_SUBARCH_PC) |
> > + * BIT(X86_SUBARCH_XEN);
>
> I like this part, but how about "For instance, if an init function
> should be called on PC and Xen subarchitectures only, you would set
> the bitmask to..."?

Sure.

> > + *
> > + * @detect: optional, if set returns true if the feature has been detected to
> > + * be required, it returns false if the feature has been detected to not
> > + * be required.
>
> I have absolutely no idea what this means.

Well, so the order level stuff is used at link time, there however are some
run time environment conditions that might be important to consider as well,
the detect lets you write a C code routine which will check to see if the
feature should run, it returns false if the feature should not run or is
not required to run.

A cheesy way to implement a cheesy graph is to then use the same callback
as a pointer for dependency, so that if a feature depends on another, both
are on the same order level, and if they have a run time dependency
annotation, they can further annotate this relationship by having the
thing that depends, set its depends pointer to refer to the other
feature's detect callback.

This is borrowed from the IOMMU init code in the kernel. Same logic.
Only thing is we also get link order support too. Link order semantics
and logic should suffice for most things, but not all as with the IOMMU
init stuff.

> > + * @depend: optional, if set this set of init routines must be called prior to
> > + * the init routine who's respective detect routine we have set this
> > + * depends callback to. This is only used for sorting purposes given
> > + * all current init callbacks have a void return type. Sorting is
> > + * implemented via x86_init_fn_sort(), it must be called only once,
> > + * however you can delay sorting until you need it if you can ensure
> > + * only @order_level and @supp_hardware_subarch can account for proper
> > + * ordering and dependency requirements for all init sequences prior.
> > + * If you do not have a depend callback set its assumed the order level
> > + * (__x86_init_fn(level)) set by the init routine suffices to set the
> > + * order for when the feature's respective callbacks are called with
> > + * respect to other calls. Sorting of init calls with the same order level
> > + * is determined by linker order, determined by order placement on C code
> > + * and order listed on a Makefile. A routine that depends on another is
> > + * known as being subordinate to the init routine it depends on. Routines
> > + * that are subordinate must have an order-level of lower priority or
> > + * equal priority than the order-level of the init sequence it depends on.
>
> I don't understand this at all. I assume you're saying that some kind
> of topological sorting happens. This leads to a question: why is
> "depend" a function pointer? Shouldn't it be x86_init_fn*?

That's fine too. It was just borrowing the implementation from
the iommu solution. I've modified that considerably so could not
just extend it. After some more though I also considered generalizing
it but decided against it given that run time sorting can depend
on very custom semantics and most importantly data structure size.

> Also,
> what happens if you depend on something that is disabled on the
> running subarch?

We have a flag that annotates if a callback runs, if the flag is not
set then a depends thing will ensure to not call the routine that
has depends set.

> And what happens if the depend-implied order is
> inconsistent with order_level.

Ah -- that's what I was trying to explain above in the comment
about inconsistent state -- a run time check *after* we do run
time sorting goes through and checks for sanity.

All this is a light weight feature graph, if you will, other
folks are working on different solutions for this and while
I have considered sharing a solution for early init code this
seems a bit more ideal for now. Later in the future though we
should probably get together and talk about a proper feature
graph, with more consistent checks, etc. Its another reason
why I am also curious to see a SAT solver at run time in the
future, but this is super long term [0]...

[0] http://kernelnewbies.org/KernelProjects/kconfig-sat

> I would hope that order_level breaks ties in the topological sort and
> that there's a bit fat warning if the order_level ordering is
> inconsistent with the topological ordering. Preferably the warning
> would be at build time, in which case it could be an error.

Order level will always be respected at link time. The depend
and detect thing are heuristics used at run time and since
it depends on run time state we can't do this at build time.
The big fat hairy nasty warning is there though. Curious
enough I ended up finding out that the sanity check for
IOMMU was intended to only be run at debug time but the
debug thing was *always* set to true, so we've been running
the sanity checker in the kernel for ages. Its why I also
feel a bit more comfortable in keeping it live and matching
more close to its implementation. I've gone to pains to
build off of the IOMMU init solution and simply extend
on the semantics there. Turns out there was much room for
enhancements. All the fine tuning are documented in the
userspace tree.

> > + * @early_init: required, routine which will run in x86_64_start_reservations()
> > + * after we ensure boot_params.hdr.hardware_subarch is accessible and
> > + * properly set. Memory is not yet available. This the earliest we can
> > + * currently define a common shared callback since all callbacks need to
> > + * check for boot_params.hdr.hardware_subarch and this becomes accessible
> > + * on x86-64 until after load_idt().
>
> What's this for? Under what conditions is it called? What order?
> Why is it part of x86_init_fn at all?

Its one of the reasons we are adding this, callbacks to for features.
Its called if the its requirements are met (subarch, and depends).
I'll work on extending this to be clearer.

> > + * @flags: optional, bitmask of enum x86_init_fn_flags
>
> What are these flags? What do they do?

They tell us the supported subarchs that must be met matched
at run time in order for the feature's callback to run.

> > + */
> > +struct x86_init_fn {
> > + __u32 order_level;
> > + __u32 supp_hardware_subarch;
> > + bool (*detect)(void);
> > + bool (*depend)(void);
> > + void (*early_init)(void);
> > + __u32 flags;
> > +};
> > +
> > +/**
> > + * enum x86_init_fn_flags: flags for init sequences
> > + *
> > + * X86_INIT_FINISH_IF_DETECTED: tells the core that once this init sequence
> > + * has completed it can break out of the loop for init sequences on
> > + * its own level.
>
> What does this mean?

It borrows the logic from the IOMMU stuff, it enables the core to detect
if *any* routine in the linker table has completed we can then now just
bail and stop going through the rest of the linker table.

> > + * X86_INIT_DETECTED: private flag. Used by the x86 core to annotate that this
> > + * init sequence has been detected and it all of its callbacks
> > + * must be run during initialization.
>
>
> Please make this an entry in a new field scratch_space or similar and
> just document the entire field as "private to the init core code".

OK.

> > +static struct x86_init_fn *x86_init_fn_find_dep(struct x86_init_fn *start,
> > + struct x86_init_fn *finish,
> > + struct x86_init_fn *q)
> > +{
> > + struct x86_init_fn *p;
> > +
> > + if (!q)
> > + return NULL;
> > +
> > + for (p = start; p < finish; p++)
> > + if (p->detect == q->depend)
> > + return p;
>
> That's very strange indeed, and it doesn't seem consistent with my
> explanation. Please fix up the docs to explain what's going on.
> Again, as a reviewer and eventual user of this code, I do not need to
> know what historical problem it solves. I need to know what the code
> does and how to use it.

Understood. To help please note that this borrows from the IOMMU stuff,
which has even less documentation. I was trying to actually put some
good effort to fix that for our sake and to enhance the semantics
further. If things seem confusing just keep in mind this is the
IOMMU init solution with with enhanced semantics and linker table
support.

> > +
> > +void __ref x86_init_fn_init_tables(void)
>
> Why is this __ref and not __init?

Yeah __init should work better, thanks.

Luis