Re: [RFC v2 2/6] x86/init: use linker tables to simplify x86 init and annotate dependencies

From: Andy Lutomirski
Date: Sat Feb 20 2016 - 02:31:44 EST


On Fri, Feb 19, 2016 at 4:42 PM, Luis R. Rodriguez <mcgrof@xxxxxxxxx> wrote:
> On Fri, Feb 19, 2016 at 08:40:49AM -0800, Andy Lutomirski wrote:
>> On Fri, Feb 19, 2016 at 6:15 AM, Luis R. Rodriguez <mcgrof@xxxxxxxxxx> wrote:
>> > Any failure on the x86 init path can be catastrophic.
>> > A simple shift of a call from one place to another can
>> > easily break things. Likewise adding a new call to
>> > one path without considering all x86 requirements
>> > can make certain x86 run time environments crash.
>> > We currently account for these requirements through
>> > peer code review and run time testing. We could do
>> > much better if we had a clean and simple way to annotate
>> > strong semantics for run time requirements, init sequence
>> > dependencies, and detection mechanisms for additions of
>> > new x86 init sequences.
>>
>> Please document this in a way that will be useful for people trying to
>> understand what the code does as opposed to just people who are trying
>> to understand why you wrote it. More below.
>
> Sure. I'm kind of glad people are getting tired of me trying to
> explain *why* I wrote it though, I figured that might be the harder
> thing to explain. Hopefully it sinks in.
>
>> > +/**
>> > + * struct x86_init_fn - x86 generic kernel init call
>> > + *
>> > + * Linux x86 features vary in complexity; features may require work done at
>> > + * different levels of the full x86 init sequence. Today there are also two
>> > + * possible entry points for Linux on x86: one for bare metal, KVM and Xen
>> > + * HVM, and another for Xen PV guests / dom0. Assuming a bootloader has set
>> > + * up 64-bit mode, the x86 init sequence roughly follows this path:
>> > + *
>> > + * Bare metal, KVM, Xen HVM               Xen PV / dom0
>> > + *        startup_64()                    startup_xen()
>> > + *              \                              /
>> > + *      x86_64_start_kernel()          xen_start_kernel()
>> > + *                      \                  /
>> > + *                 x86_64_start_reservations()
>> > + *                            |
>> > + *                      start_kernel()
>> > + *                      [ ... ]
>> > + *                      [ setup_arch() ]
>> > + *                      [ ... ]
>> > + *                        init
>> > + *
>>
>>
>> I don't think this paragraph below is necessary. I also think it's
>> very confusing. Keep in mind that the reader has no idea what a
>> "level" is at this point and that the reader also doesn't need to
>> think about terms like "paravirtualization yielding".
>>
>> > + * x86_64_start_kernel() and xen_start_kernel() are the respective first C code
>> > + * entry starting points. The different entry points exist to enable Xen to
>> > + * skip a lot of hardware setup already done and managed on behalf of the
>> > + * hypervisor, we refer to this as "paravirtualization yielding". The different
>> > + * levels of init calls on the x86 init sequence exist to account for these
>> > + * slight differences and requirements. These different entry points also share
>> > + * a common entry x86 specific path, x86_64_start_reservations().
>
> OK sure, I'll nuke.
>
>> > + *
>>
>> And here, I don't even know what a "feature" is.
>
> Kasan is an example.

Is kasan different from the cr4 shadow in this context? I don't
really feel like calling the cr4 shadow a "feature" is meaningful. Am
I misunderstanding something?

>
>> > + * A generic x86 feature can have different initialization calls, one on each
>> > + * of the different main x86 init sequences, but must also address both entry
>> > + * points in order to work properly across the board on all supported x86
>> > + * subarchitectures. Since x86 features can also have dependencies on other
>> > + * setup code or features, x86 features can at times be subordinate to other
>> > + * x86 features, or conditions. struct x86_init_fn enables feature developers
>> > + * to annotate dependency relationships to ensure subsequent init calls only
>> > + * run once a subordinate's dependencies have run. When needed custom
>> > + * dependency requirements can also be spelled out through a custom dependency
>> > + * checker. In order to account for the dual entry point nature of x86-64 Linux
>> > + * for "paravirtualization yielding" and to make annotations for support for
>> > + * these explicit each struct x86_init_fn must specify supported
>> > + * subarchitectures. The earliest x86-64 code can read the subarchitecture
>> > + * though is after load_idt(), as such the earliest we can currently rely on
>> > + * subarchitecture for semantics and a common init sequences is on the shared
>> > + * common x86_64_start_reservations(). Each struct x86_init_fn must also
>> > + * declare a two-digit decimal number to impose an ordering relative to other
>> > + * features when required.
>>
>> I'm totally lost in the paragraph above.
>
> As an example, Kasan could be defined as a struct x86_init_fn with a
> callback on x86_64_start_kernel() (kasan_early_init()) and a later one on
> setup_arch() (kasan_init()). Since we can't support struct x86_init_fn calls
> starting all the way from x86_64_start_kernel() (given the issues I had
> noted before), the earliest a struct x86_init_fn callback can currently run
> is x86_64_start_reservations().

I'm still lost. Kasan is a pile of code that does something. Kasan
has some initialization code. Do you mean that kasan's initialization
code could be encapsulated in one or more x86_init_fn instances?
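
I.e., something like this, perhaps (my sketch; the field values are
made up and I'm ignoring however the entry actually gets placed into
the linker table):

static struct x86_init_fn kasan_x86_init_fn = {
	/* values invented for illustration only */
	.order_level = 30,
	.supp_hardware_subarch = BIT(X86_SUBARCH_PC) | BIT(X86_SUBARCH_XEN),
	.early_init = kasan_early_init,
};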

>
> It just means we can have different callbacks used for different parts of the
> x86 init process; it explains the limits on the callbacks we can have right now.

Do you mean that there would be different linker tables containing
x86_init_fn instances for each of these callback points? Or is
"early_init" the only such callback point.

BTW, why is early_init called early_init? Why not call it "callback"?
Then you could reuse this for other things.

>
>> > + *
>> > + * x86_init_fn enables strong semantics and dependencies to be defined and
>> > + * implemented on the full x86 initialization sequence.
>>
>> Please try explaining what this is, instead. For example:
>>
>> An x86_init_fn represents a function to be called at a certain point
>> during initialization on a specific set of subarchitectures.
>
> OK thanks.
>
>> > + *
>> > + * @order_level: must be set, linker order level, this corresponds to the table
>> > + * section sub-table index, we record this only for semantic validation
>> > + * purposes.
>>
>> I read this as "this is purely a debugging feature". Is it? I think it's not.
>
> Nope, the order_level really is used for linker table sorting, but since we can
> modify the table for these structs with a run time sort later, we want to ensure
> that the order level is still respected after we sort at run time.
>
>> > Order-level is always required however you typically would
>> > + * only use X86_INIT_NORMAL*() and leave ordering to be done by placement
>> > + * of code in a C file and the order of objects through a Makefile. Custom
>> > + * order-levels can be used when order on C file and order of objects on
>> > + * Makefiles does not suffice or further refinement is needed.
>>
>>
>> Assuming I understand this correctly, here's how I'd write it:
>>
>> @order_level: The temporal order of this x86_init_fn. Lower
>> order_level numbers are called first. Ties are broken by order found
>> in the Makefile and then by order in the C file.
>
> Thanks... yes.
>
>>
>> Note, however, that my proposed explanation can't be right because it
>> appears to conflict with "depend". Please adjust accordingly.
>>
>> > + * @supp_hardware_subarch: must be set, it represents the bitmask of supported
>> > + * subarchitectures. We require each struct x86_init_fn to have this set
>> > + * to require developer considerations for each supported x86
>> > + * subarchitecture and to build strong annotations of different possible
>> > + * run time states particularly in consideration for the two main
>> > + * different entry points for x86 Linux, to account for paravirtualization
>> > + * yielding.
>>
>> Too much motivation, too little documentation.
>>
>> @supp_hardware_subarch: A bitmask of subarchitectures on which to call
>> this init function.
>
> OK thanks.
>

>
>> > + *
>> > + * @detect: optional, if set returns true if the feature has been detected to
>> > + * be required, it returns false if the feature has been detected to not
>> > + * be required.
>>
>> I have absolutely no idea what this means.
>
> Well, the order level stuff is used at link time; there are, however, some
> run time environment conditions that might be important to consider as well.
> The detect callback lets you write a C routine which checks whether the
> feature should run; it returns false if the feature should not run or is
> not required to run.

So how is:

static bool detect_foo(void) { return whatever; }
static void init_foo(void) { do_something(); }

.detect = detect_foo,
.early_init = init_foo,

different from:

static void init_foo(void)
{
	if (whatever)
		do_something();
}

.early_init = init_foo,


>
> A cheesy way to implement a cheesy graph is to reuse the detect callback as
> a dependency pointer: if a feature depends on another, both are put on the
> same order level, and the dependent feature annotates the run time
> dependency by setting its depend pointer to refer to the other feature's
> detect callback.
>
> This is borrowed from the IOMMU init code in the kernel. Same logic.
> The only difference is that we also get link order support. Link order
> semantics and logic should suffice for most things, but not all, as with
> the IOMMU init stuff.
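
If I'm reading you right, the pattern you're borrowing would look
roughly like this (my sketch, with hypothetical feature names):

static bool detect_foo(void) { return foo_is_needed(); }
static void init_foo(void) { /* set up foo */ }
static void init_bar(void) { /* set up bar; needs foo to have run */ }

static struct x86_init_fn foo_fn = {
	.order_level = 30,
	.supp_hardware_subarch = BIT(X86_SUBARCH_PC),
	.detect = detect_foo,
	.early_init = init_foo,
};

static struct x86_init_fn bar_fn = {
	.order_level = 30,			/* same level as foo */
	.supp_hardware_subarch = BIT(X86_SUBARCH_PC),
	.depend = detect_foo,			/* "I depend on foo" */
	.early_init = init_bar,
};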

But this doesn't work at all if .detect isn't set.

>> And what happens if the depend-implied order is
>> inconsistent with order_level?
>
> Ah -- that's what I was trying to explain above in the comment
> about inconsistent state -- a run time check *after* we do run
> time sorting goes through and checks for sanity.
>
> All this is a lightweight feature graph, if you will. Other folks are
> working on different solutions for this, and while I have considered
> sharing a solution for early init code, this seems more ideal for now.
> Later on, though, we should probably get together and talk about a proper
> feature graph, with more consistent checks, etc. It's another reason why
> I am also curious to see a SAT solver at run time in the future, but this
> is super long term [0]...
>
> [0] http://kernelnewbies.org/KernelProjects/kconfig-sat
>
>> I would hope that order_level breaks ties in the topological sort and
>> that there's a big fat warning if the order_level ordering is
>> inconsistent with the topological ordering. Preferably the warning
>> would be at build time, in which case it could be an error.
>
> Order level will always be respected at link time. The depend
> and detect things are heuristics used at run time, and since
> they depend on run time state we can't do this at build time.
> The big fat hairy nasty warning is there though. Curiously
> enough, I found out that the sanity check for IOMMU was
> intended to only run at debug time, but the debug flag was
> *always* set to true, so we've been running the sanity checker
> in the kernel for ages. It's why I also feel a bit more
> comfortable keeping it live and matching its implementation
> more closely. I've gone to pains to build off of the IOMMU
> init solution and simply extend the semantics there. Turns out
> there was much room for enhancement. All the fine tuning is
> documented in the userspace tree.

So is this what you're saying:

Callbacks are called in an order consistent with the dependencies
specified by @depend. Ties are broken first by @order_level and then
by linker order. It is an error for an x86_init_fn with lower
@order_level to depend on an x86_init_fn with higher @order_level.

>
>> > + * @early_init: required, routine which will run in x86_64_start_reservations()
>> > + * after we ensure boot_params.hdr.hardware_subarch is accessible and
>> > + * properly set. Memory is not yet available. This is the earliest we can
>> > + * currently define a common shared callback since all callbacks need to
>> > + * check for boot_params.hdr.hardware_subarch, and on x86-64 this only
>> > + * becomes accessible after load_idt().
>>
>> What's this for? Under what conditions is it called? What order?
>> Why is it part of x86_init_fn at all?
>
> It's one of the reasons we are adding this: callbacks for features.
> It's called if its requirements are met (subarch and depends).
> I'll work on extending this to be clearer.

I think this should just be ".callback". Let the context be implied
by the linker table in which it appears.
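
I.e., roughly (a sketch of what I mean, reusing the fields from your
patch):

struct x86_init_fn {
	__u32 order_level;
	__u32 supp_hardware_subarch;
	bool (*detect)(void);
	bool (*depend)(void);
	void (*callback)(void);	/* meaning given by the table it sits in */
	__u32 flags;
};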
>
>> > + */
>> > +struct x86_init_fn {
>> > +	__u32 order_level;
>> > +	__u32 supp_hardware_subarch;
>> > +	bool (*detect)(void);
>> > +	bool (*depend)(void);
>> > +	void (*early_init)(void);
>> > +	__u32 flags;
>> > +};
>> > +
>> > +/**
>> > + * enum x86_init_fn_flags: flags for init sequences
>> > + *
>> > + * X86_INIT_FINISH_IF_DETECTED: tells the core that once this init sequence
>> > + * has completed it can break out of the loop for init sequences on
>> > + * its own level.
>>
>> What does this mean?
>
> It borrows the logic from the IOMMU stuff: it enables the core to detect
> that *any* routine in the linker table has completed, at which point we can
> just bail and stop going through the rest of the linker table.
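
So the core's walk over the table would be something like this, if I
follow (my sketch, not your actual code -- and I'm not sure whether
"its own level" means the whole table or just entries at the same
order_level):

	struct x86_init_fn *fn;

	/* table_start/table_end stand in for however the table is bounded */
	for (fn = table_start; fn < table_end; fn++) {
		if (fn->detect && !fn->detect())
			continue;
		fn->early_init();
		if (fn->flags & X86_INIT_FINISH_IF_DETECTED)
			break;	/* one entry "won", skip the rest */
	}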

Can you give an example for which this would be used?

--Andy