Re: Status of tip/x86/apic

From: Jiang Liu
Date: Sun Dec 14 2014 - 05:57:44 EST


On 2014/12/13 4:35, Thomas Gleixner wrote:
> Folks,
>
> after mulling this in my head for quite some time, I'm going to
> postpone the whole thing for 3.20.
>
> That said, I need to say, that I'm really happy with the outcome of
> this massive overhaul. I really want to thank all involved people,
> especially Jiang, for their great work and help so far!!!
>
> The hierarchical irq domains really improve the code by disentangling
> the various subsystems and the arm[64] use cases just prove that it
> was the right decision.
>
> We're almost there with x86 but my gut feeling tells me that pushing
> it now is too risky. I rather prefer quiet holidays for all of us than
> the nagging fear that the post holiday inbox will be full of obscure
> bug reports and we then start a chase and bandaid race which will kill
> the well earned recreation in an instant.
Hi Thomas,
It's safer to let it mature in the tip tree for another merge
window :)

>
> This will block other things in that area for a while, but it's the
> only sane decision at the moment, unless Linus insists on pulling the
> lot and promises to deal with the fallout. :)
>
> The reasons why I decided to do so are:
>
> - The bugs we found in the last week. That tells me that there is
> some more stuff lurking.
>
> - The already existing mess in some areas which got unearthed by
> this work in the last week. That definitely needs a thorough
> cleanup and not some more bandaids.
>
> - Lack of proper debugging features. Sending out per issue debug
> patches simply does not scale.
>
> - It's not bisectable and unfortunately there are too many fixes to
> various places to make manual bisection feasible.
>
> For 3.20 I want to proceed in the following way:
>
> - Apply all bug fixes to x86/apic
>
> - Address the issues with the resource management (and elsewhere)
> proper on top
>
> - Add a proper debugging mechanism (the existing irqdomain debugfs
> interface is completely useless).
>
> For the hierarchical domains we really want two things:
>
> 1) A debugfs interface which lets us introspect the hierarchy.
>
> I was working on that before I got dragged into bug chasing and
> merge window frenzy.
>
> For proper introspection down to the hardware level this
> requires either domain/irq_chip specific callbacks or some
> unified way to track the current state. The latter is painful as
> it requires storing information redundantly.
>
> So having domain/chip callbacks to retrieve the state is the
> right solution. Most chip/domain implementations cache their
> [hardware] state already, so providing an accessor to convert
> that into a common data format is the best way. If the callback
> is not implemented then the information is not available or
> maybe not relevant.
>
> I'm not going to have a per domain/chip seqfile print function
> as this is just a complete waste. Pretty printing obscure
> hardware information does not help much for the general user. We
> rather have the raw data and proper post processing tools which
> can provide that pretty print information than bloating the
> kernel binary with randomized and possibly useless seq_print
> functions.
>
> Another reason why I want just raw binary data is that I want to
> use exactly the same mechanism for tracing. See below.
>
> After looking at the various new domain/chip implementations it's
> sufficient to have 16 bytes of storage space for this, but
> that's a minor detail.
>
> To provide a proper translation into pretty printed values we
> can do the following:
>
> Create a new section for storing such data and have a data
> structure there which describes the content of the buffer. That
> section goes into a separate file and is not linked into the
> kernel binary. Simple enough for tools to pick up and for bug
> reporters to use/provide. If the stupid file is not available
> we still can recreate it from source and translate the hex
> dump. And in most cases the pure hexdump will be sufficient
> for the people who actually need to look at this.
>
> 2) Proper trace point support so we can actually track allocation
> and the hardware access at the various domain levels because
> some of these issues cannot be decoded by looking at a state
> snapshot in debugfs. With some of them we even can't access
> debugfs at all.
>
> Though one issue with that is, that for the early boot process
> there is no way to store that information as the tracer gets
> enabled way after init_IRQ(). But there is no reason why the
> tracer could not be enabled before that. All it needs is a
> working memory allocator. Steven?
>
> Now there is another class of problems which might be hard to
> debug. When the machine just boots into a hang, we don't get any
> ftrace output, neither from an oops nor from a console. It would
> be nice if we could have a command line option which prints
> enabled trace points via (early_)printk. That would avoid
> sending out ad hoc printk debug patches which will basically
> provide the same information as the trace_points. That would be
> useful for other hard to debug boot hangs as well. Steven?
>
> I think the above can be solved, so we need to agree on a proper
> set of tracepoints. I came up with the following list:
>
> - trace_irqdomain_create(domain->id, domain->name, ...)
> - trace_irqdomain_destroy(domain->id)
>
> - trace_irqdomain_alloc(irq_data)
>
> struct irq_data contains all relevant information for
> assigning the tracepoint data.
>
> __entry->virq = irq_data->virq;
> __entry->domainid = irq_data->domain;
> __entry->hwirq = irq_data->hwirq;
> TP_STORE_DATA(__entry->data, irq_data);
>
> Where TP_STORE_DATA checks for the above callback and uses it
> if available, otherwise we just clear the data field.
>
> So this reuses the callback which we want for debugfs
> anyway. The print format is just hexdump. See my above
> rationale for that.
>
> - trace_irqdomain_free(virq, domain->id)
>
> - trace_irqdomain_hw_access(irq_data)
>
> Same "data" and pretty printing argument as for
> trace_irqdomain_alloc()
>
> The obvious place to put such a trace point is
> e.g. irq_chip_write_msi_msg() where the callback records the
> currently written msi msg.
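
To make the TP_STORE_DATA idea above a bit more concrete, here is a
minimal user-space sketch, not kernel code. All names in it
(struct sketch_irq_data, the get_state callback, tp_store_data(),
IRQ_DBG_DATA_SIZE, struct sketch_msi_msg) are hypothetical. The only
things taken from the mail are the 16 bytes of raw storage, the optional
per-chip accessor, and clearing the data field when no callback exists.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define IRQ_DBG_DATA_SIZE 16	/* hypothetical name for the 16 bytes of storage */

struct sketch_irq_data;

/* Hypothetical per-chip accessor: copy cached hardware state into buf. */
typedef void (*get_state_fn)(const struct sketch_irq_data *d,
			     uint8_t buf[IRQ_DBG_DATA_SIZE]);

/* Reduced stand-in for struct irq_data; not the real kernel layout. */
struct sketch_irq_data {
	unsigned int virq;
	unsigned long hwirq;
	get_state_fn get_state;	/* NULL if the chip provides no state */
	const void *chip_data;	/* chip-private cached state */
};

/*
 * Stand-in for the proposed TP_STORE_DATA(): use the callback if it is
 * available, otherwise leave the data field cleared.
 */
static void tp_store_data(uint8_t dst[IRQ_DBG_DATA_SIZE],
			  const struct sketch_irq_data *d)
{
	memset(dst, 0, IRQ_DBG_DATA_SIZE);
	if (d->get_state)
		d->get_state(d, dst);
}

/* Example chip state: an MSI-like message (assumed layout). */
struct sketch_msi_msg {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

/* Example accessor: expose the cached message as raw bytes. */
static void msi_get_state(const struct sketch_irq_data *d,
			  uint8_t buf[IRQ_DBG_DATA_SIZE])
{
	assert(sizeof(struct sketch_msi_msg) <= IRQ_DBG_DATA_SIZE);
	memcpy(buf, d->chip_data, sizeof(struct sketch_msi_msg));
}

/* Print format is a plain hexdump; out must hold 2 * len + 1 bytes. */
static void hexdump(const uint8_t *buf, size_t len, char *out)
{
	for (size_t i = 0; i < len; i++)
		sprintf(out + 2 * i, "%02x", buf[i]);
}
```

A caller would fill a struct sketch_irq_data, call tp_store_data() on a
16-byte buffer and pass that buffer to hexdump(). The same accessor could
serve both the debugfs snapshot and the tracepoint, which is the reuse
argued for above.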
>
> Once we have sorted that, I'll push x86/apic into a separate git
> repository so the history is preserved.
>
> After that I'll redo x86/apic from scratch with proper ordering and
> all fixes folded to the right places so the whole thing becomes
> bisectable.
>
> Thoughts?
This really sounds like a good idea for debugging interrupts.
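
For the raw-data approach in 1), a post-processing tool could translate
the hex dump with a small descriptor table kept outside the kernel
binary. The following is only a rough user-space sketch under my own
assumptions: the descriptor format (name, offset, size), the MSI-style
layout, the little-endian byte order and all sketch_* names are
hypothetical and not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define IRQ_DBG_DATA_SIZE 16	/* hypothetical name for the 16-byte buffer */

/* Hypothetical descriptor for one field inside the raw buffer. */
struct sketch_field_desc {
	const char *name;
	uint8_t offset;	/* byte offset into the buffer */
	uint8_t size;	/* field width in bytes, at most 4 */
};

/* Assumed layout of an MSI-style record; a tool would load this table. */
static const struct sketch_field_desc sketch_msi_layout[] = {
	{ "address_lo", 0, 4 },
	{ "address_hi", 4, 4 },
	{ "data",       8, 4 },
};

/* Read 'size' bytes as a little-endian value (assumed byte order). */
static uint32_t read_le(const uint8_t *p, uint8_t size)
{
	uint32_t v = 0;

	for (uint8_t i = 0; i < size; i++)
		v |= (uint32_t)p[i] << (8 * i);
	return v;
}

/*
 * Pretty-print buf according to desc into out as "name=0x... name=0x...".
 * Returns the number of characters written (without the NUL), or -1 if a
 * descriptor does not fit the buffer or out is too small.
 */
static int sketch_decode(const uint8_t buf[IRQ_DBG_DATA_SIZE],
			 const struct sketch_field_desc *desc, size_t ndesc,
			 char *out, size_t outlen)
{
	size_t used = 0;

	assert(buf && desc && out);
	if (outlen == 0)
		return -1;
	out[0] = '\0';

	for (size_t i = 0; i < ndesc; i++) {
		int n;

		if (desc[i].size > 4 ||
		    desc[i].offset + desc[i].size > IRQ_DBG_DATA_SIZE)
			return -1;

		n = snprintf(out + used, outlen - used, "%s%s=0x%x",
			     i ? " " : "", desc[i].name,
			     (unsigned int)read_le(buf + desc[i].offset,
						   desc[i].size));
		if (n < 0 || (size_t)n >= outlen - used)
			return -1;
		used += (size_t)n;
	}
	return (int)used;
}
```

With something like this, bug reporters would only need to send the hex
dump, and whoever has the descriptor file can decode it afterwards.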

So I will work on the following items for 3.20:
1) Continue to convert PCI MSI code into generic MSI code
as much as possible.
2) Simplify interrupt remapping initialization on x86; the first
version has been posted at: https://lkml.org/lkml/2014/12/10/20.
3) Fix new bugs, if any :)
Thanks!
Gerry

>
> Thanks,
>
> Thomas
>