Re: [Patch V3 19/37] x86, irq: introduce mechanisms to support dynamically allocate IRQ for IOAPIC

From: Jiang Liu
Date: Wed May 28 2014 - 01:01:15 EST


Hi Thomas,
Thanks for your comments. Please refer to inline
comments below.

On 2014/5/28 3:58, Thomas Gleixner wrote:
> Jiang,
>
> On Tue, 27 May 2014, Jiang Liu wrote:
>
>> +static int alloc_irq_from_domain(struct irq_domain *domain, u32 gsi, int pin)
>> {
>> + int irq = -1;
>> +
>> + if (gsi >= arch_dynirq_lower_bound(0)) {
>> + irq = irq_create_mapping(domain, pin);
>> + } else if (gsi < NR_IRQS_LEGACY) {
>> + if (!ioapic_identity_map)
>> + irq = irq_create_mapping(domain, pin);
>> + else if (irq_domain_associate(domain, gsi, pin) == 0)
>> + irq = gsi;
>> + } else if (irq_create_strict_mappings(domain, gsi, pin, 1) == 0) {
>> + irq = gsi;
>> + }
>
> So you have these cases covered here:
>
> 1) The ACPI case of secondary ioapics. You only have the strict 1:1
> mapping for the first ioapic
>
> 2) The gsi < NR_IRQS_LEGACY case where you have two options:
>
> a) Let the core create a random virq number if ioapic_identity_map
> is 0
>
> ioapic_identity_map is only set by SFI and devicetree
>
> So in all other cases we fall into that code path for all
> legacy interrupts. So how is that supposed to work lets say for
> i8042 which has hardcoded irq 1 and 12?
>
> irq_create_mapping(1)
>
> hint = 1 % nr_irqs; --> 1
> virq = irq_alloc_desc_from(hint, of_node_to_nid(domain->of_node));
>
> This returns something >= 16, because the irq descriptors
> for 0-15 (LEGACY) are allocated already.
>
> The pin association works, but how is the i8042 driver supposed
> to figure out that it should request the virq >=16 which was
> created instead of the hardcoded 1 ?
This is used to work around special non-ISA interrupts with GSI below
NR_IRQS_LEGACY. The original code for the special case is:
/*
* Provide an identity mapping of gsi == irq except on truly
* weird platforms that have non isa irqs in the first 16 gsis.
*/
return gsi >= NR_IRQS_LEGACY ? gsi : gsi_top + gsi;

We have one path to handle ISA IRQs before calling
alloc_irq_from_domain() as below:
if (idx >= 0 && test_bit(mp_irqs[idx].srcbus, mp_bus_not_pci))
return mp_irqs[idx].srcbusirq;

>
> b) Associate the gsi and the pin
>
> This only works because the virqs are already allocated at boot
> time unconditionally due to arch_probe_nr_irqs() returning
> NR_IRQS_LEGACY. So irq_domain_associate() works.
> Undocumented works by chance behaviour.
Yes. It's a good suggestion to enhance legacy_pic to make this
code more clear.

>
> 3) The case where gsi < arch_dynirq_lower_bound()
>
> You create a strict mapping here, fine.
>
> This is confusing at best.
>
> First of all, we should use legacy_pic->nr_legacy_irqs instead of
> NR_IRQS_LEGACY all over the place.
>
> mshyperv, ce4100 and intel-mid use the null_legacy_pic which has
> nr_legacy_irqs = 0 and everything else uses the real pic which has
> nr_legacy_irqs = NR_IRQS_LEGACY. So why do we even bother to allocate
> and deal with NR_IRQS_LEGACY in the cases where we have no legacy?
I'm not sure whether it works with ce4100, so used NR_IRQS_LEGACY
instead of legacy_pic->nr_legacy_irqs for safety. Will try to refine
it in next version.

>
> ce4100 is an oddball though. The ioapic is registered way before the
> interrupt subsystem is initialized and I have a hard time to
> understand that comment:
>
> /* We can't set this earlier, because we need to calibrate the timer */
> legacy_pic = &null_legacy_pic;
I haven't figured out the story behind the comment yet:(

>
> The timer calibration happens after the interrupts are set up. I
> assume it's check_timer() which wants that, but we know exactly how
> the ce4100 works, so we might be able to avoid that whole "testing"
> stuff. Sebastian, any input on this?
>
> If it turns out that ce4100 needs the inital real legacy pic for some
> magic reason we still can be clever by extending the legacy pic data
> structure to tell us about that change, i.e. instead of using
> legacy_pic->nr_legacy_irqs having a field "nr_allocated_irqs", which
> is set to NR_IRQS_LEGACY for the real pic and to 0 for the null_pic
> and let ce4100 set that field to NR_IRQS_LEGACY before switching the
> legacy_pic over to the null implementation.
Good suggestion, will try this way.

> But what's really disgusting is the magic ioapic_identity_map and the
> extra ACPI specific ioapic_dynirq_base hackery.
>
> Why do we need strict mappings in the non ACPI case for all ioapic
> pins? What's so different about ACPI? Or is this just to avoid
> breaking the existing SFI/devicetree stuff. If that's the reason I'm
> fine with it, but ...
It's to avoid breaking SFI/intel_mid stuff. intel_mid assumes IRQ
number equals to pin number and use pci_dev->irq to save both IRQ
number and pin number.

>
> independent of this question we want to be more clever about the
> handling of strict, legacy and freely associated linux irq numbers.
>
> So you have this weird ioapic_create_domain_fn callback in
> mp_register_ioapic, which is solely there so the different callers can
> hand in their domain ops and eventually set the magic
> ioapic_identity_map flag.
>
> +void __init mp_register_ioapic(int id, u32 address, u32 gsi_base,
> + ioapic_create_domain_fn cb, void *arg)
>
> What about having
>
> struct ioapic_domain {
> struct irqdomain *domain;
> const struct irq_domain_ops *ops;
> void *arg;
> enum domain_type type;
> };
>
> and add this struct to the ioapic struct. type is:
>
> enum domain_type {
> IOAPIC_STRICT,
> IOAPIC_LEGACY,
> IOAPIC_DYNAMIC,
> };
>
>
> Now the register function changes to:
>
> void mp_register_ioapic(int id, u32 address, u32 gsi_base,
> const struct irq_domain_ops *ops,
> int type)
> {
> ...
>
> ioapics[idx].irqdomain.ops = ops;
> ioapics[idx].irqdomain.arg = arg;
> ioapics[idx].irqdomain.type = type;
> ioapics[idx].irqdomain.domain = NULL;
>
> ...
> }
>
> and you can use mp_irqdomain_create() unconditionally for creating all
> domains. And there you do:
>
> static int dynirq_lower_bound;
>
> mp_irqdomain_create()
> {
> ioapic->irqdomain.domain = irq_domain_add_linear(...);
>
> switch (ioapic->irqdomain.type) {
> case IOAPIC_LEGACY:
> /*
> * Associate the legacy interrupts which have been
> * already allocated.
> */
> for (i = 0; i < legacy_pic->nr_allocated_irqs; i++)
> irq_domain_associate(domain, i, i);
>
> case IOAPIC_STRICT:
> dynirq_lower_bound += ioapic->nr_gsis;
>
> case IOAPIC_DYNAMIC:
> break;
> }
> }
>
> So arch_dynirq_lower_bound() gets simplified to:
> {
> return dynirq_lower_bound;
> }
>
> And alloc_irq_from_domain() becomes:
>
> int alloc_irq_from_domain(struct ioapic_domain *domain, int gsi, int pin)
> {
> switch (domain->type) {
> case IOAPIC_DYNAMIC:
> return irq_create_mapping(domain->domain, pin);
> case IOAPIC_LEGACY:
> case IOAPIC_STRICT:
> return irq_create_strict_mappings(domain, gsi, pin, 1);
> }
> }
>
> At the call site of alloc_irq_from_domain() you have already:
>
> irq = irq_find_mapping(domain, pin);
> if (irq <=0 ....)
> alloc_irq_from_domain(domain, gsi, pin);
>
> So because we associated the legacy_pic->nr_allocated_irqs in
> mp_irqdomain_create() already, you'll never call into
> alloc_irq_from_domain() for those and the remaining ones for that
> first ioapic are handled by the IOAPIC_STRICT fall through.
>
> For simplicity you can let SFI and devicetree register the ioapics
> with their specific domain ops plus IOAPIC_LEGACY for the first ioapic
> and IOAPIC_STRICT for all others. That also covers the case where the
> null_legacy_pic with legacy_pic->nr_allocated_irqs == 0 is used.
>
> In the ACPI case you register with the acpi domain ops and
> IOAPIC_LEGACY for the first and IOAPIC_DYNAMIC for the extra ioapics.
>
> Should be way cleaner and understandable, at least to me :)
Really good suggestions! Make thing much more clear.

>
> Now there is that last oddity which bugs me in mp_map_pin_to_irq()
>
> /*
> * Don't use irqdomain to manage ISA IRQs because there may be
> * multiple IOAPIC pins sharing the same ISA IRQ number and
> * irqdomain only supports 1:1 mapping between IOAPIC pin and
> * IRQ number.
> */
> if (idx >= 0 && test_bit(mp_irqs[idx].srcbus, mp_bus_not_pci)) {
> irq = mp_irqs[idx].srcbusirq;
> if ((flags & IOAPIC_MAP_ALLOC) && info->count == 0 &&
> mp_irqdomain_map(domain, irq, pin) != 0)
> irq = -1;
>
> That really looks like a hack. I'm aware that the current irqdomain
> code cannot deal with that oddball case.
>
> So what you are saying is that there are devices which have a separate
> physical wire to different ioapic pins, but the ioapic is supposed to
> bundle them to a shared interrupt.
>
> I agree that this is odd enough to handle at the ioapic level, but
> it'd be nice to have a more elaborative comment on this.
Will try to improve the comment.

>
> Aside of the above I'm pretty happy about the progress of this patch
> set. One thing, which needs to be looked at are the usage sites of
> irq_data->irq, whether they are safe. I did not spot any unsafe ones,
> but a few functions which are called with irq_data->irq and make no
> use of it.
Sure, will check usages of irq_data->irq.

Really appreciate you suggestions, it will improve the code a lot.
Thanks!
Gerry
>
> Thanks,
>
> tglx
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/