Re: [PATCH v5 02/33] genirq: Add irq_alloc_reserved_desc()

From: Thomas Gleixner
Date: Wed Jan 22 2014 - 19:03:51 EST


Yinghai,

On Thu, 2 Jan 2014, Yinghai Lu wrote:

> For ioapic hot-add support, it would be easier if we had contiguous
> irq numbers for a hot-added ioapic controller.

I really don't care about easy. Easy to solve problems are for
wimps.

What you really want to say is that ioapic hot-add support requires a
contiguous irq number space for a hotplugged ioapic to avoid expensive
translations in the ioapic hotplug code.

That's a proper reason for making that change to the core code.

> We can reserve an irq range first, and later allocate descriptors for
> the pre-reserved irqs when they are needed.
>
> The reasons for not allocating them at reservation time:
> 1. only a few pins of an ioapic are used; allocating descriptors for
> all pins would waste memory on unused pins.
> 2. allocating later, when actually needed, ensures the irq_desc is
> allocated from local node RAM, as dev->node is set at that point.
>
> -v2: update changelog by adding reasons, as requested by Konrad.
> -v3: per tglx: separate the core code change from the arch code change.

Thanks for splitting the patches!

Now the scope of this change becomes more obvious and what I already
suspected before becomes crystal clear.

The initial intention of irq_reserve_irqs() was to cope with legacy
interrupts and prevent the dynamic allocator from handing them out,
but it was at best a misnomer, if not an outright misconception.

Did you notice that? No!

Did you even think about why irq_reserve_irqs() exists? No!

You just hacked it into submission for your purpose. As usual, sigh!

What prevents a user of __irq_alloc_reserved_desc() from requesting
something completely out of its range? Nothing, as you happily return
an existing interrupt via:

+ if (irq_to_desc(irq))
+ return irq;

which is true for every already existing interrupt. So some random
off-by-one is going to cause a spurious and extremely hard to debug
issue. Brilliant.
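
At the very least the requested irq has to be validated against the
reserved range before that check means anything, e.g. (illustration
only, 'res' being whatever record describes the reservation):

        /* Without this, any already existing irq number passes */
        if (irq < res->start || irq >= res->start + res->cnt)
                return -EINVAL;

        if (irq_to_desc(irq))
                return irq;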

No, we are not going to play the "it works for Yinghai" game again. I
wasted enough time with that already.

There is a clear step-by-step approach to get this done properly:

1) Get rid of the existing misconception/misnomer of
irq_reserve_irqs().

Make it explicit that this is dealing with legacy irq spaces. It's
not that hard, as there are only two in-tree users, both of which are
trivial to fix.
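
I.e. something along the lines of (the name is merely a suggestion):

        /*
         * Explicitly legacy only: keep the legacy irq space away from
         * the dynamic allocator, nothing else.
         */
        int irq_reserve_legacy_irqs(unsigned int from, unsigned int cnt);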

2) Provide a proper reservation mechanism which does not piggyback
blindly on the allocation bitmap.

So what you want is a reservation which:

A) Marks the irq range in the allocation bitmap

This prevents other code paths from stomping on that range.

B) Stores a unique generated ID in a separate radix tree for that
particular irq range.

The generated ID is returned to the caller as it is required
for actually allocating an interrupt from that range.

We don't have to bother with making this conditional as the
initial memory consumption of the radix tree is minimal and we
only expand it when we actually use that hotplug feature.
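
Roughly along these lines -- the names and details are merely a sketch
of the idea, sitting next to the existing allocator internals
(sparse_irq_lock, allocated_irqs) in kernel/irq/irqdesc.c, not a final
interface:

#include <linux/radix-tree.h>
#include <linux/slab.h>

/* One record per reserved irq range, keyed by a generated id */
struct irq_reservation {
        unsigned int    start;  /* first irq of the range */
        unsigned int    cnt;    /* number of reserved irqs */
};

static RADIX_TREE(irq_reservations, GFP_KERNEL);
static unsigned int irq_reservation_id;

int irq_reserve_irq_range(unsigned int from, unsigned int cnt)
{
        struct irq_reservation *res;
        int id, ret;

        res = kzalloc(sizeof(*res), GFP_KERNEL);
        if (!res)
                return -ENOMEM;
        res->start = from;
        res->cnt = cnt;

        mutex_lock(&sparse_irq_lock);
        /* B: store the record under a newly generated id */
        id = ++irq_reservation_id;
        ret = radix_tree_insert(&irq_reservations, id, res);
        if (!ret) {
                /*
                 * A: mark the range so the dynamic allocator stays
                 * away from it. A real version would verify first
                 * that the range is actually free.
                 */
                bitmap_set(allocated_irqs, from, cnt);
        }
        mutex_unlock(&sparse_irq_lock);

        if (ret) {
                kfree(res);
                return ret;
        }
        /* The id is what the caller hands back for the allocation */
        return id;
}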

3) Provide a proper alloc_reserved_irqdesc() function

This function verifies against the reservation ID which was
handed out by the reservation function.

It's questionable whether we want to allow the reuse of already
allocated irq descriptors. I'm leaning towards avoiding that. See #4
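
So the allocation side verifies both the id and the range before it
touches anything. A sketch, reusing the structures from above and the
existing internal alloc_descs() helper:

int alloc_reserved_irqdesc(int id, unsigned int irq, int node)
{
        struct irq_reservation *res;

        mutex_lock(&sparse_irq_lock);
        res = radix_tree_lookup(&irq_reservations, id);

        /* Reject unknown ids and anything outside the reserved range */
        if (!res || irq < res->start || irq >= res->start + res->cnt) {
                mutex_unlock(&sparse_irq_lock);
                return -EINVAL;
        }

        /* No reuse of already allocated descriptors */
        if (irq_to_desc(irq)) {
                mutex_unlock(&sparse_irq_lock);
                return -EEXIST;
        }
        mutex_unlock(&sparse_irq_lock);

        /*
         * The bits in allocated_irqs are already set by the
         * reservation, so only the descriptor itself is allocated
         * here, on the node the caller asked for.
         */
        return alloc_descs(irq, 1, node, NULL);
}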

4) Provide a proper mechanism to free the registered irq descriptors
and the reservation range when the physical device is removed
from the system. So you don't have to preserve state in the
ioapic code. Physical hotplug is not a high frequency hotpath
operation.
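
And the counterpart for physical removal, dropping both the
descriptors and the reservation (again only a sketch, using the names
from above and the internal free_desc() helper):

void free_reserved_irq_range(int id)
{
        struct irq_reservation *res;
        unsigned int irq;

        mutex_lock(&sparse_irq_lock);
        res = radix_tree_delete(&irq_reservations, id);
        if (res) {
                /* Free whatever descriptors were actually allocated */
                for (irq = res->start; irq < res->start + res->cnt; irq++) {
                        if (irq_to_desc(irq))
                                free_desc(irq);
                }
                /* Hand the range back to the dynamic allocator */
                bitmap_clear(allocated_irqs, res->start, res->cnt);
        }
        mutex_unlock(&sparse_irq_lock);
        kfree(res);
}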

5) Modify the x86 ioapic code to always use the reserve-first,
allocate-later mechanism to avoid ifdeffery and pointless conditional
code paths. That also ensures proper test coverage.
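
That way the flow is identical whether the ioapic was there at boot or
hotplugged later. Roughly, with made-up names on the ioapic side and
the sketched functions from above:

/* At ioapic registration: reserve a contiguous range covering all pins */
ioapic->irq_res_id = irq_reserve_irq_range(ioapic->gsi_base,
                                           ioapic->nr_pins);

/*
 * When a device behind that ioapic actually uses a pin: allocate the
 * descriptor on the node of the requesting device.
 */
ret = alloc_reserved_irqdesc(ioapic->irq_res_id, ioapic->gsi_base + pin,
                             dev_to_node(&pdev->dev));

/* On physical removal: descriptors and reservation go away together */
free_reserved_irq_range(ioapic->irq_res_id);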

TBH, I could not be bothered to look at your x86-related changes, but
I expect they are from the "make it work for Yinghai" department as
well. I'll review them once the core code changes are in acceptable
shape.

Thanks,

tglx