RE: [PATCH] Choose CPU based on allocated IRQs

From: Long Li
Date: Tue Oct 30 2018 - 13:45:01 EST


> Subject: Re: [PATCH] Choose CPU based on allocated IRQs
>
> Long,
>
> On Tue, 23 Oct 2018, Long Li wrote:
>
> thanks for this patch.
>
> A trivial formal thing ahead. The subject line
>
> [PATCH] Choose CPU based on allocated IRQs
>
> is lacking a proper subsystem prefix. In most cases you can figure the prefix
> out by running 'git log path/to/file' which in this case will show you that most
> commits touching this file use the prefix 'genirq/matrix:'.
>
> So the proper subject would be:
>
> [PATCH] genirq/matrix: Choose CPU based on allocated IRQs
>
> Subsystem prefixes are important to see where a patch belongs to right from
> the subject. Without that it could belong to any random part of the kernel
> and needs further inspection of the patch itself. This applies to both email
> and to git shortlog listings.

Thank you. I will send v2 to address this.

>
> > From: Long Li <longli@xxxxxxxxxxxxx>
> >
> > In irq_matrix, "available" is set when IRQs are allocated earlier in
> > the IRQ assigning process.
> >
> > Later, when IRQs are activated those values are not good indicators of
> > what CPU to choose to assign to this IRQ.
>
> Can you please explain why you think that available is the wrong indicator
> and which problem you are trying to solve?
>
> The WHY is really the most important part of a changelog.

The problem I'm seeing is that on a very large system with multiple devices of the same class (e.g. NVMe disks, using managed IRQs), they tend to use interrupts on several CPUs on the system. Under heavy load, those several CPUs are busy while other CPU are most idling. The issue is that when NVMe call irq_matrix_alloc_managed(), the assigned the CPU is always the first CPU in the cpumask, because they check for cpumap->available that will not change after managed IRQs are reserved in irq_matrix_reserve_managed (which was called from the 1st stage of IRQ setup in irq_domain_ops->alloc).

>
> > Change to choose CPU for an IRQ based on how many IRQs are already
> > allocated on this CPU.
>
> Looking deeper. The initial values are:
>
> available = alloc_size - (managed + systembits)
> allocated = 0
>
> There are two distinct functionalities which modify 'available' and 'allocated'
> (omitting the reverse operations for simplicity):
>
> 1) managed interrupts
>
> reserve_managed()
> managed++;
> available--;
>
> alloc_managed()
> allocated++;
>
> 2) regular interrupts
>
> alloc()
> allocated++;
> available--;
>
> So 'available' can be lower than 'allocated' depending on the number of
> reserved managed interrupts, which have not yet been activated.
>
> So for all regular interrupts we really want to look at the number of 'available'
> vectors because the reserved managed ones are already accounted there
> and they need to be taken into account.

I think "reserved managed" may not always be accurate. Reserved managed IRQs may not always get activated. For an irq_data, when irq_matrix_reserve_managed is called, all the CPUs in the cpumask are reserved. Later, only one of them is activated via the call to irq_matrix_alloc_managed(). So we end up with a number of "reserved managed" that never get used.

>
> For the spreading of managed interrupts in alloc_managed() that's indeed a
> different story and 'allocated' is more correct. But even that is not completely
> accurate and can lead to the wrong result. The accurate solution would be to
> account the managed _and_ allocated vectors separately and do the
> spreading for managed interrupts based on that.

I think checking for "allocated" is the best approach for picking which CPU to assign for a given irq_data, since we really can't rely on "managed" to decide how busy this CPU really is. Checking for "allocated" should work for both unmanaged and managed IRQs.

>
> Thanks,
>
> tglx