Re: vector space exhaustion on 4.14 LTS kernels

From: Thomas Gleixner
Date: Wed Nov 21 2018 - 08:26:27 EST


Josh,

On Mon, 19 Nov 2018, Josh Hunt wrote:
> We have a class of machines that appear to be exhausting the vector space on
> cpus 0 and 1 which causes some breakage later on when trying to set the
> affinity. The boxes are running the 4.14 LTS kernel.
>
> [ 39.531385] __assign_irq_vector: irq:512 cpu:128 searched:00,00000001
> vector:00,00000000 continue
> [ 39.531386] apic_set_affinity: irq:512 mask:00,00000001 err:-28
>
> The affinity values:
>
> root@xxxxxxxxxxxxx:/proc/irq/512# grep . *
> affinity_hint:00,00000001
> effective_affinity:00,00000004
> effective_affinity_list:2
> grep: mlx5_comp0@pci:0000:65:00.1: Is a directory
> node:0
> smp_affinity:ff,ffffffff
> smp_affinity_list:0-39
> spurious:count 3
> spurious:unhandled 0
> spurious:last_unhandled 0 ms
>
> I noticed your change, a0c9259dc4e1 "irq/matrix: Spread interrupts on
> allocation", and this sounds like what we're hitting. Booting 4.19 does not
> have this problem. I haven't booted 4.15 yet, but can do it to confirm the
> above commit is what resolves this.

Might be, but in 4.15 the while vector allocation got rewritten. One of the
reasons was the exhaustion issue. Some of that is caused by massive over
allocation by certain device drivers. The new allocator mechanism handles
that way better.

> Since 4.14 doesn't have the matrix allocator it's not a trivial backport. I
> was wondering a) if you agree with my assessment and b) if there's any plans
> on resolving this on the 4.14 allocator? If not I can attempt to backport the
> idea to 4.14 to spread the interrupts around on allocation.

No plans. Good luck with trying to fix that on the 4.14 code. I'd recommend
to switch to 4.19 LTS :)

Thanks,

tglx