Re: Kernel-managed IRQ affinity (cont)

From: Peter Xu
Date: Thu Dec 19 2019 - 09:32:21 EST


On Thu, Dec 19, 2019 at 04:28:19PM +0800, Ming Lei wrote:
> Hi Peter,

Hi, Ming,

>
> On Mon, Dec 16, 2019 at 02:57:12PM -0500, Peter Xu wrote:
> > Hi, Thomas,
> >
> > (Sorry I must have lost the discussion during an email migration, so
> > I'll start with a new one)
> >
> > This continues the previous discussion on kernel-managed IRQ
> > affinity [1]. I think the conclusion at that time was that we don't
> > have a usage scenario that justifies changing the current policy [2].
> > However, I recently noticed that this is probably a very fundamental
> > requirement for some real-time scenarios, even when there's no
> > multi-queue involved.
> >
> > My test case is a very common real-time guest with 10 vcpus: 0-1 are
> > housekeeping vcpus and 2-9 are real-time vcpus. The guest has one
> > virtio-blk device as its boot disk. With a distribution very close to
> > latest upstream, we can observe high latency spikes, probably due to
> > the IRQs.
> >
> > To guarantee real-time responsiveness, we need to make sure the IRQs
> > are manageable. Say, when I run a real-time workload on vcpu9, we
> > should be able to move all the IRQs from vcpu9 to the other vcpus
> > (most probably vcpu0 and vcpu1). However, with kernel-managed IRQs
> > we can't change the affinity by writing to /proc/irq/N/smp_affinity.
> > Here, vcpu9 gets IRQ 38 from the virtio-blk device:
> >
> > # cat /proc/interrupts | grep -w 38
> > 38: 0 0 0 0 0 0 0 0 0 15206 PCI-MSI 2621441-edge virtio2-req.0
> > # cat /proc/irq/38/smp_affinity
> > 3ff
> > # cat /proc/irq/38/effective_affinity
> > 200
> >
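
For readers less used to these masks: both files hold hexadecimal CPU
bitmasks, where bit N stands for CPU N. Below is a minimal userspace
sketch that decodes them. The helper name mask_to_cpus is made up for
illustration, and the comma handling assumes the usual 32-bit
comma-separated grouping used for larger masks.

```python
def mask_to_cpus(mask_hex: str) -> list[int]:
    """Return the CPU numbers whose bits are set in a hex affinity mask.

    Hypothetical helper for illustration only; it is not part of any
    kernel or tooling API.
    """
    # Assumption: larger masks are printed as comma-separated,
    # zero-padded 32-bit groups, so the commas can simply be dropped.
    value = int(mask_hex.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


# The two masks quoted above for IRQ 38.
print(mask_to_cpus("3ff"))  # smp_affinity: allowed CPUs
print(mask_to_cpus("200"))  # effective_affinity: CPU actually targeted
```

So 3ff allows vcpus 0-9, while the effective target 200 is vcpu9 alone,
which matches the /proc/interrupts counters.
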
> > Meanwhile, I don't think there's anything special about VMs, so this
> > issue should exist even on hosts, as long as the IRQ is managed in
> > the same way as for the virtio-blk device here.
> >
> > As Ming has mentioned in previous discussions [3], I think it would
> > at least be good if the kernel IRQ system could respect
> > "irqaffinity=" when assigning IRQs to the cores. Currently it does
> > not. What would you suggest in this case? Do you think this is a
> > valid user scenario?
> >
> > Thanks,
> >
> > [1] https://lkml.org/lkml/2019/3/18/15
> > [2] https://lkml.org/lkml/2019/3/25/562
> > [3] https://lkml.org/lkml/2019/3/25/308
>
> The following patch is supposed to implement the requirement for you;
> can you test it by passing 'isolcpus=managed_irq,X-Y'?

I really appreciate your patch! I'll keep this version, but before
I start to test it...

>
> With this kind of change, you can't run any IO from any isolated
> CPU core; otherwise, unpredictable errors may be triggered, either an
> oops or an IO hang.

... I'm not sure whether this would be acceptable for a production
environment.

In our case, the IRQ comes from virtio-blk, which is the root disk,
so I assume even the RT core could use it, at least when loading
the executable into RAM. So...

>
> Another, more conservative approach is to select the effective CPU
> only from non-isolated CPUs; however, the assigned CPUs may then not be
> balanced among interrupt vectors. But it is safer, since the system
> still works even if someone submits IO from any isolated CPU core.

... this one seems to be more appealing at least to me.
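
As a rough userspace model of that conservative approach (not the
actual patch): keep the vector's affinity mask as is, but pick the
effective CPU from the non-isolated CPUs in it whenever there is one.
The function name pick_effective_cpu, the fallback to an isolated CPU
when the mask has no housekeeping CPU, and the "lowest CPU wins" rule
are all assumptions made for this sketch.

```python
def pick_effective_cpu(affinity: int, isolated: int) -> int:
    """Pick one target CPU from an affinity bitmask, preferring
    non-isolated CPUs. Hypothetical model, not kernel code.

    affinity: bitmask of CPUs the vector may target (bit N = CPU N)
    isolated: bitmask of isolated CPUs (e.g. the real-time vcpus)
    """
    if affinity == 0:
        raise ValueError("affinity mask must contain at least one CPU")

    housekeeping = affinity & ~isolated
    # Assumed fallback: if every CPU in the mask is isolated, still use
    # one of them so the vector keeps a valid target and IO completes.
    candidates = housekeeping if housekeeping else affinity

    # Assumed policy: take the lowest-numbered candidate CPU.
    lowest_bit = candidates & -candidates
    return lowest_bit.bit_length() - 1


ISOLATED = 0x3FC  # vcpus 2-9 from the setup described above

# Mask spanning all 10 vcpus: the IRQ lands on a housekeeping vcpu.
print(pick_effective_cpu(0x3FF, ISOLATED))

# Mask containing only vcpu9: no housekeeping CPU, so fall back to it.
print(pick_effective_cpu(0x200, ISOLATED))
```

With the masks from my guest this selects vcpu0 for IRQ 38 instead of
vcpu9, and the second call shows why balance across vectors can suffer:
several vectors may end up on the same few housekeeping CPUs.
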

Thanks,

--
Peter Xu