Re: Kernel-managed IRQ affinity (cont)

From: Ming Lei
Date: Thu Dec 19 2019 - 11:11:42 EST


On Thu, Dec 19, 2019 at 09:32:14AM -0500, Peter Xu wrote:
> On Thu, Dec 19, 2019 at 04:28:19PM +0800, Ming Lei wrote:
> > Hi Peter,
>
> Hi, Ming,
>
> >
> > On Mon, Dec 16, 2019 at 02:57:12PM -0500, Peter Xu wrote:
> > > Hi, Thomas,
> > >
> > > (Sorry I must have lost the discussion during an email migration, so
> > > I'll start with a new one)
> > >
> > > This is a continued discussion of previous one on kernel managed IRQ
> > > affinity [1]. I think at that time the conclusion is that we don't
> > > have a usage scenario to change current policy [2]. However recently
> > > I noticed that it is probably a very fundamental requirement for some
> > > real-time scenarios, even when there's no multi-queue involved.
> > >
> > > In my test case, it was a very common realtime guest with 10 vcpus,
> > > 0-1 are housekeeping vcpus, 2-9 are realtime vcpus. The guest has one
> > > virtio-blk device as boot disk. With a distribution very close to
> > > latest upstream, we can observe high spikes, probably due to the IRQs.
> > >
> > > To guarantee realtime responsiveness, we need to make sure the IRQs
> > > will be managable, say, when I run a real-time workload on vcpu9, we
> > > should be able to move all the IRQs from vcpu9 to the other vcpus
> > > (most probably vcpu0 and vcpu1). However with the kernel managed IRQs
> > > we can't echo to /proc/irq/N/smp_affinity. Here, vcpu9 gets IRQ 38
> > > from the virtio-blk device:
> > >
> > > # cat /proc/interrupts | grep -w 38
> > > 38: 0 0 0 0 0 0 0 0 0 15206 PCI-MSI 2621441-edge virtio2-req.0
> > > # cat /proc/irq/38/smp_affinity
> > > 3ff
> > > # cat /proc/irq/38/effective_affinity
> > > 200
> > >
> > > Meanwhile, I don't think there's anything special for VMs, so this
> > > issue should exist even for hosts as long as the IRQ is managed in the
> > > same way here as the virtio-blk device.
> > >
> > > As Ming has mentioned in previous discussions [3], I think it would be
> > > at least good if the kernel IRQ system can respect "irqaffinity=" when
> > > assigning IRQs to the cores. Currently it's not. What would you
> > > suggest in this case? Do you think this is a valid user scenario?
> > >
> > > Thanks,
> > >
> > > [1] https://lkml.org/lkml/2019/3/18/15
> > > [2] https://lkml.org/lkml/2019/3/25/562
> > > [3] https://lkml.org/lkml/2019/3/25/308
> >
> > The following patch supposes to implementation the requirement for you,
> > can you test it by passing 'isolcpus=managed_irq,X-Y'?
>
> I really appreciate your patch! I'll keep this version, while before
> I start to test it...
>
> >
> > With this kind of change, you can't run any IO from any isolated
> > CPU core, otherwise, unpredictable error may be triggered, either oops or
> > IO hang.
>
> ... I'm not sure whether this can be acceptable for a production
> environment.
>
> In our case, the IRQ should come from virtio-blk which is the root
> disk, so I assume even the RT core could use it at least when loading
> the executable into RAM. So...
>
> >
> > Another conservative approach is to only select effective CPU from
> > non-isolated cpus, however, the assigned CPUs may not be balanced among
> > interrupt vectors. But it is safer, since the system still works even if
> > someone submits IO from any isolated cpu core.
>
> ... this one seems to be more appealing at least to me.

OK, please try the following patch:


diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 6c8512d3be88..0fbcbacd1b29 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -13,6 +13,7 @@ enum hk_flags {
HK_FLAG_TICK = (1 << 4),
HK_FLAG_DOMAIN = (1 << 5),
HK_FLAG_WQ = (1 << 6),
+ HK_FLAG_MANAGED_IRQ = (1 << 7),
};

#ifdef CONFIG_CPU_ISOLATION
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 1753486b440c..0a75a09cc4e8 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -20,6 +20,7 @@
#include <linux/sched/task.h>
#include <uapi/linux/sched/types.h>
#include <linux/task_work.h>
+#include <linux/sched/isolation.h>

#include "internals.h"

@@ -212,12 +213,33 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
{
struct irq_desc *desc = irq_data_to_desc(data);
struct irq_chip *chip = irq_data_get_irq_chip(data);
+ const struct cpumask *housekeeping_mask =
+ housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
int ret;
+ cpumask_var_t tmp_mask;

if (!chip || !chip->irq_set_affinity)
return -EINVAL;

- ret = chip->irq_set_affinity(data, mask, force);
+ if (!zalloc_cpumask_var(&tmp_mask, GFP_KERNEL))
+ return -EINVAL;
+
+ /*
+ * Userspace can't change managed irq's affinity, make sure
+ * that isolated CPU won't be selected as the effective CPU
+ * if this irq's affinity includes both isolated CPU and
+ * housekeeping CPU.
+ *
+ * This way guarantees that isolated CPU won't be interrupted
+ * by IO submitted from housekeeping CPU.
+ */
+ if (irqd_affinity_is_managed(data) &&
+ cpumask_intersects(mask, housekeeping_mask))
+ cpumask_and(tmp_mask, mask, housekeeping_mask);
+ else
+ cpumask_copy(tmp_mask, mask);
+
+ ret = chip->irq_set_affinity(data, tmp_mask, force);
switch (ret) {
case IRQ_SET_MASK_OK:
case IRQ_SET_MASK_OK_DONE:
@@ -229,6 +251,8 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
ret = 0;
}

+ free_cpumask_var(tmp_mask);
+
return ret;
}

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 9fcb2a695a41..008d6ac2342b 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -163,6 +163,12 @@ static int __init housekeeping_isolcpus_setup(char *str)
continue;
}

+ if (!strncmp(str, "managed_irq,", 12)) {
+ str += 12;
+ flags |= HK_FLAG_MANAGED_IRQ;
+ continue;
+ }
+
pr_warn("isolcpus: Error, unknown flag\n");
return 0;
}

--
Ming