Re: Virtio-scsi multiqueue irq affinity
From: Ming Lei
Date: Mon Mar 25 2019 - 03:06:47 EST
On Mon, Mar 25, 2019 at 01:02:13PM +0800, Peter Xu wrote:
> On Sat, Mar 23, 2019 at 06:15:59PM +0100, Thomas Gleixner wrote:
> > Peter,
>
> Hi, Thomas,
>
> >
> > On Mon, 18 Mar 2019, Peter Xu wrote:
> > > I noticed that starting from commit 0d9f0a52c8b9 ("virtio_scsi: use
> > > virtio IRQ affinity", 2017-02-27) the virtio scsi driver is using a
> > > new way (via irq_create_affinity_masks()) to automatically initialize
> > > IRQ affinities for the multi-queues, which is different comparing to
> > > all the other virtio devices (like virtio-net, which still uses
> > > virtqueue_set_affinity(), which is actually, irq_set_affinity_hint()).
> > >
> > > Firstly, it will definitely broke some of the userspace programs with
> > > that when the scripts wanted to do the bindings explicitly like before
> > > and they could simply fail with -EIO now every time when echoing to
> > > /proc/irq/N/smp_affinity of any of the multi-queues (see
> > > write_irq_affinity()).
> >
> > Did it break anything? I did not see a report so far. Assumptions about
> > potential breakage are not really useful.
>
> It broke some automation scripts e.g. where they tried to bind CPUs to
> IRQs before staring IO but these scripts failed early during setup
> when trying to echo into the affinity procfs file. Actually I started
> to look into this because of such script breakage reported by QEs.
> Iinitially it was thought as a kernel bug but later we noticed that
> it's a change in policy.
>
> >
> > > Is there any specific reason to do it with the new way? Since AFAIU
> > > we should still allow the system admins to decide what to do for such
> > > configurations, .e.g., what if we only want to provision half of the
> > > CPU resources to handle IRQs for a specific virtio-scsi controller?
> > > We won't be able to achieve that with current policy. Or, could this
> > > be a question for the IRQ system (irq_create_affinity_masks()) in
> > > general? Any special considerations behind the big picture?
> >
> > That has nothing to do with the irq subsystem. That merily provides the
> > mechanisms.
> >
> > The reason behind this is that multi-queue devices set up queues per cpu or
> > if not enough queues are available queues per cpu groups. So it does not
> > make sense to move the interrupt away from the CPU or the CPU group.
> >
> > Aside of that in the CPU hotunplug case, interrupts used to be moved to the
> > online CPUs which resulted in problems for e.g. hibernation because on
> > large systems moving all interrupts to the boot CPU does not work due to
> > vector space exhaustion. Also CPU hotunplug is used for power management
> > purposes and there it does not make sense either to have the per cpu queues
> > of the offlined CPUs moved to the still online CPUs which then end up with
> > several queues.
> >
> > The new way to deal with this is to strictly bind per CPU (per CPU group)
> > queues. If the CPU or the last CPU in the group goes offline the following
> > happens:
> >
> > 1) The queue is disabled, i.e. no new requests can be queued
> >
> > 2) Wait for the outstanding requests to complete
> >
> > 3) Shut down the interrupt
> >
> > This avoids having multiple queues moved to the still online CPUs and also
> > prevents vector space exhaustion because the shut down interrupt does not
> > have to be migrated.
> >
> > When the CPU (or the first in the group) comes online again:
> >
> > 1) Reenable the interrupt
> >
> > 2) Reenable the queue
> >
> > Hope that helps.
>
> Thanks for explaining everything! It helps a lot, and yes it makes
> perfect sense to me.
>
> If no one reported any issue I think either the scripts are not
> checking the return code so they might fail silently but it might not
> matter much (e.g., if the only thing that a script wants to do is to
> spread the CPUs upon the IRQs then the script can simply cancel the
> setup procedure of this, and even failing of those echos won't affect
> much too), or they're just simpled fixed up later on. Now the only
> thing I am unsure about is whether there could be scenarios that we
> may not want to use the default policy to spread the cores.
>
> One thing I can think of is the real-time scenario where "isolcpus="
> is provided, then logically we should not allow any isolated CPUs to
> be bound to any of the multi-queue IRQs. Though Ming Lei and I had a
So far, this behaviour is made by user-space.
>From my understanding, IRQ subsystem doesn't handle "isolcpus=", even
though the Kconfig help doesn't mention irq affinity affect:
Make sure that CPUs running critical tasks are not disturbed by
any source of "noise" such as unbound workqueues, timers, kthreads...
Unbound jobs get offloaded to housekeeping CPUs. This is driven by
the "isolcpus=" boot parameter.
Yeah, some RT application may exclude 'isolcpus=' from some IRQ's
affinity via /proc/irq interface, and now it becomes not possible any
more to do that for managed IRQ.
> discussion offlist before and Ming explained to me that as long as the
> isolated CPUs do not generate any IO then there will be no IRQ on
> those isolated (real-time) CPUs at all. Can we guarantee that? Now
It is only guaranteed for 1:1 mapping.
blk-mq uses managed IRQ's affinity to setup queue mapping, for example:
1) single hardware queue
- this queue's IRQ affinity includes all CPUs, then the hardware queue's
IRQ is only fired on one specific CPU for IO submitted from any CPU
2) multi hardware queue
- there are N hardware queues
- for each hardware queue i(i < N), its IRQ's affinity may include N(i) CPUs,
then IRQ for this hardware queue i is fired on one specific CPU among N(i).
Thanks,
Ming