RE: Affinity managed interrupts vs non-managed interrupts

From: Kashyap Desai
Date: Mon Sep 03 2018 - 02:10:59 EST


> > It is not yet finalized, but it can be based on per sdev outstanding,
> > shost_busy etc.
> > We want to use a special set of 16 reply queues for IO acceleration
> > (these queues work in interrupt coalescing mode; this is a h/w
> > feature).
>
> This part is very key to your approach, so I'd suggest finalizing it
> first. That said, this way doesn't make sense if you can't figure out
> one doable approach to decide when to use the coalescing mode and when
> to use the regular 72 reply queues.
This is almost finalized, but it is going through testing and it may take
some time to review all the output.
At a very high level -
If the scsi device is a Virtual Disk, the driver will count each physical
disk as a data arm, and the condition required to use the io acceleration
(interrupt coalescing) path is that the outstanding count for the sdev
should be more than 8 * data_arms. Using this method we will not impact
latency-sensitive workloads.
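
A minimal sketch of that check, in kernel-style C (the structure and
field names here are illustrative assumptions, not the actual
megaraid_sas data structures):

#include <linux/atomic.h>
#include <linux/types.h>

/* Hypothetical per-device state; names are illustrative only. */
struct io_accel_state {
	atomic_t outstanding;	/* in-flight IOs on this scsi device */
	u16 data_arms;		/* physical disks behind a Virtual Disk,
				 * 1 for a bare physical device */
};

/* Take the io acceleration (interrupt coalescing) path only when the
 * device is busy enough that coalescing will not hurt per-IO latency.
 */
static bool use_coalescing_path(struct io_accel_state *s)
{
	return atomic_read(&s->outstanding) > 8 * s->data_arms;
}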

>
> If it is just for IO acceleration, why not always use the coalescing
> mode?

Ming, we attempted all possible approaches. Let me summarize.

If we use interrupt coalescing on *all* queues, single-worker and lower
queue depth profiles are impacted, and a latency drop of up to 20% is
seen.

>
> >
> > >
> > > Frankly speaking, you may reuse the 72 reply queues to do interrupt
> > > coalescing by configuring one extra register to enable the
> > > coalescing mode, and you may just use a small part of the 72 reply
> > > queues under the interrupt coalescing mode.
> > Our h/w can set interrupt coalescing per 8 reply queues, so the
> > smallest unit is 8. If we choose to take 8 reply queues from the
> > existing 72 reply queues (without asking for extra reply queues), we
> > still have an issue on systems with more numa nodes. Example - on an
> > 8 numa node system, each node will have only *one* reply queue for
> > effective interrupt coalescing (since the irq subsystem will spread
> > msix vectors per numa node).
> >
> > To keep things scalable we cherry-picked a few reply queues and wanted
> > them to be out of the cpu-msix mapping.
>
> I mean you can group the reply queues according to each queue's numa
> node info, given the mapping has already been figured out by the genirq
> affinity code.

I am not able to follow you. I replied to Thomas on the same topic. Does
that reply clarify things, or am I still missing something?

>
> >
> > >
> > > Or you can learn from SPDK and use one or a small number of
> > > dedicated cores or kernel threads to poll the interrupts from all
> > > reply queues; then I guess you may benefit much compared with the
> > > extra 16 queue approach.
> > Problem with polling - it requires a steady completion rate, otherwise
> > the prediction in the driver gives different results on different
> > profiles.
> > We attempted irq-poll and threaded-ISR based polling, but each has
> > pros and cons. One key goal of the method we are trying is not to
> > impact latency for lower QD workloads.
>
> Interrupt coalescing should affect latency too [1], or could you share
> your idea of how to use interrupt coalescing to address the latency
> issue?
>
> "Interrupt coalescing, also known as interrupt moderation,[1] is a
> technique in which events which would normally trigger a hardware
> interrupt are held back, either until a certain amount of work is
> pending, or a timeout timer triggers."[1]
>
> [1] https://en.wikipedia.org/wiki/Interrupt_coalescing

That is correct. We are not going to use 100% interrupt coalescing, to
avoid the latency impact. We will have two sets of queues; you can
consider this hybrid interrupt coalescing.
For the 72 logical cpu case, we will allocate 88 (72 + 16) reply queues
(msix indexes). Only the first 16 reply queues will be configured in
interrupt coalescing mode (this is a special h/w feature); the remaining
72 reply queues are without any interrupt coalescing. The 72 reply queues
have a 1:1 cpu-msix mapping, and the 16 reply queues are mapped to the
local numa node.
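
As a rough illustration of that layout (a sketch only, not the actual
driver code; as noted further down in this thread, the driver really
uses pci_enable_msix_range() plus irq_set_affinity_hint()), the same
split could be expressed with the standard pre_vectors mechanism:

#include <linux/pci.h>
#include <linux/interrupt.h>

#define COALESCING_QUEUES	16	/* h/w interrupt coalescing queues */
#define REGULAR_QUEUES		72	/* 1:1 cpu-to-msix reply queues */

/* Sketch: reserve the first 16 vectors as pre_vectors so the genirq
 * affinity code spreads only the remaining 72 across CPUs; the driver
 * would then steer the 16 coalescing vectors to the local numa node
 * itself (as discussed below, pre_vectors otherwise end up with
 * effective affinity on CPU 0).
 */
static int alloc_reply_queue_vectors(struct pci_dev *pdev)
{
	struct irq_affinity affd = {
		.pre_vectors = COALESCING_QUEUES,
	};
	int nvec = COALESCING_QUEUES + REGULAR_QUEUES;	/* 88 */

	return pci_alloc_irq_vectors_affinity(pdev, nvec, nvec,
			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
}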

As explained above, the per scsi device outstanding count is the key
factor for routing io either to the queues with interrupt coalescing or
to the regular queues (without interrupt coalescing).
Example -
If there are sync IO requests on a scsi device (one IO at a time), the
driver will keep posting those IOs to the queues without any interrupt
coalescing. If there are more than 8 outstanding ios per scsi device, the
driver will post those ios to the reply queues with interrupt coalescing.
This particular group of ios will not see a latency impact because the
coalescing depth is the key factor that flushes the ios. There can be
some corner-case workloads where a latency impact is theoretically
possible, but having more scsi devices doing active io submission will
close that loop, and we do not suspect those cases will need any special
treatment. In fact, this solution is meant to provide reasonable latency
+ higher iops for most cases, and if there is some deployment which needs
tuning, it is still possible to disable this feature. We really want to
deal with those scenarios on a case by case basis (through firmware
settings).
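
A hedged sketch of that selection, reusing the constants and the
use_coalescing_path() helper from the sketches above (the index layout -
coalescing queues first, then the 1:1 queues - is an assumption for
illustration):

#include <linux/smp.h>

/* Sketch: pick the reply queue (msix index) at IO submission time.
 * Assumed layout: indices 0..15 are the h/w coalescing queues (local
 * numa node), 16..87 are the regular 1:1 cpu-to-msix queues.
 */
static u16 select_reply_queue(struct io_accel_state *s)
{
	/* busy device: spread across the 16 coalescing queues */
	if (use_coalescing_path(s))
		return raw_smp_processor_id() % COALESCING_QUEUES;

	/* low outstanding count: stay on this cpu's regular queue so
	 * sync/low-QD latency is untouched
	 */
	return COALESCING_QUEUES + raw_smp_processor_id();
}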


>
> > I posted an RFC at
> > https://www.spinics.net/lists/linux-scsi/msg122874.html
> >
> > We have done an extensive study and concluded that using interrupt
> > coalescing is better if the h/w can manage two different modes
> > (coalescing on/off).
>
> Could you explain a bit why coalescing is better?

Actually we are doing hybrid coalescing. You are correct that there is no
single answer here; there are pros and cons either way.
For such hybrid coalescing we need h/w support.

>
> In theory, interrupt coalescing just moves the implementation into
> hardware. And the IOs submitted from the same coalescing group are
> usually unrelated. The same problem you found in polling should exist
> in coalescing too.

Coalescing, whether in software or hardware, is a best-effort mechanism,
and there is no steady snapshot of submission and completion in either
case.

One of the problems with coalescing/polling in an OS driver is that
irq-poll works in interrupt context, and waiting while polling consumes
more CPU because the driver has to run a predictive loop. At the same
time, the driver should quit after some number of completions to give
fairness to other devices. A threaded interrupt can resolve the
cpu-hogging issue, but then we are moving our key interrupt processing to
a threaded context, so fairness will be compromised. In the case of
threaded-interrupt polling, we may be impacted if interrupts of other
devices land on the same cpu where the threaded isr is running. And if
the polling logic in the driver does not work well on different systems,
we will see the extra penalty of the disable/enable interrupt calls (see
the sketch below). This particular problem is not a concern if the h/w
does the interrupt coalescing.
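
For reference, a minimal irq-poll sketch of the pattern described above
(the my_hw_* helpers are hypothetical device callbacks, not a real API);
the disable/enable round trip at the end of the poll function is exactly
the penalty mentioned:

#include <linux/interrupt.h>
#include <linux/irq_poll.h>

/* Hypothetical device helpers, assumed for illustration. */
extern void my_hw_irq_disable(void);
extern void my_hw_irq_enable(void);
extern int my_hw_process_completions(int max);	/* returns #completed */

static struct irq_poll my_iop;

static irqreturn_t my_hard_irq(int irq, void *data)
{
	my_hw_irq_disable();		/* mask further device interrupts */
	irq_poll_sched(&my_iop);	/* continue polling in softirq */
	return IRQ_HANDLED;
}

/* Called with a budget for fairness; completing fewer than the budget
 * means the queue is drained, so stop polling and pay the
 * disable/enable interrupt cost.
 */
static int my_poll(struct irq_poll *iop, int budget)
{
	int done = my_hw_process_completions(budget);

	if (done < budget) {
		irq_poll_complete(iop);
		my_hw_irq_enable();	/* re-arm the interrupt */
	}
	return done;
}

/* setup, e.g. at probe time: irq_poll_init(&my_iop, 32, my_poll); */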

>
> >
> > >
> > > Introducing an extra 16 queues just for interrupt coalescing and
> > > making them coexist with the regular 72 reply queues seems like one
> > > very unusual use case; I am not sure the current genirq affinity
> > > code can support it well.
> >
> > Yes. This is an unusual case. I think it is not used by any other
> > drivers.
> >
> > >
> > > > >
> > > > > >
> > > > > > All pre_vectors (16) will be mapped to all available online
> > > > > > CPUs, but the effective affinity of each vector is to CPU 0.
> > > > > > Our requirement is to have the 16 pre_vectors reply queues
> > > > > > mapped to the local NUMA node, with the effective CPUs spread
> > > > > > within the local node's cpu mask. Without changing kernel
> > > > > > code, we can
> > > > >
> > > > > If all CPUs in one NUMA node are offline, can this use case
> > > > > work as expected?
> > > > > Seems we have to understand what the use case is and how it
> > > > > works.
> > > >
> > > > Yes, if all CPUs of the NUMA node are offlined, the IRQ-CPU
> > > > affinity will be broken and irqbalance takes care of migrating
> > > > the affected IRQs to online CPUs of a different NUMA node.
> > > > When offline CPUs are onlined again, irqbalance restores the
> > > > affinity.
> > >
> > > irqbalance daemon can't cover managed interrupts, or you mean
> > > you don't use pci_alloc_irq_vectors_affinity(PCI_IRQ_AFFINITY)?
> >
> > Yes. We did not use "pci_alloc_irq_vectors_affinity".
> > We used "pci_enable_msix_range" and manually set the affinity in the
> > driver using "irq_set_affinity_hint".
>
> Then you have to cover all kinds of CPU hotplug issues in your driver,
> because you have switched to the driver maintaining the queue mapping.
>
> Thanks,
> Ming