Re: [PATCH] irqchip: gicv3-its: Use NUMA aware memory allocation for ITS tables

From: Marc Zyngier
Date: Mon Jul 10 2017 - 11:15:42 EST


On 10/07/17 15:57, Shanker Donthineni wrote:
> Hi Marc,
>
> On 07/10/2017 08:50 AM, Marc Zyngier wrote:
>> On 10/07/17 11:21, Ganapatrao Kulkarni wrote:
>>> Hi Marc,
>>>
>>> On Mon, Jul 10, 2017 at 2:53 PM, Marc Zyngier <marc.zyngier@xxxxxxx> wrote:
>>>> On 10/07/17 10:08, Ganapatrao Kulkarni wrote:
>>>>> On Mon, Jul 10, 2017 at 2:36 PM, Marc Zyngier <marc.zyngier@xxxxxxx> wrote:
>>>>>> On 10/07/17 09:48, Ganapatrao Kulkarni wrote:
>>>>>>> Hi Marc,
>>>>>>>
>>>>>>> On Mon, Jul 3, 2017 at 8:23 PM, Marc Zyngier <marc.zyngier@xxxxxxx> wrote:
>>>>>>>> Hi Shanker,
>>>>>>>>
>>>>>>>> On 03/07/17 15:24, Shanker Donthineni wrote:
>>>>>>>>> Hi Marc,
>>>>>>>>>
>>>>>>>>> On 06/30/2017 03:51 AM, Marc Zyngier wrote:
>>>>>>>>>> On 30/06/17 04:01, Ganapatrao Kulkarni wrote:
>>>>>>>>>>> On Fri, Jun 30, 2017 at 8:04 AM, Ganapatrao Kulkarni
>>>>>>>>>>> <gpkulkarni@xxxxxxxxx> wrote:
>>>>>>>>>>>> Hi Shanker,
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Jun 25, 2017 at 9:16 PM, Shanker Donthineni
>>>>>>>>>>>> <shankerd@xxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>>> The NUMA node information is visible to ITS driver but not being used
>>>>>>>>>>>>> other than handling errata. This patch allocates the memory for ITS
>>>>>>>>>>>>> tables from the corresponding NUMA node using the appropriate NUMA
>>>>>>>>>>>>> aware functions.
>>>>>>>>>>>
>>>>>>>>>>> IMHO, the description would have been more constructive?
>>>>>>>>>>>
>>>>>>>>>>> "All ITS tables are mapped by default to NODE 0 memory.
>>>>>>>>>>> Adding changes to allocate memory from respective NUMA NODES of ITS devices.
>>>>>>>>>>> This will optimize tables access and avoids unnecessary inter-node traffic."
>>>>>>>>>>
>>>>>>>>>> But more importantly, I'd like to see figures showing the actual benefit
>>>>>>>>>> of this per-node allocation. Given that both of you guys have access to
>>>>>>>>>> such platforms, please show me the numbers!
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'll share the actual results which shows the improvement whenever
>>>>>>>>> available on our next chips. Current version of Qualcomm qdf2400 doesn't
>>>>>>>>> support multi socket configuration to capture results and share with you.
>>>>>>>>>
>>>>>>>>> Do you see any other issues with this patch apart from the performance
>>>>>>>>> improvements. I strongly believe this brings the noticeable improvement
>>>>>>>>> in numbers on systems where it has multi node memory/CPU configuration.
>>>>>>>>
>>>>>>>> I agree that it *could* show an improvement, but it very much depends on
>>>>>>>> how often the ITS misses in its caches. For this kind of patches, I want
>>>>>>>> to see two things:
>>>>>>>>
>>>>>>>> 1) It brings a measurable benefit on NUMA platforms
>>>>>>>
>>>>>>> Did some measurement of interrupt response time for LPIs and we don't
>>>>>>> see any major
>>>>>>> improvement due to caching of Tables. However, we have seen
>>>>>>> improvements of around 5%.
>>>>>>
>>>>>> An improvement of what exactly?
>>>>>
>>>>> interrupt response time.
>>>>
>>>> Measured how? On which HW? Using which benchmark?
>>>
>>> This has been tested on ThunderX2.
>>> We have instrumented gic-v3-its driver code to create dummy LPI device
>>> with few vectors.
>>> The LPI is induced from dummy device(through sysfs by writing to
>>> TRANSLATOR reg).
>>> The ISR routine(gic_handle_irq) being called to handle the induced LPI.
>>> NODE 1 cpu is used to induce LPI and NODE 1 cpu/collection is mapped
>>> in ITT to route this LPI.
>>>
>>> CPU timer counter are sampled at the time LPI is Induced and in ISR
>>> routine to calculate interrupt response time.
>>> the result shown improvement of 5% with this patch.
>>
>> And you call that a realistic measurement of the latency? Really? Sorry,
>> but I cannot take you seriously here.
>>
>>> Do you have any recommended benchmarks to test the same?
>>
>> Run a standard benchmark such as netperf, post the result with and
>> without that patch. The above is just plain ridiculous.
>>
>
> The whole purpose of ACPI subtable "GIC Interrupt Translation Service (ITS) Affinity structure"
> is to provide the proximity information to OS so that software will take advantage of NUMA
> aware allocations to improve the read latency of ITS/GICR tables, not just for implementing
> software workarounds.
>
>
> I believe ITS driver should provide NUMA aware allocations just like x86 Linux drivers. How much
> performance improvement we observer is based on the individual SOC implementation, inter NODE
> latency, inter node traffic, cache capacity, and type of the test used to measure results.
>
> Please consider this patch irrespective of the test results running on a specific hardware. We
> need this patch for upcoming Qualcomm server chips.

"I believe" and "We need" are not a proof of the usefulness of this. We
can argue all day, or you can provide a set of convincing results. Your
choice. But I can guarantee you the the latter is a much better method
than the former.

If you (or Cavium) cannot be bothered to provide tangible results that
this is useful, why should I take this at face value? This is just like
any other improvement we make to the kernel. We back it *with data*.

M.
--
Jazz is not dead. It just smells funny...