Re: [PATCH] arm64: Add support for new control bits CTR_EL0.DIC and CTR_EL0.IDC
From: Shanker Donthineni
Date: Mon Feb 19 2018 - 11:35:42 EST
Hi Catalin,
On 02/19/2018 08:38 AM, Catalin Marinas wrote:
> On Fri, Feb 16, 2018 at 06:57:46PM -0600, Shanker Donthineni wrote:
>> The two Point of Unification (PoU) cache maintenance operations,
>> 'DC CVAU' and 'IC IVAU', are optional for implementers as per the
>> ARMv8 specification. This patch parses the updated CTR_EL0 register
>> definition and adds the required changes to skip PoU operations if
>> the hardware reports CTR_EL0.DIC and/or CTR_EL0.IDC.
>>
>> CTR_EL0.DIC: Instruction cache invalidation requirements for
>> instruction to data coherence. This is bit[29].
>>   0: Instruction cache invalidation to the Point of Unification
>>      is required for instruction to data coherence.
>>   1: Instruction cache invalidation to the Point of Unification
>>      is not required for instruction to data coherence.
>>
>> CTR_EL0.IDC: Data cache clean requirements for instruction to data
>> coherence. This is bit[28].
>>   0: Data cache clean to the Point of Unification is required for
>>      instruction to data coherence, unless CLIDR_EL1.LoC == 0b000
>>      or (CLIDR_EL1.LoUIS == 0b000 && CLIDR_EL1.LoUU == 0b000).
>>   1: Data cache clean to the Point of Unification is not required
>>      for instruction to data coherence.
>
> There is a difference between cache maintenance to PoU "is not required"
> and the actual instructions being optional (i.e. undef when executed).
> If your caches are transparent and DC CVAU/IC IVAU is not required,
> these instructions should behave as NOPs. So, are you trying to improve
> the performance of the cache maintenance routines in the kernel? If yes,
> please show some (relative) numbers and a better description in the
> commit log.
>
Yes, I agree with you: the PoU instructions are NOPs if the caches are
transparent, so there is no correctness issue. The problem is the
unnecessary overhead in the assembly routines that walk the VA range in
cache-line-size increments. The overhead is noticeable with 64K pages,
especially with section mappings. I'll reword the commit text to
reflect your comments in the v2 patch.

e.g. flushing a 512M section on a 64K PAGE_SIZE kernel, assuming a
64-byte cache line, flush_icache_range() consumes around 256M CPU
cycles:
  I-cache loop overhead: 512MB / 64B * 4 instructions per iteration
  D-cache loop overhead: 512MB / 64B * 4 instructions per iteration
With this patch it takes less than ~1K cycles.
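
For reference, here is a quick userspace sketch that decodes the
CTR_EL0 fields discussed above and redoes that arithmetic. It is
illustration only, not part of the patch; the 512M range and the
4-instructions-per-iteration figure are just the assumptions from the
example above, and it only builds/runs on an arm64 Linux machine:

#include <stdint.h>
#include <stdio.h>

#define CTR_DIC_SHIFT	29	/* 1: IC IVAU to PoU not required */
#define CTR_IDC_SHIFT	28	/* 1: DC CVAU to PoU not required */

int main(void)
{
	uint64_t ctr;
	unsigned int dline, iline;
	unsigned long long range = 512ULL << 20;	/* 512M example range */

	/* CTR_EL0 is readable from EL0 (directly or trapped and emulated). */
	asm volatile("mrs %0, ctr_el0" : "=r" (ctr));

	/* DminLine is bits [19:16], IminLine is bits [3:0]: log2(words). */
	dline = 4U << ((ctr >> 16) & 0xf);
	iline = 4U << (ctr & 0xf);

	printf("DIC=%u IDC=%u D-line=%u bytes I-line=%u bytes\n",
	       (unsigned int)((ctr >> CTR_DIC_SHIFT) & 1),
	       (unsigned int)((ctr >> CTR_IDC_SHIFT) & 1),
	       dline, iline);

	/* ~4 instructions per iteration in each maintenance loop. */
	printf("D-loop instructions: %llu\n", range / dline * 4);
	printf("I-loop instructions: %llu\n", range / iline * 4);
	return 0;
}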
> On the patch, I'd rather have an alternative framework entry for no VAU
> cache maint required and some ret instruction at the beginning of the
> cache maint function rather than jumping out of the loop somewhere
> inside the cache maintenance code, penalising the CPUs that do require
> it.
>
The alternatives framework might break things in the case of CPU
hotplug (a late CPU could come online without the feature after the
code has been patched). I'd like one more confirmation from you before
incorporating the alternatives framework.
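
Just to make sure I understand the shape you're suggesting, here is a
rough C sketch of the early-out structure (illustration only; in the
real code this would be an alternatives-patched 'ret' at the top of the
assembly routine, and the flag and helper names below are made up):

#include <stdbool.h>
#include <stdint.h>

/*
 * Would mirror CTR_EL0.IDC; in the kernel this would be an
 * alternatives-patched branch, not a run-time flag.
 */
static bool cache_idc;

/* Stand-in for the 'dc cvau, xN' issued by the assembly loop. */
static inline void dc_cvau(uintptr_t addr)
{
	asm volatile("dc cvau, %0" : : "r" (addr) : "memory");
}

static void clean_dcache_pou(uintptr_t start, uintptr_t end,
			     unsigned int line)
{
	uintptr_t addr;

	/*
	 * Early out before the loop, so CPUs that do need the
	 * maintenance pay nothing extra inside the loop body.
	 */
	if (cache_idc)
		return;

	for (addr = start & ~((uintptr_t)line - 1); addr < end; addr += line)
		dc_cvau(addr);
}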
--
Shanker Donthineni
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.