Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

From: Hawa, Hanna
Date: Tue Jun 11 2019 - 16:00:48 EST

Hi James,

Allowing linux to access these implementation-defined registers has come up before:

It looks like you've navigated most of the issues. Accessing implementation-defined
registers is frowned on, but this stuff can't be done generically until v8.2.
Sure, no planning to do this generally for all ARM a57/a72. I'm doing this specific for alpine SoCs.

This can't be done on 'all A57/A72' because some platforms may not have been integrated to
have error-checking at L1/L2, (1.3 'Features' of [0] says the ECC protection for L1 data
cache etc is optional). Even if they did, this stuff needs turning on in L2CTLR_EL1.
These implementation-defined registers may be trapped by the hypervisor, I assume for your
platform this is linux booted at EL2, so no hypervisor.
In Alpine-v2/Alpine-v3 Bit[21]-"ECC and parity enable" in L2CTRL_EL1 is enabled in the firmware.


(GENMASK() would make it quicker and easier to compare this with the datasheet)
Will be used in next patchset.

+#define ARM_CA57_L2_INC_PLRU_RAM 0x18

A57 describes this one as 'PF RAM'...
will be updated.

Linux supports versions of binutils that choke on this syntax.
See the sys_reg() definitions in arm64's asm/sysreg.h that define something you can feed
to read_sysreg_s(). It would save having these wrapper functions.

commit 72c583951526 ("arm64: gicv3: Allow GICv3 compilation with older binutils") for the
Great, I'll use sys_reg(), read_sysreg_s(), and remove the wrapper functions.

| if (!(val & ARM_CA57_CPUMERRSR_VALID))

would be easier to read, the same goes for 'fatal' as its a single bit.
Will be fixed, here and in al_a57_edac_l2merrsr.

+ edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");

How do we know this was corrected?

6.4.8 "Error Correction Code" has "Double-bit ECC errors set the fatal bit." in a
paragraph talking about the L1 memory system.
I'll check fatal field to check if it's uncorrected/corrected.

"L2 Error" ? Copy and paste?
copy/paste mistake, will be fixed.

+ edac_printk(KERN_CRIT, DRV_NAME, "CPU%d L1 %serror detected\n",
+ cpu, (fatal) ? "Fatal " : "");
+ edac_printk(KERN_CRIT, DRV_NAME, "RAMID=");
+ switch (ramid) {
+ case ARM_CA57_L1_I_TAG_RAM:
+ pr_cont("'L1-I Tag RAM' index=%d way=%d", index, way);
+ break;
+ case ARM_CA57_L1_I_DATA_RAM:
+ pr_cont("'L1-I Data RAM' index=%d bank= %d", index, way);
+ break;

Is index/way information really useful? I can't replace way-3 on the system, nor can I
stop it being used. If its useless, I'd rather we don't bother parsing and printing it out.
I'll remove the index/way information, and keep CPUMERRSR_EL1 value print (who want this information can parse the value and get the index/bank status)

+ pr_cont(", repeat=%d, other=%d (CPUMERRSR_EL1=0x%llx)\n", repeat, other,
+ val);

'other' here is another error, but we don't know the ramid.
'repeat' is another error for the same ramid.

Could we still feed this stuff into edac? This would make the counters accurate if the
polling frequency isn't quite fast enough.
There is no API in EDAC to increase the counters, I can increase by accessing the ce_count/ue_count from edac_device_ctl_info/edac_device_instance structures if it's okay..

+static void al_a57_edac_l2merrsr(void *arg)

+ edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");

How do we know this is corrected?

If looks like L2CTLR_EL1[20] might force fatal 1/0 to map to uncorrected/corrected. Is
this what you are depending on here?
No - not on this. Reporting all the errors as corrected seems to be bad.

Can i be depends on fatal field?

if (fatal)
edac_device_handle_ue(edac_dev, 0, 0, "L2 Error");
edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");

How can L2CTLR_EL1[20] force fatal?

(it would be good to have a list of integration-time and firmware dependencies this driver
has, for the next person who tries to enable it on their system and complains it doesn't
work for them)
Where can i add such information?

+ case ARM_CA57_L2_INC_PLRU_RAM:
+ pr_cont("'L2 Inclusion PLRU RAM'");

The A57 TRM describes this as "Inclusion PF RAM", and notes its only in r1p0 or later,
(but doesn't say what it is). The A72 TRM describes the same encoding as "Inclusion PLRU
RAM", which is something to do with its replacement policy. It has control bits that A57's
version doesn't, so these are not the same thing.

Disambiguating A57/A72 here is a load of faff, 'L2 internal metadata' probably covers both
cases, but unless these RAMs are replaceable or can be disabled, there isn't much point
working out which one it was.
Will be fixed to 'L2 internal metadata'

+ pr_cont(", cpuid/way=%d, repeat=%d, other=%d (L2MERRSR_EL1=0x%llx)\n",
+ way, repeat, other, val);

cpuid could be useful if you can map it back to the cpu number linux has.
If you can spot that cpu-7 is experiencing more errors than it should, you can leave it

To do this you'd need to map each L2MERRSR_EL1's '0-3' range back to the CPUs they
actually are. The gic's 'ppi-partitions' does this with phandles, e.g.
Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml. You could add a
similar shaped thing to the l2-cacheX node in the DT, (or in your edac node, but it is a
property of the cache integration).
As in L1 prints, I'll remove non-relevant prints.

+ /*
+ * Use get_online_cpus/put_online_cpus to prevent the online CPU map
+ * changing while reads the L1/L2 error status

For walking the list of offline cpus, this makes sense. But you schedule work without
waiting, it may get run after you drop the cpus_read_lock()...,
Will update the smp_call_function_single() call function to wait.

+ get_online_cpus();

The comment above these helpers is:
| /* Wrappers which go away once all code is converted */

Will be updated.

+ for_each_online_cpu(cpu) {
+ /* Check L1 errors */
+ smp_call_function_single(cpu, al_a57_edac_cpumerrsr, edac_dev,
+ 0);

As you aren't testing for big/little, wouldn't on_each_cpu() here be simpler?
Could be simpler for L1, how can be implemented for L2?

As you don't wait, what stops al_a57_edac_cpumerrsr() feeding two errors into
edac_device_handle_ce() at the same time? Do you need a spinlock in al_a57_edac_cpumerrsr()?
Will call smp_call_function_single() with wait.

+ cluster = topology_physical_package_id(cpu);

Hmm, I'm not sure cluster==package is guaranteed to be true forever.

If you describe the L2MERRSR_EL1 cpu mapping in your DT you could use that. Otherwise
pulling out the DT using something like the arch code's parse_cluster().
I rely on that it's alpine SoC specific driver.

+ if (cluster != last_cluster) {
+ smp_call_function_single(cpu, al_a57_edac_l2merrsr,
+ edac_dev, 0);
+ last_cluster = cluster;
+ }

Here you depend on the CPUs being listed in cluster-order in the DT. I'm fairly sure the
numbering is arbitrary: On my Juno 0,3,4,5 are the A53 cluster, and 1,2 are the A57 cluster.

If 1,3,5 were cluster-a and 2,4,6 were cluster-b, you would end up calling
al_a57_edac_l2merrsr() for each cpu. As you don't wait, they could race.

If you can get a cpu-mask for each cluster, smp_call_function_any() would to the
pick-one-online-cpu work for you.
Again, I rely on that it's alpine SoC specific driver.
How can I get cpu-mask for each cluster? from DT?

+static int al_a57_edac_remove(struct platform_device *pdev)
+ struct edac_device_ctl_info *edac_dev = platform_get_drvdata(pdev);
+ edac_device_del_device(edac_dev->dev);
+ edac_device_free_ctl_info(edac_dev);

Your poll function schedule work on other CPUs and didn't wait, is it possible
al_a57_edac_l2merrsr() is still using this memory when you free it?
This will be okay, after using wait in smp_call_function_single().



To match the SPDX header?
Will be fixed.

Thanks for your detailed review.