Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

From: James Morse
Date: Thu Jun 13 2019 - 13:10:05 EST

Hi Hawa,

On 11/06/2019 20:56, Hawa, Hanna wrote:
> James Morse wrote:
>> Hawa, Hanna wrote:
>>> +	edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");
>> How do we know this was corrected?
>> 6.4.8 "Error Correction Code" has "Double-bit ECC errors set the fatal bit." in a
>> paragraph talking about the L1 memory system.

> I'll check the fatal field to see whether the error is corrected or uncorrected.

>>> +	edac_printk(KERN_CRIT, DRV_NAME, "CPU%d L1 %serror detected\n",
>>> +			cpu, (fatal) ? "Fatal " : "");
>>> +	edac_printk(KERN_CRIT, DRV_NAME, "RAMID=");
>>> +
>>> +	switch (ramid) {
>>> +	case ARM_CA57_L1_I_TAG_RAM:
>>> +		pr_cont("'L1-I Tag RAM' index=%d way=%d", index, way);
>>> +		break;
>>> +	case ARM_CA57_L1_I_DATA_RAM:
>>> +		pr_cont("'L1-I Data RAM' index=%d bank=%d", index, way);
>>> +		break;
>> Is index/way information really useful? I can't replace way-3 on the system, nor can I
>> stop it being used. If it's useless, I'd rather we don't bother parsing and printing it out.

> I'll remove the index/way information and keep the raw CPUMERRSR_EL1 value print (whoever
> wants this information can parse the value and extract the index/bank status).

Good idea, just print it raw.

>>> +	pr_cont(", repeat=%d, other=%d (CPUMERRSR_EL1=0x%llx)\n", repeat, other,
>>> +		val);
>> 'other' here is another error, but we don't know the ramid.
>> 'repeat' is another error for the same ramid.
>> Could we still feed this stuff into edac? This would make the counters accurate if the
>> polling frequency isn't quite fast enough.

> There is no API in EDAC to increase the counters. I could increase them by accessing
> ce_count/ue_count in the edac_device_ctl_info/edac_device_instance structures, if that's okay.

Ah, sorry, I was thinking of the edac_mc_handle_error(), which has an error_count parameter.

(I wouldn't go poking in the structures directly).

>>> +static void al_a57_edac_l2merrsr(void *arg)
>>> +{
>>> +	edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");
>> How do we know this is corrected?

>> It looks like L2CTLR_EL1[20] might force fatal 1/0 to map to uncorrected/corrected. Is
>> this what you are depending on here?

> No - not on this. Reporting all the errors as corrected seems bad.
> Can I depend on the fatal field instead?

That is described as "set to 1 on the first memory error that caused a Data Abort". I
assume this is one of the parity-error external-aborts.

If the repeat counter shows, say, 2, and fatal is set, you only know that at least one of
these errors caused an abort. But it could have been all three. The repeat counter only
matches against the RAMID and friends, otherwise the error is counted in 'other'.

I don't think there is a right thing to do here (other than increasing the scrubbing
frequency). As you can only feed one error into edac at a time, then:

> if (fatal)
> 	edac_device_handle_ue(edac_dev, 0, 0, "L2 Error");
> else
> 	edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");

seems reasonable. You're reporting the most severe, and 'other/repeat' counter values just
go missing.

> How can L2CTLR_EL1[20] force fatal?

I don't think it can, on a second reading, it looks to be even more complicated than I
thought! That bit is described as disabling forwarding of uncorrected data, but it looks
like the uncorrected data never actually reaches the other end. (I'm unsure what 'flush'
means in this context.)
I was looking for reasons you could 'know' that any reported error was corrected. This was
just a bad suggestion!

>> (it would be good to have a list of integration-time and firmware dependencies this driver
>> has, for the next person who tries to enable it on their system and complains it doesn't
>> work for them)

> Where can I add such information?

The mailing list archive is good enough. I'll ask about any obvious dependency I spot from
the TRM, (e.g. that list at the top of my first reply). If you know of anything weird
please call it out.

>>> +	pr_cont(", cpuid/way=%d, repeat=%d, other=%d (L2MERRSR_EL1=0x%llx)\n",
>>> +		way, repeat, other, val);
>> cpuid could be useful if you can map it back to the cpu number linux has.
>> If you can spot that cpu-7 is experiencing more errors than it should, you can leave it
>> offline.
>> To do this you'd need to map each L2MERRSR_EL1's '0-3' range back to the CPUs they
>> actually are. The gic's 'ppi-partitions' does this with phandles, e.g.
>> Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml. You could add a
>> similar shaped thing to the l2-cacheX node in the DT, (or in your edac node, but it is a
>> property of the cache integration).

> As with the L1 prints, I'll remove the non-relevant prints.

Fair enough.

>>> +	/*
>>> +	 * Use get_online_cpus/put_online_cpus to prevent the online CPU map
>>> +	 * changing while reading the L1/L2 error status
>> For walking the list of offline cpus, this makes sense. But you schedule work without
>> waiting, it may get run after you drop the cpus_read_lock()...,

> Will update the smp_call_function_single() call to wait.

>>> +	for_each_online_cpu(cpu) {
>>> +		/* Check L1 errors */
>>> +		smp_call_function_single(cpu, al_a57_edac_cpumerrsr, edac_dev,
>> As you aren't testing for big/little, wouldn't on_each_cpu() here be simpler?

> Could be simpler for L1, but how can it be implemented for L2?

You'd need bitmasks for each cluster to feed to smp_call_function_any().

>>> +		cluster = topology_physical_package_id(cpu);
>> Hmm, I'm not sure cluster==package is guaranteed to be true forever.
>> If you describe the L2MERRSR_EL1 cpu mapping in your DT you could use that. Otherwise
>> pulling out the DT using something like the arch code's parse_cluster().

> I rely on this being an Alpine SoC-specific driver.

... and that the topology code hasn't changed to really know what a package is:

As what you really want to know is 'same L2?', and you're holding the cpus_read_lock(),
would struct cacheinfo's shared_cpu_map be a better fit?

This would be done with something like a cpumask accumulating the shared_cpu_map of each L2
you've visited. It removes the dependency on package==L2, and insulates you from the
cpu-numbering not being exactly as you expect.

>>> +		if (cluster != last_cluster) {
>>> +			smp_call_function_single(cpu, al_a57_edac_l2merrsr,
>>> +			last_cluster = cluster;
>>> +		}
>> Here you depend on the CPUs being listed in cluster-order in the DT. I'm fairly sure the
>> numbering is arbitrary: On my Juno 0,3,4,5 are the A53 cluster, and 1,2 are the A57
>> cluster.
>> If 1,3,5 were cluster-a and 2,4,6 were cluster-b, you would end up calling
>> al_a57_edac_l2merrsr() for each cpu. As you don't wait, they could race.
>> If you can get a cpu-mask for each cluster, smp_call_function_any() would do the
>> pick-one-online-cpu work for you.

> Again, I rely on this being an Alpine SoC-specific driver.
> How can I get a cpu-mask for each cluster? From DT?

It's not the cluster you want, it's the L2. Cacheinfo has this for online CPUs, and you're
already holding the cpus_read_lock().

>>> +static int al_a57_edac_remove(struct platform_device *pdev)
>>> +{
>>> +	struct edac_device_ctl_info *edac_dev = platform_get_drvdata(pdev);
>>> +
>>> +	edac_device_del_device(edac_dev->dev);
>>> +	edac_device_free_ctl_info(edac_dev);
>> Your poll function schedules work on other CPUs and doesn't wait; is it possible
>> al_a57_edac_l2merrsr() is still using this memory when you free it?

> This will be okay after using wait in smp_call_function_single().