Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

From: Hawa, Hanna
Date: Mon Jun 17 2019 - 09:05:53 EST



+static void al_a57_edac_l2merrsr(void *arg)
+{

+ÂÂÂ edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");

How do we know this is corrected?

If looks like L2CTLR_EL1[20] might force fatal 1/0 to map to uncorrected/corrected. Is
this what you are depending on here?

No - not on this. Reporting all the errors as corrected seems to be bad.

Can i be depends on fatal field?

That is described as "set to 1 on the first memory error that caused a Data Abort". I
assume this is one of the parity-error external-aborts.

If the repeat counter shows, say, 2, and fatal is set, you only know that at least one of
these errors caused an abort. But it could have been all three. The repeat counter only
matches against the RAMID and friends, otherwise the error is counted in 'other'.

I don't think there is a right thing to do here, (other than increase the scrubbing
frequency). As you can only feed one error into edac at a time then:

if (fatal)
ÂÂÂÂedac_device_handle_ue(edac_dev, 0, 0, "L2 Error");
else
ÂÂÂÂedac_device_handle_ce(edac_dev, 0, 0, "L2 Error");

seems reasonable. You're reporting the most severe, and 'other/repeat' counter values just
go missing.
I had print the values of 'other/repeat' to be noticed.



How can L2CTLR_EL1[20] force fatal?

I don't think it can, on a second reading, it looks to be even more complicated than I
thought! That bit is described as disabling forwarding of uncorrected data, but it looks
like the uncorrected data never actually reaches the other end. (I'm unsure what 'flush'
means in this context.)
I was looking for reasons you could 'know' that any reported error was corrected. This was
just a bad suggestion!
Is there interrupt for un-correctable error?
Does 'asynchronous errors' in L2 used to report UE?

In case no interrupt, can we use die-notifier subsystem to check if any error had occur while system shutdown?

+ÂÂÂÂÂÂÂ cluster = topology_physical_package_id(cpu);

Hmm, I'm not sure cluster==package is guaranteed to be true forever.

If you describe the L2MERRSR_EL1 cpu mapping in your DT you could use that. Otherwise
pulling out the DT using something like the arch code's parse_cluster().

I rely on that it's alpine SoC specific driver.

... and that the topology code hasn't changed to really know what a package is:
https://lore.kernel.org/lkml/20190529211340.17087-2-atish.patra@xxxxxxx/T/#u

As what you really want to know is 'same L2?', and you're holding the cpu_read_lock(),
would struct cacheinfo's shared_cpu_map be a better fit?

This would be done by something like a cpu-mask of cache:shared_cpu_map's for the L2's
you've visited. It removes the dependency on package==L2, and insulates you from the
cpu-numbering not being exactly as you expect.
I'll add dt property that point to L2-cache node (phandle), then it'll be easy to create cpu-mask with all cores that point to same l2 cache.

Thanks,
Hanna