Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

From: James Morse
Date: Fri Jun 07 2019 - 11:16:03 EST


Hi guys,

On 06/06/2019 12:37, Shenhar, Talel wrote:
>>> Disagree. The various drivers don't depend on each other.
>>> I think we should keep the drivers separated as they are distinct and independent IP
>>> blocks.
>> But they don't exist in isolation, they both depend on the integration-choices/firmware
>> that makes up your platform.
>>
>> Other platforms may have exactly the same IP blocks, configured differently, or with
>> different features enabled in firmware. This means we can't just probe the driver based on
>> the presence of the IP block, we need to know the integration choices and firmware
>> settings match what the driver requires.
>>
>> (Case in point, that A57 ECC support is optional, another A57 may not have it)
>>
>> Descriptions of what firmware did don't really belong in the DT. It's not a hardware
>> property.
>>
>> This is why it's better to probe this stuff based on the machine-compatible/platform-name,
>> not the presence of the IP block in the DT.
>>
>>
>> Will either of your separate drivers ever run alone? If they're probed from the same
>> machine-compatible this won't happen.
>>
>>
>> How does your memory controller report errors? Does it send back some data with an invalid
>> checksum, or a specific poison/invalid flag? Will the cache report this as a cache error
>> too? If it's an extra signal, does the cache know what it is?
>>
>> All these are integration choices between the two IP blocks, done as separate drivers we
>> don't have anywhere to store that information. Even if you don't care about this, making
>> them separate drivers should only be done to make them usable on other platforms, where
>> these choices may have been different.

> From our perspective, l1/l2 has nothing to do with the ddr memory controller.

I understand you're coming from the position that these things have counters, you want
something to read and export them.

I'm coming at this from somewhere else. This stuff has to be considered all the way
through the system. Just because each component supports error detection doesn't mean you
aren't going to get silent corruption. Likewise, if another platform picks up two piecemeal
edac drivers for hardware it happens to have in common with yours, it doesn't mean we're
counting all the errors. This stuff has to be viewed for the whole platform.


> It's right that they both use same edac subsystem but they are using totally different APIs
> of it.
>
> We also even want to have separate control for enabling/disabling l1/l2 edac vs memory
> controller edac.

Curious, what for? Surely you either care about counting errors, or you don't.


> Even from technical point-of-view L1/L2 UE collection method is totally different from
> collecting memory-controller UE. (CPU exception vs actual interrupts).
>
> So there is less reason why to combine them vs giving each one its own file, e.g.
> al_mc_edac, al_l1_l2_edac (I even don't see why Hanna combined l1 and l2...)

> As we don't have any technical relation between the two we would rather avoid this
> combination.
>
> Also, let's assume we have different setups with different memory controllers, having a dt
> binding to control the difference is super easy and flexible.

If the hardware is different you should describe this in the DT. I'm not suggesting you
don't describe it.

The discussion here is whether we should probe the driver based on a dummy-node
compatible (which this 'edac_l1_l2' is), or based on the machine compatible.

At the extreme end: you should paint the CPU and cache nodes with a compatible describing
your integration. (I've mangled Juno's DT here:)
| A57_0: cpu@0 {
|     compatible = "amazon-al,cortex-a57", "arm,cortex-a57";
|     reg = <0x0 0x0>;
|     device_type = "cpu";
|     next-level-cache = <&A57_L2>;
| };
|
[...]
|
| A57_L2: l2-cache0 {
|     compatible = "amazon-al,cache", "cache";
|     cpu_map = <&A57_0 &A57_1>;
| };


This is the most accurate way to describe what you have here. The driver can use this to
know that this integration of CPU and cache supports the edac registers. (This doesn't tell
us anything about whether firmware enabled this stuff, or made/left it all secure-only.)

But this doesn't give you a device you can bind a driver to, to kick this stuff off.
This (I assume) is why you added a dummy 'edac_l1_l2' node, that just probes the driver.
The hardware is to do with the CPU and caches, 'edac_l1_l2' doesn't correspond to any
distinct part of the SoC.

The request is to use the machine compatible, not a dummy node. This wraps up the firmware
properties too, and any other platform property we don't know about today.
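
To be concrete about what I mean by the machine compatible: it is the one on the root
node. A rough sketch, where the model and both compatible strings are made-up
placeholders and not a real binding:
| / {
|     /* placeholder strings, for illustration only */
|     model = "Example Annapurna Labs board";
|     compatible = "amazon,al-example-board", "amazon,al-example-soc";
|
[...]
| };

The driver's init code could then check this with of_machine_is_compatible() and
register its own platform device, so no extra node is needed in the DT.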

Once you have this, you don't really need the cpu/cache integration annotations, and your
future memory-controller support can be picked up as part of the platform driver.
If you have otherwise identical platforms with different memory controllers, OF gives you
the API to match the node in the DT.
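
As a sketch of that last point (again, the node name, compatible string and addresses
below are placeholders I've invented, not a real binding), two otherwise identical
platforms could differ only in a node like this:
| memory-controller@f0080000 {
|     /* placeholder compatible and reg, for illustration only */
|     compatible = "amazon,al-example-mc-v2";
|     reg = <0x0 0xf0080000 0x0 0x10000>;
| };

The platform driver could find it with something like of_find_compatible_node(), instead
of needing a second top-level driver that probes on its own.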


> Would having a dedicated folder for amazon ease the move to separate files?

I don't think anyone cares about the number of files. Code duplication and extra
boiler-plate, maybe.


Thanks,

James