Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

From: Benjamin Herrenschmidt
Date: Wed Jun 12 2019 - 04:34:29 EST


On Wed, 2019-06-12 at 05:48 +0200, Borislav Petkov wrote:
> On Wed, Jun 12, 2019 at 08:25:52AM +1000, Benjamin Herrenschmidt wrote:
> > Yes, we would be in a world of pain already if tracepoints couldn't
> > handle concurrency :-)
>
> Right, lockless buffer and the whole shebang :)

Yup.

> > Sort-of... I still don't see a race in what we propose but I might be
> > missing something subtle. We are talking about two drivers for two
> > different IP blocks updating different counters etc...
>
> If you do only *that* you should be fine. That should technically be ok.

Yes, that's the point.

> I still think, though, that the sensible thing to do is have one
> platform driver which concentrates all RAS functionality.

I tend to disagree here. We've been down that rabbit hole in the past
and we (Linux in general) are trying to move away from that sort of
"platform" overarching driver as much as possible.

> It is the
> more sensible design and takes care of potential EDAC shortcomings and
> the need to communicate between the different logging functionality,
> as in, for example, "I had so many errors, lemme go and increase DRAM
> scrubber frequency." For example. And all the other advantages of having
> everything in a single driver.

This is a policy. It should either belong to userspace, or be in some
generic RAS code in the kernel; there's no reason why these can't be
abstracted. Also, in your specific example, it could be entirely local
to the MC EDAC / DRAM controller path. We could have a generic way for
EDAC to advertise that a given memory channel is giving lots of errors
and have memory controller drivers listen to it, but usually the EDAC MC
driver *is* the only thing that looks like a MC driver to begin with,
so again, pretty much no overlap with L1/L2 caches RAS or PCIe RAS
etc...
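
To make the "advertise and listen" idea a bit more concrete, here is a
rough userspace sketch in C. It is not kernel code and not an existing
EDAC interface: every name in it (ras_event, ras_register_listener,
ras_report_channel_errors, mc_state, mc_on_ras_event) is made up for
illustration, and the "double the scrub rate" policy is only a
placeholder. A real kernel version would presumably sit on top of
something like a notifier chain rather than a fixed table.

```c
#include <stddef.h>

/* Hypothetical event: error count reported for one memory channel. */
struct ras_event {
	int channel;            /* memory channel index */
	unsigned int ce_count;  /* corrected errors seen so far */
};

typedef void (*ras_listener_fn)(const struct ras_event *ev, void *ctx);

#define RAS_MAX_LISTENERS 8

struct ras_listener {
	ras_listener_fn fn;
	void *ctx;
};

static struct ras_listener ras_listeners[RAS_MAX_LISTENERS];
static size_t ras_listener_count;

/* Register a listener. Returns 0 on success, -1 if fn is NULL or the
 * table is full. */
int ras_register_listener(ras_listener_fn fn, void *ctx)
{
	if (fn == NULL || ras_listener_count >= RAS_MAX_LISTENERS)
		return -1;
	ras_listeners[ras_listener_count].fn = fn;
	ras_listeners[ras_listener_count].ctx = ctx;
	ras_listener_count++;
	return 0;
}

/* Called from the (hypothetical) EDAC MC reporting path: advertise the
 * current error count of a channel to every registered listener. */
void ras_report_channel_errors(int channel, unsigned int ce_count)
{
	struct ras_event ev = { .channel = channel, .ce_count = ce_count };

	for (size_t i = 0; i < ras_listener_count; i++)
		ras_listeners[i].fn(&ev, ras_listeners[i].ctx);
}

/* Example listener: state owned by a memory controller driver. */
struct mc_state {
	unsigned int scrub_rate;    /* arbitrary units */
	unsigned int ce_threshold;  /* react at or above this count */
};

/* Placeholder policy: double the scrub rate once the reported count
 * reaches the threshold. The policy lives in the listener, not in the
 * reporting driver. */
void mc_on_ras_event(const struct ras_event *ev, void *ctx)
{
	struct mc_state *mc = ctx;

	if (ev->ce_count >= mc->ce_threshold)
		mc->scrub_rate *= 2;
}
```

The point of the split is that the reporting side only calls
ras_report_channel_errors() and knows nothing about who listens, so
cache or PCIe RAS drivers never need to be involved.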

> And x86 already does that - we even have a single driver for all AMD
> platforms - amd64_edac. Intel has a couple but there's still a lot of
> sharing.

Unless I'm mistaken, that amd64 EDAC is just an MC one... but I only
had a cursory glance at the code.

> But apparently ARM folks want to have one driver per IP block. And we
> have this discussion each time a new vendor decides to upstream its
> driver. And there's no shortage of vendors in ARM-land trying to do
> that.

For good reasons :-)

> James and I have tried to come up with a nice scheme to make that work
> on ARM and he has an example prototype here:
>
> http://www.linux-arm.org/git?p=linux-jm.git;a=shortlog;h=refs/heads/edac_dummy/v1
>
> to show how it could look like.
>
> But I'm slowly growing a serious aversion against having this very same
> discussion each time an ARM vendor sends a driver. And that happens
> pretty often nowadays.

Maybe because what you are promoting might not be the right path
here... seriously, there's a reason why all vendors want to go down
that path and in this case I don't think they are wrong.

This isn't about just another ARM vendor; in fact I'm rather new to the
whole ARM thing, I used to maintain arch/powerpc :-) The point is that
what you are trying to push for goes against everything we've been
trying to do in Linux when it comes to splitting drivers to individual
IP blocks.

Yes, in *some* cases coordination will be needed, in which case there
are ways to do that which don't necessarily involve matching a driver
to the root of the DT. A pseudo-device is in fact a very reasonable way
to do it; it was a common practice in IEEE1275 before I invented the
FDT, and we do that for a number of other things already.

Cheers,
Ben.