Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

From: Keith Busch
Date: Wed Feb 27 2019 - 12:51:10 EST


On Wed, Feb 27, 2019 at 04:42:05PM +0000, Alex_Gagniuc@xxxxxxxxxxxx wrote:
> On 2/26/19 7:02 PM, Linus Torvalds wrote:
> > On Tue, Feb 26, 2019 at 2:37 PM <Alex_Gagniuc@xxxxxxxxxxxx> wrote:
> >>
> >> Then nobody gets the (error) message. You can go a bit further and try
> >> 'pcie_ports=native'. Again, nobody gets the memo. ):
> >
> > So? The error was bogus to begin with. Why would we care?
>
> Of course nobody cares about that. We care about actual errors that we
> now know we won't be notified of. Imagine if we didn't get the memo that
> a piece of data is corrupt, and imagine the reaction of RAS folk.
>
> And I know the counter to that is a panic() is much more likely to cause
> data corruption, and we're trading one piece of crap for an even
> stinkier one. Whatever we end up doing, we have to do better than
> silencing errors and pretending nothing happened.
>
>
> > Yes, yes, PCI bridges have the ability to return errors in accesses to
> > non-existent devices. But that was always bogus, and is never useful.
> > The whole "you get an interrupt or NMI on a bad access" is simply a
> > horribly broken model. It's not useful.
> >
> > We already have long depended on hotplug drivers noticing the "oh, I'm
> > getting all-ff returns, the device may be gone". It's usually trivial,
> > and works a whole lot better.
>
> And that's been working great, hasn't it? I think you're thinking
> strictly about hotplug. There are other situations where things are all
> F'd, but the hardware isn't sending all F's. (example: ECRC errors)
>
>
> > It's not an error. Trying to force it to be an NMI or SCI or machine
> > check is bogus. It causes horrendous pain, because asynchronous
> > reporting doesn't work reliably anyway, and *synchronous* reporting is
> > impossible to sanely handle without crazy problems.
> >
> > So the only sane model for hotplug devices is "IO still works, and
> > returns all ones". Maybe with an async one-time and *recoverable*
> > machine check or other reporting the access after the fact.
>
> Exactly!!! A notification (not calling it an 'error') that something
> unusual has happened is good. Treating these things like errors is so
> obvious, even a caveman wouldn't do it.
> In a world with FFS, we don't always get to have that model. Oh, FFS!
>
>
> > Anything else is simply broken. It would be broken even if firmware
> > wasn't involved, but obviously firmware people tend to often make a
> > bad situation even worse.
>
> Linus, be nice to firmware people. I've met a few, and I can vouch that
> they're very kind and nice. They're also very scared, especially when OS
> people want to ask them a few questions.
>
> I think FFS should get out of the way when the OS advertises it's capable of
> handling XYZ. There are some good arguments why this hasn't happened,
> but I won't get into details. I do think it's unlikely that machines
> will be moving back to an OS-controlled model.
>
> And Linus, keep in mind, when these machines were developed, OSes
> couldn't handle recovery properly. None of this was ever an issue. It's
> our fault that we've changed the OS after the machines are on the market.
>
> Alex

I can't tell where you're going with this. It doesn't sound like you're
talking about hotplug anymore, at least.