RE: RFC: IOMMU/AMD: Error Handling

From: Duran, Leo
Date: Mon Apr 29 2013 - 18:31:35 EST


I see... I suppose the trick is going to be how to 'filter' this unintended behavior (once, during OS boot).
Thanks,
Leo.
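
For illustration only, the per-device 'filter' discussed in this thread might look roughly like the userspace sketch below. This is not driver code: struct fault_filter, fault_should_log(), the storm threshold and the counting window are all hypothetical names and values chosen for the example; only the 5-minute mute period comes from the suggestion quoted below. Isolation itself is left to the hardware; the sketch only decides whether a fault gets logged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All three values are assumptions for this sketch, not driver constants. */
#define STORM_THRESHOLD   10   /* faults per window before we call it a storm */
#define STORM_WINDOW_SEC  1    /* length of the counting window, in seconds */
#define MUTE_SEC          300  /* mute period: 5 minutes, as suggested in the thread */

/* Hypothetical per-device state; one instance per device ID. */
struct fault_filter {
    uint16_t devid;          /* PCI requester ID of the faulting device */
    uint64_t window_start;   /* start of the current counting window (seconds) */
    unsigned int count;      /* faults seen in the current window */
    uint64_t muted_until;    /* logging is suppressed until this time (seconds) */
};

/*
 * Decide whether a fault from this device should be logged at time 'now'.
 * The DMA is already blocked by the IOMMU; this only throttles the messages.
 */
static bool fault_should_log(struct fault_filter *f, uint64_t now)
{
    assert(f != NULL);

    if (now < f->muted_until)
        return false;                      /* still inside a mute period */

    if (now - f->window_start >= STORM_WINDOW_SEC) {
        f->window_start = now;             /* open a fresh counting window */
        f->count = 0;
    }

    if (++f->count > STORM_THRESHOLD) {
        f->muted_until = now + MUTE_SEC;   /* storm detected: go quiet */
        f->count = 0;
        return false;
    }

    return true;
}
```

With these values, a device that raises 15 faults in the same second gets its first 10 logged, and is then silent for 300 seconds. Once an OS driver has reset the device the faults stop, so the filter simply never trips again.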

> -----Original Message-----
> From: Don Dutile [mailto:ddutile@xxxxxxxxxx]
> Sent: Monday, April 29, 2013 4:42 PM
> To: Duran, Leo
> Cc: Suthikulpanit, Suravee; iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx
> Subject: Re: RFC: IOMMU/AMD: Error Handling
>
> On 04/29/2013 04:34 PM, Duran, Leo wrote:
> > I'm wondering if resetting the IOMMU at init-time (once) would clear any
> > BIOS-induced noise.
> > Leo
> >
> Well, depends what you mean by 'reset'....
> (a) setting it up for OS use is effectively a reset, but it doesn't quiesce a
> device doing DMA reads of a (BIOS-set-up) queue; then the noisy messages begin.
> (b) disable the IOMMU, and then the DMA just occurs... which is potentially
> bad for writes.
>
> A similar issue is being reported & worked for kdump, where devices are still
> doing DMA while the system is trying to 'reset' to the kexec'd kernel and
> take a crash dump.
>
> Solution: stop devices from doing dma... but some you _want_ enabled
> throughout...
> like keyboard & mouse via usb controller, so you get to pick os from
> grub... not so for kexec...
>
> so, again, for isolation faults.... let the hw do its job -- isolate and
> throttle/silence the fault messages on a per-device, time-duration heuristic
> so the system can get through boot-up where enough OS is init'd (drivers
> started) to stop the temporary noise.
>
> >> -----Original Message-----
> >> From: iommu-bounces@xxxxxxxxxxxxxxxxxxxxxxxxxx [mailto:iommu-
> >> bounces@xxxxxxxxxxxxxxxxxxxxxxxxxx] On Behalf Of Don Dutile
> >> Sent: Monday, April 29, 2013 3:10 PM
> >> To: Suthikulpanit, Suravee
> >> Cc: iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> >> Subject: Re: RFC: IOMMU/AMD: Error Handling
> >>
> >> On 04/29/2013 03:45 PM, Suravee Suthikulanit wrote:
> >>> Joerg,
> >>>
> >>> We are in the process of implementing AMD IOMMU error handling, and I
> >>> would like some comments from you and the community.
> >>>
> >>> Currently, the AMD IOMMU driver only reports events from the event log
> >>> in dmesg, and does not try to handle them in case of errors. AMD
> >>> IOMMU errors can be categorized as device-specific errors and IOMMU
> >>> errors.
> >>>
> >>> 1. For IOMMU errors such as:
> >>> - DEV_TAB_HARDWARE_ERROR
> >>> - PAGE_TAB_ERROR
> >>> - COMMAND_HARDWARE_ERROR
> >>> If the error is detected during IOMMU initialization, we could disable
> >>> the IOMMU and proceed. If the error occurs after the IOMMU is initialized,
> >>> we won't be able to recover from it, and might need to panic.
> >>>
> >>> 2. For device-specific errors such as:
> >>> - ILLEGAL_DEV_TABLE_ENTRY
> >>> - IO_PAGE_FAULT
> >>> - INVALID_DEVICE_REQUEST
> >>> We think the AMD IOMMU driver should try to isolate the device. This
> >>> involves blocking device transactions at the IOMMU DTE and trying to
> >>> disable the device (e.g. by calling the remove(struct pci_dev *pdev)
> >>> interface generally provided by device drivers). This could prevent
> >>> the device from continuing to fail and risking system instability.
> >>>
> >> disabling the device is not an option.
> >> We've seen mis-configured ACPI tables generate storms of invalid-DTE
> >> messages after iommu setup but before they are cleared up when the OS
> >> driver is started & resets the device. The original storm is from
> >> bios-use of IOMMU with a device.
> >> I'd recommend creating a filter that prevents further logging from a
> >> device for 5 mins at a time if a storm of DTE-related errors is seen.
> >> by definition, the DMA is blocked from corrupting/changing memory, so
> >> isolation has been established; keeping the failure log from
> >> consuming the system is the needed fix.
> >>
> >>> 3. In case of a posted memory write transaction, the device driver might
> >>> not be aware that the transaction has failed and was blocked at the IOMMU.
> >>> If there is no HW IOMMU, I believe this is handled by PCI error handling
> >>> code. If the IOMMU hardware reports such a case, could this potentially
> >>> leverage the Linux IOMMU fault handling interface,
> >>> iommu_set_fault_handler() and report_iommu_fault(), to communicate
> >>> to the device driver or PCI driver?
> >>>
> >> Wondering if you could use an AER-like callback mechanism so a driver
> >> can be invoked when an IOMMU error occurs, so the device driver can
> >> quiesce or reset the device if it deems the error transient.
> >>
> >>
> >>> Any feedback or comments are appreciated.
> >>>
> >>> Thank you,
> >>> Suravee
> >>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> iommu mailing list
> >>> iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx
> >>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
> >>
> >
> >
>
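
To make the AER-like callback idea quoted above a bit more concrete, here is a rough userspace model of how a driver might get invoked on an IOMMU fault and answer with an action. Everything in it is hypothetical (enum fault_action, fault_cb_register(), fault_dispatch(), demo_driver_fault(), the fixed-size table); it is not the kernel's iommu_set_fault_handler()/report_iommu_fault() interface, just a sketch of the shape such a mechanism could take.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdicts a driver could return for a fault on its device. */
enum fault_action { FAULT_IGNORED, FAULT_QUIESCE, FAULT_RESET };

/* Hypothetical callback type: device ID, faulting address, driver context. */
typedef enum fault_action (*fault_cb)(uint16_t devid, uint64_t addr, void *ctx);

#define MAX_DEVS 8   /* arbitrary table size for the sketch */

struct fault_slot {
    uint16_t devid;
    fault_cb cb;
    void *ctx;
};

static struct fault_slot slots[MAX_DEVS];
static size_t nr_slots;

/* A driver registers interest in faults for one device. Returns 0 or -1. */
static int fault_cb_register(uint16_t devid, fault_cb cb, void *ctx)
{
    assert(cb != NULL);
    if (nr_slots == MAX_DEVS)
        return -1;
    slots[nr_slots++] = (struct fault_slot){ devid, cb, ctx };
    return 0;
}

/* Called by the (modelled) IOMMU event path for each device-specific fault. */
static enum fault_action fault_dispatch(uint16_t devid, uint64_t addr)
{
    for (size_t i = 0; i < nr_slots; i++)
        if (slots[i].devid == devid)
            return slots[i].cb(devid, addr, slots[i].ctx);
    return FAULT_IGNORED;   /* no driver bound yet, e.g. early in boot */
}

/* Example driver policy: quiesce on the first faults, reset on the third. */
static enum fault_action demo_driver_fault(uint16_t devid, uint64_t addr, void *ctx)
{
    unsigned int *seen = ctx;

    (void)devid;
    (void)addr;
    return (++*seen >= 3) ? FAULT_RESET : FAULT_QUIESCE;
}
```

The point of returning an action rather than acting inside the IOMMU code is that the device driver, not the IOMMU driver, judges whether a fault is transient. Faults for a device with no registered callback fall through untouched, which matches the boot-time window before drivers have started.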

