[PATCH 0/2 v2] iommu: amd: Fix intremap IO_PAGE_FAULT for VMs

From: Suravee Suthikulpanit
Date: Thu Sep 03 2020 - 05:36:48 EST


An interrupt-remapping IO_PAGE_FAULT has been observed on systems with a
large number of VMs with pass-through devices. This can be reproduced with
64 VMs + 64 pass-through VFs of Mellanox MT28800 Family [ConnectX-5 Ex],
where each VM runs a small-packet netperf test via the pass-through device
to the netserver running on the host. All VMs run in a reboot loop to
trigger IRTE updates.

In addition, to accelerate the failure, irqbalance is triggered periodically
(e.g. every 1-5 sec), which should generate a large number of IRTE updates.
This setup generally triggers the IO_PAGE_FAULT within 3-4 hours.

Investigation has shown that the issue is in the code that updates the IRTE
while remapping is enabled. Please see patch 2/2 for a detailed discussion.
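
For readers who want a feel for the kind of update involved, here is a
minimal user-space sketch of the idea named in the patch 1/2 subject
(keeping IRTE.RemapEn set across a reprogram). It is not the driver code:
struct irte128, IRTE_REMAP_EN and irte_reprogram() are hypothetical names,
and only the RemapEn bit is modeled, assumed here to be bit 0 of the low
64-bit word. The atomic 128-bit write that patch 2/2 covers is not shown.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit IRTE; not the kernel's definition. */
#define IRTE_REMAP_EN (1ULL << 0)   /* assumed position of RemapEn */

struct irte128 {
	uint64_t lo;
	uint64_t hi;
};

/*
 * Build the new entry contents from the requested values, but carry
 * over the RemapEn bit of the old entry so that reprogramming does not
 * change whether remapping is enabled for this entry.
 */
static struct irte128 irte_reprogram(struct irte128 old,
				     uint64_t new_lo, uint64_t new_hi)
{
	struct irte128 updated;
	uint64_t remap_en = old.lo & IRTE_REMAP_EN;

	/* Ignore RemapEn in the requested value; use the saved one. */
	updated.lo = (new_lo & ~IRTE_REMAP_EN) | remap_en;
	updated.hi = new_hi;
	return updated;
}
```

In this sketch the caller would still have to publish the returned value
to the live table in a single atomic 128-bit store for the hardware to
never see a half-updated entry.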

This series has been tested in the setup described above for up to
96 hours without seeing the issue.

Changes from v1 (https://lkml.org/lkml/2020/9/2/26):
* Fix typos in comments and commit messages
* Fix logic to check for X86_FEATURE_CX16 support in patch 2/2

Thanks,
Suravee

Suravee Suthikulpanit (2):
  iommu: amd: Restore IRTE.RemapEn bit after programming IRTE
  iommu: amd: Use cmpxchg_double() when updating 128-bit IRTE

drivers/iommu/amd/Kconfig | 2 +-
drivers/iommu/amd/init.c | 21 +++++++++++++++++++--
drivers/iommu/amd/iommu.c | 19 +++++++++++++++----
3 files changed, 35 insertions(+), 7 deletions(-)

--
2.17.1