Re: [External] Re: [PATCH v2] iommu/vt-d: fix system hang on reboot -f

From: Ethan Zhao
Date: Thu Feb 27 2025 - 23:34:44 EST



在 2025/2/28 10:18, yunhui cui 写道:
Hi All,

On Fri, Feb 28, 2025 at 8:51 AM Ethan Zhao <etzhao1900@xxxxxxxxx> wrote:

On 2/28/2025 4:38 AM, Jason Gunthorpe wrote:
On Thu, Feb 27, 2025 at 08:40:31AM +0800, Ethan Zhao wrote:
On 2/26/2025 9:04 PM, Jason Gunthorpe wrote:
On Wed, Feb 26, 2025 at 01:55:28PM +0800, Ethan Zhao wrote:
Provided the system does not respond to those events when this function
is called, it's fine to remove the lock.
I agree.
I think it is running the destruction of the iommu far too late in the
process. IMHO it should be done after all the drivers have been
shutdown, before the CPUs go single threaded.
Hmm... so far it is fine, the iommu_shutdown only has a little work to
do, disable the translation, the PMR disabling is just backward compatible,
was deprecated already. if we move it to one position where all CPUs are
cycling, we don't know what kind of user-land tasks left there (i.e. reboot -f
case), it would be hard to full-fill the requirement of Intel VT-d, no ongoing
transaction there on hardware when issue the translation disabling command.
There is no guarentee device dma is halted anyhow at this point either.
The -f option to reboot command, suggests data corruption possibility.although

the IOMMU strives not to cross the transaction boundaries of its address
translation.

over all, we should make the reboot -f function works. not to hang
there. meanwhile

it doesn't break anything else so far.
patch v1:
if (!down_write_trylock(&dmar_global_lock))
return;

We should also think about the corner case of reboot without -f option, there might be something

wrong with one of the devices there on other CPUs they don't release the lock and were shutdown

by NMI, the dmar_global_lock is held by one of them, so we just bail out here at at once ?

leaving the IOMMU translation etc in enabling status. Is it better to loop with 10 seconds timeout

to try the lock then make sure no better choice, then continue the disabling of IOMMU translation & PMR ?.

Even the reboot -f case, leaving the IOMMU translation in enabling status is worse choice.

So seems we shouldn't bail out immediately after one time trylock whatever.


patch v3:
/* All other CPUs were brought down, hotplug interrupts were disabled,
no lock and RCU checking needed anymore*/
list_for_each_entry(drhd, &dmar_drhd_units, list) {
iommu = drhd->iommu;
/* Disable PMRs explicitly here. */
iommu_disable_protect_mem_regions(iommu);
iommu_disable_translation(iommu);
}

Since we can remove down_write/up_write, it indicates that during the
IOMMU shutdown process, we are not particularly concerned about
whether others are accessing the critical section. Therefore, it can
be understood that Patch v1 can successfully acquire the lock and
proceed with subsequent operations. From this perspective, Patch v1 is
reasonable and can prevent situations where there is an actual owner
of dmar_global_lock, even though it does not perform a real IOMMU
shutdown.

However, if the IOMMU shutdown truly fails, IOMMU_WAIT_OP will trigger
a panic(). Removing down_write/up_write offers better maintainability
than Patch v1 (as we can retrieve the cause from the vmcore). But this
might not be significant, since the reboot could have been completed
successfully, and the occurrence of panic() might instead cause
confusion.

In summary, it is essential to complete the reboot action here rather
than causing the system to hang. So, which do you prefer, v1 or v3?

Would v3 miss something ? please comment.

normal reboot/shutdown without -f case, function works right without any data corrupt or

leaving device in confusion status, good or bad situation.

reboot -f case, functions works, strives to not corrupt data or leave device in confusion status.

...

Thanks,

Ethan



Thanks,

Ethan

Jason
Thanks,
Yunhui

--
"firm, enduring, strong, and long-lived"