Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
From: Bjorn Helgaas
Date: Wed Oct 22 2014 - 14:26:39 EST
On Wed, Oct 22, 2014 at 7:21 AM, Joerg Roedel <joro@xxxxxxxxxx> wrote:
> Hi Bjorn,
>
> On Tue, Oct 21, 2014 at 08:16:46PM -0600, Bjorn Helgaas wrote:
>> I was looking at Zhen-Hua's recent patches, trying to figure out if I
>> need to do anything with them. Resetting devices in the old kernel
>> seems like a non-starter. Resetting devices in the new kernel, ...,
>> well, maybe. It seems ugly, and it seems like the sort of problem
>> that IOMMUs are designed to solve.
>
> Actually resetting the devices in the kdump kernel would be one of the
> better solutions for this problem. When we have a generic way to stop
> all in-flight DMA from the PCI endpoints we could safely disable and
> then re-enable the IOMMU.
>
>> On Wed, Jul 2, 2014 at 7:32 AM, Joerg Roedel <joro@xxxxxxxxxx> wrote:
>> > That is a solution to prevent the in-flight DMA failures. But what
>> > happens when there is some in-flight DMA to a disk to write some inodes
>> > or a new superblock? Then this scratch address-space may cause
>> > filesystem corruption at worst.
>>
>> This in-flight DMA is from a device programmed by the old kernel, and
>> it would be reading data from the old kernel's buffers. I think
>> you're suggesting that we might want that DMA read to complete so the
>> device can update filesystem metadata?
>
> Well, it is not about updating filesystem metadata. In the kdump kernel
> we have these options:
>
> 1) Disable the IOMMU. The problem here is that DMA is now
> untranslated, so any in-flight DMA might read from or write
> to a random location in memory, corrupting the kdump or
> even the new kexec kernel memory. So this is a non-starter.
Agreed (at least if the IOMMU was enabled in the crashed kernel).
> 2) Re-program the IOMMU to block all DMA. This is safer as it
> does not corrupt any data in memory. But some devices react
> very poorly to a master abort from the IOMMU, so badly that the
> driver in the kdump kernel fails to initialize that device.
> In this case taking the dump itself might fail (and often does,
> according to reports).
Sounds like an option, even though broken devices work poorly.
> 3) To prevent master aborts like in option (2), David suggested
> mapping the whole DMA address space to a scratch page. This
> also prevents system memory corruption and the master aborts.
> But the problem is that in-flight DMA will now read all
> zeros. This can corrupt disk and network data; at worst it
> nulls out the superblocks of your filesystem and you lose all
> data. So this is not an option either.
Ah, yes, I see your point now. This allows corrupted data, e.g., all
zeroes, to be written to disk or network after the kernel crash. I
agree; this doesn't sound like a good option.
And the proposal below is a 4th option (leave the IOMMU enabled, reusing
the crashed kernel's mappings until drivers make new mappings).
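For concreteness, here is a minimal sketch of what option (3) amounts to,
written against the generic IOMMU API. iommu_map(), IOMMU_READ and
IOMMU_WRITE are the real interface; the 1 GiB window size, the helper name
and the lack of error unwinding are assumptions for illustration only:

#include <linux/gfp.h>
#include <linux/io.h>
#include <linux/iommu.h>
#include <linux/mm.h>
#include <linux/sizes.h>

/*
 * Sketch of option (3): point every IOVA in an (assumed) 1 GiB DMA
 * window at a single zeroed scratch page. In-flight DMA then neither
 * faults nor scribbles over RAM, but every read returns zeros -- which
 * is exactly the disk/network corruption risk described above.
 */
static int map_window_to_scratch_page(struct iommu_domain *domain)
{
	struct page *scratch = alloc_page(GFP_KERNEL | __GFP_ZERO);
	unsigned long iova;
	int ret;

	if (!scratch)
		return -ENOMEM;

	for (iova = 0; iova < SZ_1G; iova += PAGE_SIZE) {
		ret = iommu_map(domain, iova, page_to_phys(scratch),
				PAGE_SIZE, IOMMU_READ | IOMMU_WRITE);
		if (ret)
			return ret;	/* sketch: partial mappings not cleaned up */
	}
	return 0;
}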
> What we currently do in the VT-d driver is a mixture of (1) and (2). The
> VT-d driver disables the IOMMU hardware (opening a race window for
> memory data corruption), re-initializes it to reject any ongoing DMA
> (which causes master aborts for in-flight DMA) and re-enables the IOMMU
> hardware.
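At the register level, that sequence looks roughly like the sketch below.
The register offsets and bits (DMAR_GCMD_REG, DMAR_RTADDR_REG, DMA_GCMD_SRTP,
DMA_GCMD_TE) are the real definitions from include/linux/intel-iommu.h, but
the function itself is a simplification: the real driver preserves the other
command bits, holds the register lock and polls for completion after every
write.

#include <linux/intel-iommu.h>
#include <linux/io.h>

/*
 * Simplified illustration of the current kdump path: translation off,
 * empty root table in, translation back on. The window between step 1
 * and step 3 is the data-corruption race mentioned above; the master
 * aborts for in-flight DMA start at step 3.
 */
static void current_kdump_reinit_sketch(void __iomem *reg, u64 empty_root_phys)
{
	/* 1. Disable translation: DMA now reaches RAM untranslated. */
	writel(0, reg + DMAR_GCMD_REG);

	/* 2. Install a root table that maps nothing. */
	writeq(empty_root_phys, reg + DMAR_RTADDR_REG);
	writel(DMA_GCMD_SRTP, reg + DMAR_GCMD_REG);

	/* 3. Re-enable translation: in-flight DMA is now rejected. */
	writel(DMA_GCMD_TE, reg + DMAR_GCMD_REG);
}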
>
> But this strategy quite often fails in heavy IO environments, and we are
> looking into ways to solve the problem, or at least to improve the
> current situation as well as we can.
>
> I talked to David about this at LPC and we came up with basically this
> procedure:
>
> 1. If the VT-d driver finds the IOMMU enabled, it reuses its
> root-context table. This way the IOMMU does not have to be
> disabled and re-enabled, eliminating the race we have now.
> Other data structures like the context entries are copied
> over from the old kernel. We basically keep all mappings
> from the old kernel, allowing any in-flight DMA to succeed.
> No memory or disk data corruption.
If the crashed kernel had corrupted memory, couldn't an in-flight DMA
read that corrupted data from memory and write it to disk?
I guess you could argue that this is merely a race, and the in-flight
DMA could just as easily have happened before the kernel crash, so
there's always a window and the only question is whether it closes
when the IOMMU driver starts up or when the device driver starts up.
> The issue here is that we modify data from the old kernel
> which is about to be dumped. But there are ways to work
> around that.
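A rough sketch of what step (1) could look like at detection time follows.
DMAR_GSTS_REG, DMA_GSTS_TES and DMAR_RTADDR_REG are the real register and
bit names from include/linux/intel-iommu.h; the structure and the copy step
are assumptions for illustration only:

#include <linux/intel-iommu.h>
#include <linux/io.h>

/* Hypothetical per-IOMMU state for this sketch, not the real driver's. */
struct kdump_iommu_state {
	void __iomem *reg;	/* mapped VT-d register set */
	u64 old_root_phys;	/* root table inherited from the old kernel */
};

static bool take_over_old_root_table(struct kdump_iommu_state *iommu)
{
	u32 gsts = readl(iommu->reg + DMAR_GSTS_REG);

	if (!(gsts & DMA_GSTS_TES))	/* translation not enabled: nothing to reuse */
		return false;

	/* Remember the old kernel's root table; translation stays enabled. */
	iommu->old_root_phys = readq(iommu->reg + DMAR_RTADDR_REG);

	/*
	 * A real implementation would now copy the root and context
	 * entries into memory owned by the kdump kernel (so the data
	 * being dumped is not modified) and point DMAR_RTADDR_REG at
	 * the copy, without ever clearing the translation-enable bit.
	 */
	return true;
}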
>
> 2. When a device driver issues the first dma_map call for a
> device, we assign a new and empty page-table, thus removing
> all mappings from the old kernel for that device.
> The rationale is that at this point the device driver should
> have reset the device to a point where all in-flight DMA is
> canceled.
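To illustrate step (2) -- all names below are hypothetical, not the actual
intel-iommu structures -- the first dma_map for a device would do something
like this:

#include <linux/gfp.h>
#include <linux/spinlock.h>

/* Hypothetical per-device state for this sketch. */
struct kdump_dev_state {
	bool  uses_old_table;	/* still pointing at the old kernel's page table */
	void *pgtable;		/* current page-table root for this device */
	spinlock_t lock;
};

static void *switch_to_fresh_pgtable(struct kdump_dev_state *dev)
{
	unsigned long flags;

	spin_lock_irqsave(&dev->lock, flags);
	if (dev->uses_old_table) {
		/*
		 * The driver has reset the device by now, so in-flight
		 * DMA from the old kernel is assumed to be finished.
		 * (Allocation-failure handling omitted in this sketch.)
		 */
		dev->pgtable = (void *)get_zeroed_page(GFP_ATOMIC);
		dev->uses_old_table = false;
		/*
		 * ...update the context entry to point at dev->pgtable,
		 * then flush the IOTLB for this device...
		 */
	}
	spin_unlock_irqrestore(&dev->lock, flags);
	return dev->pgtable;
}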
>
> This approach goes in the same direction as Bill Sumners' patch-set,
> which Li took over. But it does not go as far as keeping the old
> mappings while the kdump kernel is still working with the devices
> (which might introduce new issues and corner cases).
>
>> > So with this in mind I would prefer initially taking over the
>> > page-tables from the old kernel before the device drivers re-initialize
>> > the devices.
>>
>> This makes the dump kernel more dependent on data from the old kernel,
>> which we obviously want to avoid when possible.
>
> Sure, but this is not really possible here (unless we have a generic and
> reliable way to reset all PCI endpoint devices and cancel all in-flight
> DMA before we disable the IOMMU in the kdump kernel).
> Otherwise we always risk data corruption somewhere, in system memory or
> on disk.
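For reference, such a "generic and reliable way to reset all PCI endpoint
devices" would conceptually look like the loop below. pci_reset_function()
and for_each_pci_dev() are real interfaces, but whether that reset actually
quiesces DMA on every device, and whether it is usable that early in the
kdump kernel, is exactly what is in doubt here -- treat this as a thought
experiment, not a solution:

#include <linux/pci.h>
#include <linux/printk.h>

/* Sketch: try to quiesce every PCI endpoint before touching the IOMMU. */
static void try_to_quiesce_all_endpoints(void)
{
	struct pci_dev *pdev = NULL;

	for_each_pci_dev(pdev) {
		if (pci_reset_function(pdev))
			pr_warn("kdump: could not reset %s\n", pci_name(pdev));
	}
}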
>
>> I didn't find the previous discussion where pointing every virtual bus
>> address at the same physical scratch page was proposed. Why was that
>> better than programming the IOMMU to reject every DMA?
>
> As I said, the problem is that this causes master aborts which some
> devices can't recover from, so that the device driver in the kdump
> kernel fails to initialize the device.
Yes, thanks for making that explicit again.
Bjorn