Re: [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges

From: Mike Habeck
Date: Thu Mar 31 2011 - 21:07:42 EST




Chris Wright wrote:
* Mike Habeck (habeck@xxxxxxx) wrote:
On 03/31/2011 06:25 PM, Mike Travis wrote:
I'll probably need help from our Hardware PCI Engineer to help explain
this further, though here's a pointer to an earlier email thread:

http://marc.info/?l=linux-kernel&m=129259816925973&w=2

I'll also dig out the specs you're asking for.

Thanks,
Mike

Chris Wright wrote:
* Mike Travis (travis@xxxxxxx) wrote:
Chris - did you have any comment on this patch?
It doesn't actually look right to me. It means that particular range
is no longer reserved. But perhaps I've misunderstood something.

Mike Travis wrote:
dmar_init_reserved_ranges() reserves the card's MMIO ranges to
prevent handing out a DMA map that would overlap with the MMIO range.
The problem is that while the Nvidia GPU has 64bit BARs and is capable
of receiving > 40bit PIOs, it can't generate > 40bit DMAs.
I don't understand what you mean here.
What Mike is getting at is there is no reason to reserve the MMIO
range if it's greater than the dma_mask, given the MMIO range is
outside of what the IOVA code will ever hand back to the IOMMU
code. In this case the nVidia card has a 64bit BAR and is assigned
the MMIO range [0xf8200000000 - 0xf820fffffff]. But the Nvidia
card can only generate a 40bit DMA (thus has a 40bit dma_mask). If
the IOVA code honors the limit_pfn (i.e., dma_mask) passed in, it
will never hand a >40bit address back to the IOMMU code. Thus
there is no reason to reserve the card's MMIO range if it is greater
than the dma_mask. (And that is what the patch is doing.)

The reserved ranges are for all devices. Another device with a 64bit
dma_mask could get that region if it's not properly reserved. The
driver would then program that device to DMA to an address that is an
alias of an MMIO region. The memory transaction travels up towards
root...and sees the MMIO range in some bridge and would go straight down
to the GPU.
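
To make the two positions above concrete, here is a minimal sketch in Python (not kernel code) that checks whether the GPU's BAR from this thread is reachable under a given dma_mask. The helper names dma_mask() and range_reachable() are made up for illustration; only the BAR addresses and the 40bit/64bit masks come from the thread.

```python
def dma_mask(bits):
    """Highest address a device with a `bits`-wide DMA mask can generate."""
    return (1 << bits) - 1


def range_reachable(start, end, mask):
    """True if any address in [start, end] is at or below `mask`."""
    return start <= mask


# MMIO range assigned to the GPU's 64bit BAR, as quoted in the thread.
GPU_BAR = (0xf8200000000, 0xf820fffffff)

# The GPU itself (40bit dma_mask) can never be handed an address in its BAR...
print(range_reachable(*GPU_BAR, dma_mask(40)))  # False
# ...but a different device with a 64bit dma_mask could be, if the range
# were not reserved for it as well.
print(range_reachable(*GPU_BAR, dma_mask(64)))  # True
```

So dropping the reservation looks harmless only from the point of view of the 40bit device; it says nothing about other devices allocating from the same space.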

Chris,

OK, I understand now what you meant by the patch possibly causing
the DMA transaction to become a peer to peer transaction. Mike and
I will have to rethink this one. Thanks for your input.

-mike



More below...

So when the iommu code reserves these MMIO ranges a > 40bit
entry ends up getting in the rbtree. On a UV test system with
the Nvidia cards, the BARs are:

0001:36:00.0 VGA compatible controller: nVidia Corporation GT200GL
    Region 0: Memory at 92000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
    Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]

So this 44bit MMIO address 0xf8200000000 ends up in the rbtree. As DMA
maps get added and deleted from the rbtree we can end up getting a cached
entry to this 0xf8200000000 entry... this is what results in the code
handing out the invalid DMA map of 0xf81fffff000:

[ (0xf8200000000 - 1) >> PAGE_SHIFT << PAGE_SHIFT ]
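
The bracketed expression reads as rounding the address just below the BAR down to a page boundary. A quick check of that arithmetic in Python, assuming 4 KiB pages (a shift of 12), which the thread does not state explicitly; page_below() and dma_mask() are illustrative names only:

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def page_below(addr, page_shift=PAGE_SHIFT):
    """Start address of the page that ends just below `addr`."""
    return ((addr - 1) >> page_shift) << page_shift


def dma_mask(bits):
    """Highest address a device with a `bits`-wide DMA mask can generate."""
    return (1 << bits) - 1


bar_start = 0xf8200000000          # Region 1 of the GPU
iova = page_below(bar_start)

print(hex(iova))                   # 0xf81fffff000
print(bar_start.bit_length())      # 44
print(iova > dma_mask(40))         # True: out of reach for a 40bit device
```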

The IOVA code needs to better honor the "limit_pfn" when allocating
these maps.
This means we could get the MMIO address range (it's no longer reserved).
Not true: the MMIO address is greater than the dma_mask (i.e., the
limit_pfn passed into alloc_iova()), thus the IOVA code will never
hand back that address range.

Well, as you guys are seeing, the iova allocation code is making the
assumption that if the range is in the tree, it's valid. And it is
handing out an address that's too large.

It seems to me the DMA transaction would then become a peer to peer
transaction if ACS is not enabled, which could show up as random register
writes in that GPU's 256M BAR (i.e. broken).

The iova allocation should not hand out an address bigger than the
dma_mask. What is the device's dma_mask?
Agree. But there is a bug. The IOVA code doesn't validate the limit_pfn
if it uses the cached entry. One could argue that it should validate
the limit_pfn, but then again an entry outside the limit_pfn should
never have got into the rbtree... (it got in due to the IOMMU's
dmar_init_reserved_ranges() adding it).

Yeah, I think it needs to be in the global reserved list. But perhaps
not copied into the domain specific iova. Or simply skipped on iova
allocation (don't just assume rb_last is <= dma_mask).
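
As a rough illustration of the "simply skipped on iova allocation" idea, here is a toy top-down allocator in Python. It is not the kernel's alloc_iova(): the reserved ranges are a plain list of non-overlapping (pfn_lo, pfn_hi) tuples rather than an rbtree, there is no cached node, and alloc_pfn_top_down() is a made-up name. The point is only that the search starts at limit_pfn and steps over any reserved range lying entirely above it, instead of starting from the highest entry.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def alloc_pfn_top_down(reserved, limit_pfn):
    """Return the highest free pfn <= limit_pfn, or None if there is none.

    `reserved` is an iterable of non-overlapping (pfn_lo, pfn_hi) ranges.
    """
    candidate = limit_pfn
    for lo, hi in sorted(reserved, reverse=True):
        if lo > candidate:
            continue        # range lies entirely above the candidate: skip it
        if hi < candidate:
            break           # candidate sits in the free gap above this range
        candidate = lo - 1  # candidate falls inside the range: move below it
    return candidate if candidate >= 0 else None


# The GPU's 64bit BAR from the thread, expressed in page frame numbers.
gpu_bar = (0xf8200000000 >> PAGE_SHIFT, 0xf820fffffff >> PAGE_SHIFT)
limit_40bit = ((1 << 40) - 1) >> PAGE_SHIFT

# A 40bit device gets a pfn at its own limit; the BAR is stepped over.
print(hex(alloc_pfn_top_down([gpu_bar], limit_40bit)))   # 0xfffffff

# A device whose limit falls inside the BAR is pushed just below it.
print(hex(alloc_pfn_top_down([gpu_bar], 0xf8200fff)))    # 0xf81fffff
```

With this shape the range can stay reserved for every device, and a device with a smaller mask never ends up just below an entry that lies above its limit.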

thanks,
-chris