Re: 32-bit dma allocations on 64-bit platforms

From: Terence Ripperda
Date: Thu Jun 24 2004 - 10:50:41 EST


On Wed, Jun 23, 2004 at 04:46:44PM -0700, ak@xxxxxx wrote:
> pci_alloc_consistent is limited to 16MB, but so far nobody has really
> complained about that. If that should be a real issue we can make
> it allocate from the swiotlb pool, which is usually 64MB (and can
> be made bigger at boot time)

In all of the cases I've seen, it defaults to 4M: in swiotlb.c,
io_tlb_nslabs defaults to 1024, and 1024 * PAGE_SIZE == 4194304.


> Would that work for you too BTW ? How much memory do you expect
> to need?

potentially. our currently pending release uses pci_map_sg, which
relies on swiotlb for em64t systems. it "works", but we have some ugly
hacks and were hoping to get away from using it (at least in its
current form).

here are some of the problems we encountered:

probably the biggest problem is that the size is way too small for our
needs (more on our memory usage shortly). this is compounded by the
swiotlb code throwing a kernel panic when it can't allocate
memory. and if the panic doesn't halt the machine, the routine returns
a random value off the stack as the dma_addr_t.

for this reason, we have an ugly hack that notices that swiotlb is
enabled (it just checks if swiotlb is set) and prints a warning telling
the user to bump up the size of the swiotlb to 16384 slabs, or 64M.
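
roughly, the hack boils down to something like this (a from-memory
sketch, not our actual code; the function name is made up, and treat
the exact symbol the kernel exposes as an assumption):

    #include <linux/config.h>
    #include <linux/kernel.h>

    #ifdef CONFIG_SWIOTLB
    /* x86_64 sets this flag when the bounce-buffer path is in use */
    extern int swiotlb;
    #endif

    static void warn_if_swiotlb(void)
    {
    #ifdef CONFIG_SWIOTLB
            if (swiotlb)
                    printk(KERN_WARNING
                           "swiotlb is active; boot with swiotlb=16384 "
                           "(64M) to avoid exhausting the bounce pool\n");
    #endif
    }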

also, the proper usage of the bounce buffers, calling pci_dma_sync_*
around each transfer, would be a performance killer for us. we stream a
considerable amount of data to the gpu per second (on the order of
100s of Megs a second), so having to do an additional memcpy would
reduce performance considerably, in some cases by 30-50%.
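
to make the cost concrete, here is a minimal sketch of the streaming
pattern I mean, using the pci_dma_sync_single_for_device() flavour of
the sync call (pdev, push_buf and PUSH_BUF_SIZE are placeholders, not
our real symbols):

    #include <linux/pci.h>

    #define PUSH_BUF_SIZE (1024 * 1024)    /* ~1M push buffer */

    static void stream_to_gpu(struct pci_dev *pdev, void *push_buf)
    {
            dma_addr_t bus;

            /* with swiotlb underneath, this may hand back a
             * bounce-buffer address rather than the physical
             * address of push_buf */
            bus = pci_map_single(pdev, push_buf, PUSH_BUF_SIZE,
                                 PCI_DMA_TODEVICE);

            /* ... cpu/opengl fills push_buf with commands and
             * vertex data ... */

            /* the "proper" sync before the device reads the buffer;
             * on a bounced mapping this memcpy()s the whole 1M into
             * the swiotlb pool */
            pci_dma_sync_single_for_device(pdev, bus, PUSH_BUF_SIZE,
                                           PCI_DMA_TODEVICE);

            /* ... kick off dma from 'bus' ... */

            pci_unmap_single(pdev, bus, PUSH_BUF_SIZE, PCI_DMA_TODEVICE);
    }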

for this reason, we detect when the dma_addr != phys_addr, and map the
dma_addr directly to opengl to avoid the copy. I know this is ugly,
and that's one of the things I'd really like to get away from.
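
the detection itself is trivial; something along these lines
(hypothetical helper, and it assumes the buffer came from the kernel's
direct mapping so virt_to_phys() is meaningful; on em64t with no
hardware iommu, a mismatch means swiotlb bounced the mapping):

    #include <asm/io.h>

    /* returns non-zero if the dma layer handed us a bounce buffer
     * rather than the page we actually allocated */
    static int mapping_was_bounced(void *vaddr, dma_addr_t bus)
    {
            return bus != (dma_addr_t)virt_to_phys(vaddr);
    }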

finally, our driver already uses a considerable amount of memory, and
by definition the swiotlb interface doubles that memory usage. if our
driver used swiotlb correctly (as in didn't know about swiotlb and
always called pci_dma_sync_*), we'd still lock down the physical pages
opengl writes to, since they're normally used directly for dma, and on
top of that the pages allocated from the swiotlb pool would be locked
down as well (we currently do that locking manually, but if swiotlb is
supposed to be transparent to the driver and used for dma, I assume its
pool must already be locked down, perhaps by definition of being
bootmem?). this means not only is the memory usage doubled, but it's
all locked down and unpageable.

in this case, it would almost make more sense to treat the bootmem
allocated for swiotlb as a pool of 32-bit memory that can be allocated
from directly, rather than as bounce buffers. I don't know whether that
would be an acceptable interface though.
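
purely to illustrate the idea, such an interface might look something
like this (hypothetical, nothing like it exists today):

    #include <linux/types.h>
    #include <linux/pci.h>

    /* hypothetical: allocate pages directly out of the <4GB swiotlb
     * pool, returning both the kernel virtual address and the bus
     * address, instead of bouncing an existing buffer through it */
    void *pci_alloc_from_swiotlb(struct pci_dev *pdev, size_t size,
                                 dma_addr_t *dma_handle);
    void pci_free_to_swiotlb(struct pci_dev *pdev, size_t size,
                             void *vaddr, dma_addr_t dma_handle);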

but if we could come up with reasonable solutions to these problems,
this may work.

> drawback is that the swiotlb pool is not unified with the rest of the
> VM, so tying up too much memory there is quite unfriendly.
> e.g. if you can use up 1GB then i wouldn't consider this suitable,
> for 128MB max it may be possible.

I checked with our opengl developers on this. by default, we allocate
~64k for X's push buffer and ~1M per opengl client for their push
buffer. on quadro/workstation parts, we allocate 20M for the first
opengl client, then ~1M per client after that.

in addition to the push buffer itself, there is a lot of data that apps
send through it. this includes textures, vertex buffers, display
lists, etc. the amount of memory used for this varies greatly from app
to app. the 20M listed above includes the push buffer and memory for
these buffers (I think workstation apps tend to push a lot more
pre-processed vertex data to the gpu).

note that most agp apertures these days are in the 128M - 1024M range,
and there are times that we exhaust that memory on the low end. I
think our driver is greedy when trying to allocate memory for
performance reasons, but has good fallback cases. so being somewhat
limited on resources isn't too bad, just so long as the kernel doesn't
panic instead of failing the memory allocation.

I would think that 64M or 128M would be good. a nice feature of
swiotlb is the ability to tune it at boot. so if a workstation user
found they really did need more memory for performance, they could
tweak that value up for themselves.
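
for example (assuming the swiotlb= boot parameter keeps taking a slab
count, as it does in the current swiotlb.c), a workstation user could
put something like

    swiotlb=16384

on the kernel command line to get the 64M pool mentioned earlier.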

also remember future growth. PCI-E has something like 20/24 lanes that
can be split among multiple PCI-E slots. Alienware has already
announced multi-card products, and it's likely multi-card products
will be more readily available on PCI-E, since the slots should have
equivalent bandwidth (unlike AGP+PCI).

nvidia has also had workstation parts in the past with 2 gpus and a
bridge chip. each of these gpus ran twinview, so each card drove 4
monitors. these were pci cards, and some crazy vendors had 4+ of these
cards in a machine driving many monitors. this just pushes the memory
requirements up in special circumstances.

Thanks,
Terence