Longer term, the right way to handle this would likely be to use
POSIX AIO on sockets. With that interface it would be easier
to keep long queues of data in flight, which would be best for
the DMA engine.
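
For illustration only, here is a minimal sketch of what "long queues of data in flight" looks like with the POSIX AIO calls (aio_read, aio_error, aio_suspend, aio_return). The names read_with_queue, demo_over_socketpair, QUEUE_DEPTH and CHUNK are hypothetical and not taken from any patch. Note the assumption: glibc implements POSIX AIO with user-space helper threads, so this shows the shape of the interface rather than any DMA-engine behavior. Older glibc versions may need -lrt at link time.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

enum { QUEUE_DEPTH = 4, CHUNK = 16 };   /* illustrative sizes only */

/* Hypothetical helper: post QUEUE_DEPTH reads on fd at once, then reap them.
 * Returns the total number of bytes received, or -1 on any error. */
ssize_t read_with_queue(int fd, char bufs[QUEUE_DEPTH][CHUNK])
{
    struct aiocb cbs[QUEUE_DEPTH];
    ssize_t total = 0;
    int submitted = 0, failed = 0, i;

    assert(fd >= 0);

    /* Submit everything up front so several buffers are in flight. */
    for (i = 0; i < QUEUE_DEPTH; i++) {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf = bufs[i];
        cbs[i].aio_nbytes = CHUNK;
        if (aio_read(&cbs[i]) != 0) {
            failed = 1;
            break;
        }
        submitted++;
    }

    /* Reap every request that was actually queued, even after a failure,
     * because the control blocks live on this stack frame. */
    for (i = 0; i < submitted; i++) {
        const struct aiocb *const wait_list[1] = { &cbs[i] };
        int err;
        ssize_t n;

        while ((err = aio_error(&cbs[i])) == EINPROGRESS)
            aio_suspend(wait_list, 1, NULL);
        n = aio_return(&cbs[i]);
        if (err != 0 || n < 0)
            failed = 1;
        else
            total += n;
    }
    return failed ? -1 : total;
}

/* Hypothetical demo: feed one end of a local socket pair, then read the
 * other end through the queued interface. */
ssize_t demo_over_socketpair(void)
{
    int sv[2];
    char msg[QUEUE_DEPTH * CHUNK];
    char bufs[QUEUE_DEPTH][CHUNK];
    ssize_t total;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    memset(msg, 'x', sizeof(msg));
    if (write(sv[0], msg, sizeof(msg)) != (ssize_t)sizeof(msg)) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    total = read_with_queue(sv[1], bufs);
    close(sv[0]);
    close(sv[1]);
    return total;
}
```

The point of the sketch is that all buffers are handed over before any completion is awaited, which is the property a copy engine would want; a real receive path would resubmit each buffer as soon as it completes.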
In addition to helping speed up network RX, I would like to see how feasible it is to use IOAT outside of networking. Sample ideas: VM page pre-zeroing; ATA PIO data transfers (async copy to a static buffer, to dramatically shorten the kmap+irqsave time); extremely large memcpy() calls.
Another proposal was swiotlb.
But it's not clear that's a good idea: a lot of these applications prefer to have the target in cache, and IOAT will force it out of cache.
Additionally, current IOAT is memory->memory. I would love to be able to convince Intel to add transforms and checksums, to enable offload of memory->transform->memory and memory->checksum->result operations like sha-{1,256} hashing[1], crc32*, aes crypto, and other highly common operations. All of that could be made async.
I remember the registers in the Amiga Blitter for this and I'm
still scared... Maybe it's better to keep it simple.
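
To make the memory->checksum->result idea above concrete, here is a purely software model of a fused copy-plus-checksum pass. It is not an IOAT interface and says nothing about what the hardware can do; copy_and_crc32 is a hypothetical name, and the choice of the reflected CRC-32 polynomial 0xEDB88320 is an assumption (the "crc32*" above may well mean other variants).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model: copy len bytes from src to dst and return
 * the CRC-32 (reflected polynomial 0xEDB88320) of the data in one pass. */
uint32_t copy_and_crc32(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uint32_t crc = 0xFFFFFFFFu;          /* standard initial value */
    size_t i;
    int bit;

    assert(dst != NULL && src != NULL);

    for (i = 0; i < len; i++) {
        d[i] = s[i];                     /* the "memory->memory" part */
        crc ^= s[i];                     /* the "checksum" part */
        for (bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;                         /* final inversion */
}
```

Calling copy_and_crc32(dst, "123456789", 9) copies the nine bytes and returns 0xCBF43926, the usual CRC-32 check value. An offloaded version would return the same result asynchronously, which is where the extra register complexity would come from.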