Re: Memory providers multiplexing (Was: [PATCH net-next v4 4/5] page_pool: remove PP_FLAG_PAGE_FRAG flag)

From: Mina Almasry
Date: Tue Jul 25 2023 - 00:04:44 EST


On Mon, Jul 24, 2023 at 7:56 AM Jesper Dangaard Brouer
<jbrouer@xxxxxxxxxx> wrote:
>
>
>
> On 17/07/2023 03.53, Mina Almasry wrote:
> > On Fri, Jul 14, 2023 at 8:55 AM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> >>
> >> On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote:
> >>
> >>> Once the skb frags with struct new_abstraction are in the TCP stack,
> >>> they will need some special handling in code accessing the frags. But
> >>> my RFC already addressed that somewhat because the frags were
> >>> inaccessible in that case. In this case the frags will be both
> >>> inaccessible and will not be struct pages at all (things like
> >>> get_page() will not work), so more special handling will be required,
> >>> maybe.
> >>
> >> It seems sort of reasonable, though there will be interesting concerns
> >> about coherence and synchronization with generial purpose DMABUFs that
> >> will need tackling.
> >>
> >> Still it is such a lot of churn and weridness in the netdev side, I
> >> think you'd do well to present an actual full application as
> >> justification.
> >>
> >> Yes, you showed you can stick unordered TCP data frags into GPU memory
> >> sort of quickly, but have you gone further with this to actually show
> >> it is useful for a real world GPU centric application?
> >>
> >> BTW your cover letter said 96% utilization, the usual server
> >> configuation is one NIC per GPU, so you were able to hit 1500Gb/sec of
> >> TCP BW with this?
> >>
> >
> > I do notice that the number of NICs is missing from our public
> > documentation so far, so I will refrain from specifying how many NICs
> > are on those A3 VMs until the information is public. But I think I can
> > confirm that your general thinking is correct, the perf that we're
> > getting is 96.6% line rate of each GPU/NIC pair,
>
> What do you mean by 96.6% "line rate".
> Is is the Ethernet line-rate?
>

Yes I believe this is the ethernet line-rate. I.e. the 200 Gbits/sec
that my NICs run.

> Is the measured throughput the measured TCP data "goodput"?

Yes, it is goodput. Roughly I believe we add up the return values of
recvmsg() and divide that number by time (very roughly, I think).

> Assuming
> - MTU 1500 bytes (1514 on wire).
> - Ethernet header 14 bytes
> - IP header 20 bytes
> - TCP header 20 bytes
>
> Due to header overhead the goodput will be approx 96.4%.
> - (1514-(14+20+20))/1514 = 0.9643
> - (Not taking Ethernet interframe gap into account).
>
> Thus, maybe you have hit Ethernet wire line-rate already?

My MTU is 8244 actually, which gives me 8192 mss/payload for my
connections. By my math the theoretical max would be 1 - 52/8244 =
~99.3%. So it looks like I'm dropping ~3% line rate somewhere in the
implementation.

>
> > and scales linearly
> > for each NIC/GPU pair we've tested with so far. Line rate of each
> > NIC/GPU pair is 200 Gb/sec.
> >
> > So if we have 8 NIC/GPU pairs we'd be hitting 96.6% * 200 * 8 = 1545 GB/sec.
>
> Lets keep our units straight.
> Here you mean 1545 Gbit/sec, which is 193 GBytes/s
>

Yes! Sorry! I definitely meant 1545 Gbits/sec, sorry!

> > If we have, say, 2 NIC/GPU pairs, we'd be hitting 96.6% * 200 * 2 = 384 GB/sec
>
> Here you mean 384 Gbit/sec, which is 48 GBytes/sec.
>

Correct again!

> > ...
> > etc.
> >
>
> These massive throughput numbers are important, because they *exceed*
> the physical host RAM/DIMM memory speeds.
>
> This is the *real argument* why software cannot afford to do a single
> copy of the data from host-RAM into GPU-memory, because the CPU memory
> throughput to DRAM/DIMM are insufficient.
>
> My testlab CPU E5-1650 have 4 DIMM slots DDR4
> - Data Width: 64 bits (= 8 bytes)
> - Configured Memory Speed: 2400 MT/s
> - Theoretical maximum memory bandwidth: 76.8 GBytes/s (2400*8*4)
>
> Even the theoretical max 76.8 GBytes/s (614 Gbit/s) is not enough for
> the 193 GBytes/s or 1545 Gbit/s (8 NIC/GPU pairs).
>
> When testing this with lmbench tool bw_mem, the results (below
> signature) are in the area 14.8 GBytes/sec (118 Gbit/s), as soon as
> exceeding L3 cache size. In practice it looks like main memory is
> limited to reading 118 Gbit/s *once*. (Mina's NICs run at 200 Gbit/s)
>
> Given DDIO can deliver network packets into L3, I also tried to figure
> out what the L3 read bandwidth, which I measured to be 42.4 GBits/sec
> (339 Gbit/s), in hopes that it would be enough, but it was not.
>
>

Yes, avoiding any memory speed bottleneck as you note is important,
but the second point mentioned in my cover letter is also impactful:

" Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
of the PCIe tree, compared to traditional path which sends data through the
root complex."

Depending on the hardware, this is a bottleneck that we avoid with
device memory TCP. NIC/GPU copies occupy the PCIe link bandwidth. In a
hierarchy like this:

root complex
| (uplink)
PCIe switch
/ \
NIC GPU

I believe the uplink from the PCIe switch to the root complex is used
up 2 times for TX and 2 times for RX if the data needs to go through
host memory:

RX: NIC -> root complex -> GPU
TX: GPU -> root complex -> NIC

With device memory TCP, and enabling PCI P2P communication between the
devices under the same PCIe switch, the payload flows directly from/to
the NIC/GPU through the PCIe switch, and the payload never goes to the
root complex, alleviating pressure/bottleneck on that link between the
PCIe switch/root complex. I believe this is a core reason we're able
to scale throughput linearly with NIC/GPU pairs, because we don't
stress share uplink connections and all the payload data transfer
happens beneath the PCIe switch.

> --Jesper
> (data below signature)
>
> CPU under test:
>
> $ cat /proc/cpuinfo | egrep -e 'model name|cache size' | head -2
> model name : Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
> cache size : 15360 KB
>
>
> Providing some cmdline outputs from lmbench "bw_mem" tool.
> (Output format is "%0.2f %.2f\n", megabytes, megabytes_per_second)
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rd
> 256.00 14924.50
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M wr
> 256.00 9895.25
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rdwr
> 256.00 9737.54
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bcopy
> 256.00 12462.88
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bzero
> 256.00 14869.89
>
>
> Next output shows reducing size below L3 cache size, which shows an
> increase in speed, likely the L3 bandwidth.
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 64M rd
> 64.00 14840.58
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 32M rd
> 32.00 14823.97
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 16M rd
> 16.00 24743.86
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 8M rd
> 8.00 40852.26
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 4M rd
> 4.00 42545.65
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 2M rd
> 2.00 42447.82
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 1M rd
> 1.00 42447.82
>


--
Thanks,
Mina