RE: The ultimate TOE design

From: Leonid Grossman
Date: Thu Sep 16 2004 - 10:13:58 EST




> -----Original Message-----
> From: jamal [mailto:hadi@xxxxxxxxxx]
> Sent: Thursday, September 16, 2004 2:58 AM
> To: Leonid Grossman
> Cc: 'Jeff Garzik'; 'David S. Miller';
> alan@xxxxxxxxxxxxxxxxxxx; paul@xxxxxxxx; netdev@xxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx
> Subject: RE: The ultimate TOE design
>
> On Thu, 2004-09-16 at 01:25, Leonid Grossman wrote:
> >
> > > -----Original Message-----
> > > From: jamal [mailto:hadi@xxxxxxxxxx]
>
> > > On a serious note, I think that PCI-express (if it lives up to its
> > > expectations) will demolish the dreams of a lot of these TOE
> > > investments.
> > > Our problem is NOT the CPU right now (80% idle processing 450Kpps
> > > forwarding). Bus and memory distance/latency are.
> >
> > In servers, both bottlenecks are there - if you look at the cost of
> > TCP and filesystem processing at 10GbE, CPU is a huge problem (and
> > will be for foreseeable future), even for fastest 64-bit systems.
>
> True, but with the bus contention being a non-issue you got
> more of that xeon being available for use (lets say i can use
> 50% more of its capacity then i can do more). IOW, it becomes
> a compute capacity problem mostly - one that you should in
> theory be able to throw more CPU at. SMT (the way power5 and
> some of the network processors do it[1]) should go a long way
> to address both additional compute and hardware threading to
> work around memory latencies. With PCI-express, compute power
> in mini-clustering in the form of AS
> (http://www.asi-sig.org/home) is being plotted as we speak.
> To summarize: The problem to solve in 24 months may be 100GigE.
>
> > I agree though that bus and memory are bigger issues, this is exactly
> > the reason for all these RDMA over Ethernet investments :-)
>
> And AS does a damn good job at specing all those RDMA
> requirements; my view is that intel is going to build them
> chips - so it can be done on a $5 board off the pacific rim.
> This takes most of the small players out of the market.
>
> > Anyways, did not mean to start an argument - with all the new CPU, bus
> > and HBA technologies coming to the market it will be another 18-24
> > months before we know what works and what doesn't...
>
> Agreed. Would you like to invest in something that will be obsoleted
> in 18-24 months though? Or even not obsoleted, but holds that
> uncertainty?
> I think that's the risk facing you when you are in the offload
> business.

Well... Any business has risks; this one doesn't seem to carry more than
others :-)
I view the 18-24 month timeframe as the start of offload mass adoption, not
the end of it.

In our tests, the bus contention and the %cpu are mostly orthogonal
problems; PCI-X DDR and PCI-Express will help but only to a point.
(BTW this is all related to the higher end systems - 2-4 way and above,
running 10GbE NICs. Client is a different story, cpu is mostly "free"
there).
My sense is that (unlike in previous cycles) the "slow host, fast network"
scenario is here to stay for a long while, and will have to be addressed one
way or another - whether by a full TOE+RDMA offload in the longer run, or by
an improvement to "static" offloads.
In server space, applications will never be happy with less than 80% cpu.

Leonid

>
> Here are results for Hifn 7956 ref board on 2.6GHz P4 (HT)
> system, kernel 2.6.6 SMP as compared to a s/ware only setup
> on same machine.
> [Name of tester withheld to protect privacy].
>
> first column - algo, second - packet size, third - time in us
> spent by hw crypto, fourth - time in us spent by sw crypto:
>
> des 64: 28 3
> des 128: 29 6
> des 192: 33 9
> des 256: 33 12
> des 320: 37 15
> des 384: 38 18
> des 448: 41 21
> des 512: 42 23
> des 576: 45 26
> des 640: 46 29
> des 704: 49 33
> des 768: 50 35
> des 832: 53 38
> des 896: 54 41
> des 960: 57 44
> des 1024: 58 47
> des 1088: 61 50
> des 1152: 62 53
> des 1216: 66 56
> des 1280: 66 59
> des 1344: 70 62
> des 1408: 71 65
> des 1472: 74 68
> des3_ede 64: 28 6
> des3_ede 128: 30 13
> des3_ede 192: 34 20
> des3_ede 256: 43 26
> des3_ede 320: 38 33
> des3_ede 384: 48 40
> des3_ede 448: 44 45
> des3_ede 512: 54 53
> des3_ede 576: 50 60
> des3_ede 640: 59 67
> des3_ede 704: 55 74
> des3_ede 768: 66 78
> des3_ede 832: 61 85
> des3_ede 896: 72 94
> des3_ede 960: 67 100
> des3_ede 1024: 77 107
> des3_ede 1088: 73 114
> des3_ede 1152: 82 121
> des3_ede 1216: 79 127
> des3_ede 1280: 88 128
> des3_ede 1344: 84 135
> des3_ede 1408: 94 147
> des3_ede 1472: 90 153
> aes 64: 28 2
> aes 192: 33 6
> aes 320: 37 10
> aes 448: 46 15
> aes 576: 53 19
> aes 704: 53 23
> aes 832: 65 28
> aes 960: 66 32
> aes 1088: 71 37
> aes 1216: 80 41
> aes 1344: 83 45
> aes 1472: 92 50
>
> Moral of the data above: The 2.6GHz P4 is already showing signs
> of obsoleting the hifn crypto offloader[2]. I think it took
> less than a year for it to happen.
>
> cheers,
> jamal
>
> [1] I also like the MIPS.com approach to SMT
>
> [2] There are actually issues with some of the crypto
> offloading in Linux; however this does serve as a good example.
>
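
One way to read the table quoted above is to list, per algorithm, the packet
sizes at which the Hifn board took less time than the software path. A minimal
Python sketch follows. The rows are copied from the quoted results; the names
RESULTS, parse_rows and hw_faster_sizes are made up for illustration and are
not part of any tool mentioned in the thread.

```python
# Rows copied from the quoted results: "algo size: hw_us sw_us".
RESULTS = """
des 64: 28 3
des 128: 29 6
des 192: 33 9
des 256: 33 12
des 320: 37 15
des 384: 38 18
des 448: 41 21
des 512: 42 23
des 576: 45 26
des 640: 46 29
des 704: 49 33
des 768: 50 35
des 832: 53 38
des 896: 54 41
des 960: 57 44
des 1024: 58 47
des 1088: 61 50
des 1152: 62 53
des 1216: 66 56
des 1280: 66 59
des 1344: 70 62
des 1408: 71 65
des 1472: 74 68
des3_ede 64: 28 6
des3_ede 128: 30 13
des3_ede 192: 34 20
des3_ede 256: 43 26
des3_ede 320: 38 33
des3_ede 384: 48 40
des3_ede 448: 44 45
des3_ede 512: 54 53
des3_ede 576: 50 60
des3_ede 640: 59 67
des3_ede 704: 55 74
des3_ede 768: 66 78
des3_ede 832: 61 85
des3_ede 896: 72 94
des3_ede 960: 67 100
des3_ede 1024: 77 107
des3_ede 1088: 73 114
des3_ede 1152: 82 121
des3_ede 1216: 79 127
des3_ede 1280: 88 128
des3_ede 1344: 84 135
des3_ede 1408: 94 147
des3_ede 1472: 90 153
aes 64: 28 2
aes 192: 33 6
aes 320: 37 10
aes 448: 46 15
aes 576: 53 19
aes 704: 53 23
aes 832: 65 28
aes 960: 66 32
aes 1088: 71 37
aes 1216: 80 41
aes 1344: 83 45
aes 1472: 92 50
"""


def parse_rows(text):
    """Turn 'algo size: hw sw' lines into (algo, size, hw_us, sw_us) tuples."""
    rows = []
    for line in text.strip().splitlines():
        name_and_size, times = line.split(":")
        algo, size = name_and_size.split()
        hw_us, sw_us = (int(t) for t in times.split())
        rows.append((algo, int(size), hw_us, sw_us))
    return rows


def hw_faster_sizes(rows):
    """Map each algorithm to the packet sizes where hw time < sw time."""
    wins = {}
    for algo, size, hw_us, sw_us in rows:
        wins.setdefault(algo, [])
        if hw_us < sw_us:
            wins[algo].append(size)
    return wins


if __name__ == "__main__":
    for algo, sizes in hw_faster_sizes(parse_rows(RESULTS)).items():
        print(algo, sizes if sizes else "hardware never faster")
```

On the quoted numbers, hardware is never faster for des or aes. For des3_ede
it is faster at 448 bytes and then from 576 bytes up (at 512 bytes it loses by
one microsecond), which is consistent with the "moral" drawn from the data.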
