Re: dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8
From: Mario 'BitKoenig' Holbe
Date: Fri Mar 11 2011 - 13:04:17 EST
I pondered for a long time whether to reply to this, but sorry, I
couldn't resist.
On Thu, Mar 10, 2011 at 05:18:42PM -0800, Andi Kleen <ak@xxxxxxxxxxxxxxx> wrote:
> You probably need to find some way
> to make pcrypt (parallel crypt layer) work for dmcrypt. That may
> actually give you more speedup too than your old hack because
> it can balance over more cores.
"my" old "hack" balances well as long as the number of stripes is equal
or greater than the number of cores.
And for my specific case... it's hard to balance over more than 4 cores
on a Core2Quad :)
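For reference, the "old hack" is nothing more than RAID0 striped over
one dm-crypt mapping per disk. A minimal sketch (device names, key
file, and chunk size are placeholders, not my actual setup):

    # open one dm-crypt mapping per disk, one kcryptd each
    n=0
    for dev in /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2; do
        cryptsetup luksOpen --key-file /root/raid.key $dev crypt$n
        n=$((n+1))
    done
    # stripe RAID0 over the crypt mappings (not the other way around)
    mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=512 \
        /dev/mapper/crypt0 /dev/mapper/crypt1 \
        /dev/mapper/crypt2 /dev/mapper/crypt3

Since each mapping has its own kcryptd, a single read or write spanning
a full stripe fans out over up to 4 cores.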
> Or get a system with AES-NI -- that usually solves it too.
Honi soit qui mal y pense ("shame on him who thinks evil of it").
Of course I understand that Intel's primary goal is to sell new
hardware, and hence that you are required to suggest this. However,
based on the AES-NI benchmarks from the linux-crypto ML, even with
AES-NI it would be difficult, if not impossible, to regain my
(non-AES-NI!) pre-.38 performance under the .38 dm-crypt
parallelization approach.
> Frankly I don't think it's a very interesting case, the majority
> of workloads are not like that.
Well, I'm not sure if we understand each other.
My use case is probably a bit special, but that's not the point.
The main point is that the .38 dm-crypt parallelization approach kills
performance on *every* RAID0-over-dm-crypt setup. Such a setup is, I
believe, not as uncommon as you may think, because until .38 it was the
only way to spread disk encryption over multiple CPUs.
Up to .37, due to the lack of CPU affinity, accessing (reading or
writing) one stripe in the RAID0 always spread over min(#core,
#kcryptd) cores. With .38 the same access only ever utilizes one single
core, because all the chunks of the stripe are (obviously) submitted
from the same core: with the old RAID0-over-dm-crypt approach the
multiple underlying kcryptds now all pile up on that one core, and with
dm-crypt-over-RAID0 there is only one kcryptd involved in serving one
request anyway. Hence, for single requests the new approach always
decreases throughput and increases latency. The latency increase holds
even for multi-process workloads.
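To make that concrete (numbers are just for illustration): on a 4-disk
RAID0 with 512 KiB chunks, a single sequential 2 MiB read touches all 4
disks; up to .37 the 4 chunk-sized crypto jobs could run on up to 4
cores in parallel, with .38 they all queue up behind each other on the
submitting core. A single-stream read like the following should show
this directly (device name and sizes are placeholders):

    # one sequential reader: with .38 this tops out at roughly one
    # core's worth of AES throughput, with .37 at up to four
    dd if=/dev/md0 of=/dev/null bs=2M count=1024 iflag=direct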
For your approach to at least match the old one, it requires
min(#core, #kcryptd) parallel requests at all times, assuming latency
doesn't matter and disk seek time is zero (now you tell me to get
X25s, right? :)).
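To see where that break-even sits, one could generate exactly that many
parallel streams, e.g. along these lines (again just a sketch, counts
and offsets are placeholders):

    # keep 4 requests in flight by running 4 readers at disjoint
    # offsets, then compare against the single-reader numbers above
    for i in 0 1 2 3; do
        dd if=/dev/md0 of=/dev/null bs=2M count=256 \
           skip=$((i*256)) iflag=direct &
    done
    wait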
Mario
--
There are two major products that come from Berkeley: LSD and UNIX.
We don't believe this to be a coincidence. -- Jeremy S. Anderson