Re: zram: per-cpu compression streams

From: Sergey Senozhatsky
Date: Wed Mar 23 2016 - 04:11:03 EST


( was "[PATCH] zram: export the number of available comp streams"
forked from http://marc.info/?l=linux-kernel&m=145860707516861 )

d'oh.... sorry, now actually forked.


Hello Minchan,

forked into a separate thread.

> On (03/22/16 09:39), Minchan Kim wrote:
> > zram_bvec_write()
> > {
> >         *get_cpu_ptr(comp-stream);
> >         zcomp_compress();
> >         zs_malloc()
> >         put_cpu_ptr(comp-stream);
> > }
> >
> > this, however, makes zsmalloc unhappy. pool has GFP_NOIO | __GFP_HIGHMEM
> > gfp, and GFP_NOIO is ___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM. this
> > __GFP_DIRECT_RECLAIM is in conflict with per-cpu streams, because
> > per-cpu streams require disabled preemption (up until we copy stream
> > buffer to zspage). so what options do we have here... off the top of
> > my head (w/o a lot of thinking)...
>
> Indeed.
...
> How about this?
>
> zram_bvec_write()
> {
> retry:
>         *get_cpu_ptr(comp-stream);
>         zcomp_compress();
>         handle = zs_malloc((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN);
>         if (!handle) {
>                 put_cpu_ptr(comp-stream);
>                 handle = zs_malloc(gfp);
>                 goto retry;
>         }
>         put_cpu_ptr(comp-stream);
> }

interesting. the retry jump should go higher: we have "user_mem = kmap_atomic(page)",
which we unmap right after compression, because a) we don't need the
uncompressed memory anymore and b) zs_malloc() can sleep, so we can't keep an
atomic mapping around. the nasty thing here is is_partial_io(). we need to re-do

	if (is_partial_io(bvec))
		memcpy(uncmem + offset, user_mem + bvec->bv_offset,
		       bvec->bv_len);

once again in the worst case.

so zs_malloc((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN) can, so far, cause
a double memcpy() and a double compression. just to outline this.
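
a minimal userspace sketch of the retry flow discussed above, just to make the
worst case countable. nothing here is the real zram/zsmalloc API: struct
mock_state, stream_get(), stream_put(), compress_page(), alloc_nosleep(),
alloc_may_sleep() and mock_bvec_write() are all hypothetical stand-ins, and
keeping the handle obtained on the slow path across the retry is my assumption
(the quoted pseudocode does not spell that out).

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace mock only; none of these helpers exist in the kernel. */
struct mock_state {
	bool preempt_disabled;    /* stands in for get_cpu_ptr()/put_cpu_ptr() */
	int  fast_allocs_to_fail; /* how many non-sleeping allocations fail */
	int  compressions;        /* how many times the page was compressed */
	int  sleeping_allocs;     /* how many allocations were allowed to sleep */
};

static void stream_get(struct mock_state *s)
{
	assert(!s->preempt_disabled);
	s->preempt_disabled = true;
}

static void stream_put(struct mock_state *s)
{
	assert(s->preempt_disabled);
	s->preempt_disabled = false;
}

static void compress_page(struct mock_state *s)
{
	/* the per-cpu stream may only be used with preemption disabled */
	assert(s->preempt_disabled);
	s->compressions++;
}

/* models zs_malloc((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN) */
static unsigned long alloc_nosleep(struct mock_state *s)
{
	if (s->fast_allocs_to_fail > 0) {
		s->fast_allocs_to_fail--;
		return 0;
	}
	return 1;
}

/* models zs_malloc(gfp): may sleep, so preemption must be enabled */
static unsigned long alloc_may_sleep(struct mock_state *s)
{
	assert(!s->preempt_disabled);
	s->sleeping_allocs++;
	return 1;
}

static unsigned long mock_bvec_write(struct mock_state *s)
{
	unsigned long handle = 0;

retry:
	/* real code would kmap_atomic() and redo the partial-IO memcpy() here */
	stream_get(s);
	compress_page(s);
	/* real code would kunmap_atomic() here */

	if (!handle)
		handle = alloc_nosleep(s);
	if (!handle) {
		/* slow path: drop the stream, allocate with sleeping allowed */
		stream_put(s);
		handle = alloc_may_sleep(s);
		if (!handle)
			return 0;
		goto retry; /* stream buffer is gone: compress once more */
	}

	/* real code would copy the stream buffer into the zspage here */
	stream_put(s);
	return handle;
}
```

with fast_allocs_to_fail set to 0 the page is compressed once and nothing
sleeps; with it set to 1 the mock compresses twice and does exactly one
sleeping allocation, which is the double work outlined above.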


the test.

I executed a number of iozone tests, re-creating the zram device on each
iteration (3GB, LZO, EXT4; the box has 4 x86_64 CPUs).

$DEVICE_SZ=3G
$FREE_SPACE is 10% of $DEVICE_SZ
time ./iozone -t $i -R -r $((8*$i))K -s $((($DEVICE_SZ/$i - $FREE_SPACE)/(1024*1024)))M -I +Z
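
the arithmetic behind the -r and -s values can be sketched as below. this
assumes $DEVICE_SZ is 3G expressed in bytes and that $FREE_SPACE is an integer
10% of it, evaluated with the same integer division the shell does;
record_kb(), file_mb() and print_iozone_params() are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

#define DEVICE_SZ  (3ULL * 1024 * 1024 * 1024) /* 3G in bytes (assumed unit) */
#define FREE_SPACE (DEVICE_SZ / 10)            /* 10% of the device */

/* record size in KB for a run with `threads` iozone tasks: 8 * i */
static unsigned long long record_kb(unsigned threads)
{
	return 8ULL * threads;
}

/* per-task file size in MB: (DEVICE_SZ / i - FREE_SPACE) / (1024 * 1024) */
static unsigned long long file_mb(unsigned threads)
{
	/* beyond 10 tasks the subtraction would go negative */
	assert(threads >= 1 && threads <= 10);
	return (DEVICE_SZ / threads - FREE_SPACE) / (1024 * 1024);
}

/* print the iozone command line for each of the 10 iterations */
static void print_iozone_params(void)
{
	for (unsigned i = 1; i <= 10; i++)
		printf("iozone -t %u -R -r %lluK -s %lluM -I +Z\n",
		       i, record_kb(i), file_mb(i));
}
```

under these assumptions file_mb() gives 2764 for 1 task and 0 for 10 tasks,
the same -s values that appear in the Test #1 and Test #10 command lines.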


columns (* marks the best result in each row):

TEST               MAX_STREAMS 4    MAX_STREAMS 8  PER_CPU STREAMS
====================================================================

Test #1 iozone -t 1 -R -r 8K -s 2764M -I +Z
Initial write          853492.31*       835868.50        839789.56
Rewrite               1642073.88       1657255.75       1693011.50*
Read                  3384044.00*      3218727.25       3269109.50
Re-read               3389794.50*      3243187.00       3267422.25
Reverse Read          3209805.75*      3082040.00       3107957.25
Stride read           3100144.50*      2972280.25       2923155.25
Random read           2992249.75*      2874605.00       2854824.25
Mixed workload        2992274.75*      2878212.25       2883840.00
Random write          1471800.00       1452346.50       1515678.75*
Pwrite                 802083.00        801627.31        820251.69*
Pread                 3443495.00*      3308659.25       3302089.00
Fwrite                1880446.88       1838607.50       1909490.00*
Fread                 3479614.75       3091634.75       6442964.50*
= real                  1m4.170s         1m4.513s         1m4.123s
= user                  0m0.559s         0m0.518s         0m0.511s
= sys                  0m18.766s        0m19.264s        0m18.641s


Test #2 iozone -t 2 -R -r 16K -s 1228M -I +Z
Initial write         2102532.12       2051809.19       2419072.50*
Rewrite               2217024.25       2250930.00       3681559.00*
Read                  7716933.25       7898759.00       8345507.75*
Re-read               7748487.75       7765282.25       8342367.50*
Reverse Read          7415254.25       7552637.25       7822691.75*
Stride read           7041909.50       7091049.25       7401273.00*
Random read           6205044.25       6738888.50       7232104.25*
Mixed workload        4582990.00       5271651.50       5361002.88*
Random write          2591893.62       2513729.88       3660774.38*
Pwrite                1873876.75       1909758.69       2087238.81*
Pread                 4669850.00       4651121.56       4919588.44*
Fwrite                1937947.25       1940628.06       2034251.25*
Fread                 9930319.00       9970078.00*      9831422.50
= real                 0m53.844s        0m53.607s        0m52.528s
= user                  0m0.273s         0m0.289s         0m0.280s
= sys                  0m16.595s        0m16.478s        0m14.072s


Test #3 iozone -t 3 -R -r 24K -s 716M -I +Z
Initial write         3036567.50       2998918.25       3683853.00*
Rewrite               3402447.88       3415685.88       5054705.38*
Read                 11767413.00*     11133789.50      11246497.25
Re-read              11797680.50*     11092592.00      11277382.00
Reverse Read         10828320.00*     10157665.50      10749055.00
Stride read          10532039.50*      9943521.75      10464700.25
Random read          10380365.75*      9807859.25      10234127.00
Mixed workload        8772132.50*      8415083.50       8457108.50
Random write          3364875.00       3310042.00       5059136.38*
Pwrite                2677290.25       2651309.50       3198166.25*
Pread                 5221799.56*      4963050.69       4987293.78
Fwrite                2026887.56       2047679.00       2124199.62*
Fread                11310381.25      11413531.50      11444208.75*
= real                 0m50.209s        0m50.782s        0m49.750s
= user                  0m0.195s         0m0.205s         0m0.215s
= sys                  0m14.873s        0m15.159s        0m12.911s


Test #4 iozone -t 4 -R -r 32K -s 460M -I +Z
Initial write         3841474.94       3859279.81       5309988.88*
Rewrite               3905526.25       3917309.62       6814800.62*
Read                 16233054.50      14843560.25      16352283.75*
Re-read              16335506.50      15529152.25      16352570.00*
Reverse Read         15316394.50*     14225482.50      15004897.50
Stride read          14799380.25*     14064034.25      14355184.25
Random read          14683771.00      14206928.50      14814913.00*
Mixed workload        9058851.50       9180650.75      10815917.50*
Random write          3990585.94       4004757.00       6722088.50*
Pwrite                3318836.12       3468977.69       4244747.69*
Pread                 5894538.16*      5588046.38       5847345.62
Fwrite                2227353.75       2186688.62       2386974.88*
Fread                12046094.00      12240004.75*     12073956.75
= real                 0m48.561s        0m48.839s        0m48.142s
= user                  0m0.155s         0m0.170s         0m0.133s
= sys                  0m13.650s        0m13.684s        0m10.790s


Test #5 iozone -t 5 -R -r 40K -s 307M -I +Z
Initial write         4034878.94       4026610.69       5775746.12*
Rewrite               3898600.44       3901114.16       6923764.19*
Read                 14947360.88      16698824.25*     10155333.62
Re-read              15844580.75*     15344057.00       9869874.38
Reverse Read          7459156.95       9023317.86*      7648295.03
Stride read          10823891.81       9615553.81      11231183.72*
Random read          10391702.56*      9740935.75      10048038.28
Mixed workload        8261830.94      10175925.00*      7535763.75
Random write          3951423.31       3960984.62       6671441.38*
Pwrite                4119023.12       4097204.56       5975659.12*
Pread                 6072076.73*      4338668.50       6020808.34
Fwrite                2417235.47       2337875.88       2665450.62*
Fread                13393630.25      13648332.00*     13395391.00
= real                 0m47.756s        0m47.939s        0m47.483s
= user                  0m0.128s         0m0.128s         0m0.119s
= sys                  0m10.361s        0m10.392s         0m8.717s


Test #6 iozone -t 6 -R -r 48K -s 204M -I +Z
Initial write         4134932.97       4137171.88       5983193.31*
Rewrite               3928131.31       3950764.00       7124248.00*
Read                 10965005.75*     10152236.50       9856572.88
Re-read               9386946.00      10776231.38      14303174.12*
Reverse Read          6035244.89       7456152.38*      5999446.38
Stride read           8041000.75       7995307.75      10182936.75*
Random read           8565099.09      10487707.58*      8694877.25
Mixed workload        5301593.06       7332589.09*      6802251.06
Random write          4046482.56       3986854.94       6723824.56*
Pwrite                4188226.41       4214513.34       6245278.44*
Pread                 3452596.86       3708694.69*      3486420.41
Fwrite                2829500.22       3030742.72       3033792.28*
Fread                13331387.75      13490416.50      14940410.25*
= real                 0m47.150s        0m47.050s        0m47.044s
= user                  0m0.106s         0m0.100s         0m0.094s
= sys                   0m9.238s         0m8.804s         0m6.930s


Test #7 iozone -t 7 -R -r 56K -s 131M -I +Z
Initial write         4169480.84       4116331.03       5946801.38*
Rewrite               3993155.97       3986195.00       6928142.44*
Read                 18901600.25*     10088918.69       6699592.78
Re-read               8738544.69      14881309.62*     13960026.06
Reverse Read          5008919.08       7923949.95*      5495212.41
Stride read           7029436.75       8747574.91*      6477087.25
Random read           6994738.56*      5448687.81       6585235.53
Mixed workload        5178632.44       5258914.92       5587421.81*
Random write          4008977.78       3928116.88       6816453.12*
Pwrite                4342852.09       4154319.09       6124520.06*
Pread                 3880318.99       2978587.56       4493903.14*
Fwrite                5557990.03       2923556.59       6126649.94*
Fread                14451722.00      15281179.62*     14675436.50
= real                 0m46.321s        0m46.458s        0m45.791s
= user                  0m0.093s         0m0.089s         0m0.095s
= sys                   0m6.961s         0m6.600s         0m5.499s


Test #8 iozone -t 8 -R -r 64K -s 76M -I +Z
Initial write         4354783.88       4392731.31       6337397.50*
Rewrite               4070162.69       3974051.50       7587279.81*
Read                 10095324.56      17945227.88*      8359665.56
Re-read              12316555.88      20468303.75*      7949999.34
Reverse Read          4924659.84       8542573.33*      6388858.72
Stride read          10895715.69      14828968.38*      6107484.81
Random read           6838537.34      14352104.25*      5389174.97
Mixed workload        5805646.75       8391745.53*      6052748.25
Random write          4148973.38       3890847.38       7247214.19*
Pwrite                4309372.41       4423800.34       6863604.69*
Pread                 4875766.02*      4042375.33       3692948.91
Fwrite                6102404.31       6021884.41       6634112.09*
Fread                15485971.12*     14900780.62      13981842.50
= real                 0m45.618s        0m45.753s        0m45.619s
= user                  0m0.071s         0m0.080s         0m0.060s
= sys                   0m4.702s         0m4.430s         0m3.555s


Test #9 iozone -t 9 -R -r 72K -s 34M -I +Z
Initial write         4202354.67       4208936.34       6300798.88*
Rewrite               4046855.38       4294137.50       7623323.69*
Read                 10926571.88      13304801.81*     10895587.19
Re-read              17725984.94*      7964431.25      12394078.50
Reverse Read          5843121.72       5851846.66*      4075657.20
Stride read           9688998.59      10306234.70*      5566376.62
Random read           7656689.97       8660602.06*      5437182.36
Mixed workload        6229215.62      11205238.73*      5575719.75
Random write          4094822.22       4517401.86       6601624.94*
Pwrite                4274497.50       4263936.64       6844453.11*
Pread                 6525075.62*      6043725.62       5745003.28
Fwrite                5958798.56       8430354.78*      7636085.00
Fread                18636725.12*     17268959.12      16618803.62
= real                 0m44.945s        0m44.816s        0m45.194s
= user                  0m0.062s         0m0.060s         0m0.060s
= sys                   0m2.187s         0m2.223s         0m1.888s


Test #10 iozone -t 10 -R -r 80K -s 0M -I +Z
Initial write         3213973.56       2731512.62       4416466.25*
Rewrite               3066956.44*      2693819.50        332671.94
Read                  7769523.25*      2681473.75        462840.44
Re-read               5244861.75       5473037.00*       382183.03
Reverse Read          7479397.25*      4869597.75        374714.06
Stride read           5403282.50*      5385083.75        382473.44
Random read           5131997.25       5176799.75*       380593.56
Mixed workload        3998043.25       4219049.00*      1645850.45
Random write          3452832.88       3290861.69       3588531.75*
Pwrite                3757435.81       2711756.47       4561807.88*
Pread                 2743595.25*      2635835.00        412947.98
Fwrite               16076549.00      16741977.25*     14797209.38
Fread                23581812.62*     21664184.25       5064296.97
= real                 0m44.490s        0m44.444s        0m44.609s
= user                  0m0.054s         0m0.049s         0m0.055s
= sys                   0m0.037s         0m0.046s         0m0.148s


so when the number of active tasks becomes larger than the number
of online CPUs, iozone reports data that is a bit hard to understand. I
can assume that since we now keep preemption disabled longer
in the write path, a concurrent operation (READ or WRITE) cannot preempt
current anymore... slightly suspicious.

the other hard to understand thing is why READ-only tests have
such huge jitter. READ-only tests don't depend on streams, they
don't even use them; we supply compressed data directly to the
decompression api.

maybe it's better to retire iozone and never use it again.


"118 insertions(+), 238 deletions(-)" the patches remove a big
pile of code.

-ss