Re: zram: per-cpu compression streams

From: Minchan Kim
Date: Thu Mar 24 2016 - 19:40:31 EST


Hi Sergey,

On Wed, Mar 23, 2016 at 05:18:27PM +0900, Sergey Senozhatsky wrote:
> ( was "[PATCH] zram: export the number of available comp streams"
> forked from http://marc.info/?l=linux-kernel&m=145860707516861 )
>
> d'oh.... sorry, now actually forked.
>
>
> Hello Minchan,
>
> forked into a separate thread.
>
> > On (03/22/16 09:39), Minchan Kim wrote:
> > > zram_bvec_write()
> > > {
> > >         *get_cpu_ptr(comp->stream);
> > >         zcomp_compress();
> > >         zs_malloc();
> > >         put_cpu_ptr(comp->stream);
> > > }
> > >
> > > this, however, makes zsmalloc unhappy. the pool has GFP_NOIO | __GFP_HIGHMEM
> > > gfp, and GFP_NOIO is ___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM. this
> > > __GFP_DIRECT_RECLAIM conflicts with per-cpu streams, because per-cpu
> > > streams require disabled preemption (up until we copy the stream buffer
> > > to the zspage). so what options do we have here... off the top of my
> > > head (w/o a lot of thinking)...
> >
> > Indeed.
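
To spell out the flag arithmetic behind the conflict (a sketch against the
linux/gfp.h of that era, not code from the thread's patches):

	/*
	 * GFP_NOIO is __GFP_RECLAIM, i.e.
	 * ___GFP_DIRECT_RECLAIM | ___GFP_KSWAPD_RECLAIM. Masking out
	 * direct reclaim leaves an allocation that may wake kswapd but
	 * never sleeps itself, which is what a preemption-disabled
	 * per-cpu stream section can tolerate.
	 */
	gfp_t fast_gfp = (GFP_NOIO | __GFP_HIGHMEM) & ~__GFP_DIRECT_RECLAIM;
	/* == __GFP_KSWAPD_RECLAIM | __GFP_HIGHMEM */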
> ...
> > How about this?
> >
> > zram_bvec_write()
> > {
> > retry:
> >         *get_cpu_ptr(comp->stream);
> >         zcomp_compress();
> >         handle = zs_malloc((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN);
> >         if (!handle) {
> >                 put_cpu_ptr(comp->stream);
> >                 handle = zs_malloc(gfp);
> >                 goto retry;
> >         }
> >         put_cpu_ptr(comp->stream);
> > }
>
> interesting. the retry jump should go higher: we have "user_mem = kmap_atomic(page)",
> which we unmap right after compression, because a) we don't need the
> uncompressed memory anymore and b) zs_malloc() can sleep and we can't keep an
> atomic mapping around. the nasty thing here is is_partial_io(). we need to re-do
>
> if (is_partial_io(bvec))
>         memcpy(uncmem + offset, user_mem + bvec->bv_offset,
>                bvec->bv_len);
>
> once again in the worst case.
>
> so zs_malloc((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN) can cause a double
> memcpy() and a double compression in the worst case. just to outline this.
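
Putting the two messages together, the write path would look roughly like
this (a sketch only: names are abbreviated, error handling is dropped, and
it assumes zs_malloc() takes a per-call gfp_t, which is what the proposal
above implies):

	unsigned long handle = 0;

compress_again:
	user_mem = kmap_atomic(page);
	if (is_partial_io(bvec))                /* re-done on every retry */
		memcpy(uncmem + offset, user_mem + bvec->bv_offset,
		       bvec->bv_len);

	zstrm = *get_cpu_ptr(comp->stream);     /* disables preemption */
	zcomp_compress();                       /* re-done on every retry */
	kunmap_atomic(user_mem);                /* zs_malloc() below may sleep */

	if (!handle)
		handle = zs_malloc((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN);
	if (!handle) {
		put_cpu_ptr(comp->stream);      /* preemption is on again */
		handle = zs_malloc(gfp);        /* may sleep, may reclaim */
		if (!handle)
			return -ENOMEM;
		goto compress_again;            /* worst case: second memcpy()
						 * and second compression */
	}
	/* copy zstrm->buffer into the zspage for "handle", then */
	put_cpu_ptr(comp->stream);

Keeping the already-allocated handle across the goto avoids re-running the
fast-path allocation. One subtlety the sketch glosses over: the second
compression pass may produce a different compressed length than the one the
handle was allocated for, so a real implementation would have to re-check
the sizes after the retry.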
>
>
> the test.
>
> I executed a number of iozone tests, on each iteration re-creating zram
> device (3GB, LZO, EXT4. the box has 4 x86_64 CPUs).
>
> DEVICE_SZ=3G
> FREE_SPACE is 10% of DEVICE_SZ
> time ./iozone -t $i -R -r $((8*$i))K -s $((($DEVICE_SZ/$i - $FREE_SPACE)/(1024*1024)))M -I +Z
>
>
> columns ("*" marks the best result in each row):
>
> TEST            MAX_STREAMS=4   MAX_STREAMS=8   PER_CPU STREAMS
> ====================================================================
>
> Test #1 iozone -t 1 -R -r 8K -s 2764M -I +Z
> Initial write 853492.31* 835868.50 839789.56
> Rewrite 1642073.88 1657255.75 1693011.50*
> Read 3384044.00* 3218727.25 3269109.50
> Re-read 3389794.50* 3243187.00 3267422.25
> Reverse Read 3209805.75* 3082040.00 3107957.25
> Stride read 3100144.50* 2972280.25 2923155.25
> Random read 2992249.75* 2874605.00 2854824.25
> Mixed workload 2992274.75* 2878212.25 2883840.00
> Random write 1471800.00 1452346.50 1515678.75*
> Pwrite 802083.00 801627.31 820251.69*
> Pread 3443495.00* 3308659.25 3302089.00
> Fwrite 1880446.88 1838607.50 1909490.00*
> Fread 3479614.75 3091634.75 6442964.50*
> = real 1m4.170s 1m4.513s 1m4.123s
> = user 0m0.559s 0m0.518s 0m0.511s
> = sys 0m18.766s 0m19.264s 0m18.641s
>
>
> Test #2 iozone -t 2 -R -r 16K -s 1228M -I +Z
> Initial write 2102532.12 2051809.19 2419072.50*
> Rewrite 2217024.25 2250930.00 3681559.00*
> Read 7716933.25 7898759.00 8345507.75*
> Re-read 7748487.75 7765282.25 8342367.50*
> Reverse Read 7415254.25 7552637.25 7822691.75*
> Stride read 7041909.50 7091049.25 7401273.00*
> Random read 6205044.25 6738888.50 7232104.25*
> Mixed workload 4582990.00 5271651.50 5361002.88*
> Random write 2591893.62 2513729.88 3660774.38*
> Pwrite 1873876.75 1909758.69 2087238.81*
> Pread 4669850.00 4651121.56 4919588.44*
> Fwrite 1937947.25 1940628.06 2034251.25*
> Fread 9930319.00 9970078.00* 9831422.50
> = real 0m53.844s 0m53.607s 0m52.528s
> = user 0m0.273s 0m0.289s 0m0.280s
> = sys 0m16.595s 0m16.478s 0m14.072s
>
>
> Test #3 iozone -t 3 -R -r 24K -s 716M -I +Z
> Initial write 3036567.50 2998918.25 3683853.00*
> Rewrite 3402447.88 3415685.88 5054705.38*
> Read 11767413.00* 11133789.50 11246497.25
> Re-read 11797680.50* 11092592.00 11277382.00
> Reverse Read 10828320.00* 10157665.50 10749055.00
> Stride read 10532039.50* 9943521.75 10464700.25
> Random read 10380365.75* 9807859.25 10234127.00
> Mixed workload 8772132.50* 8415083.50 8457108.50
> Random write 3364875.00 3310042.00 5059136.38*
> Pwrite 2677290.25 2651309.50 3198166.25*
> Pread 5221799.56* 4963050.69 4987293.78
> Fwrite 2026887.56 2047679.00 2124199.62*
> Fread 11310381.25 11413531.50 11444208.75*
> = real 0m50.209s 0m50.782s 0m49.750s
> = user 0m0.195s 0m0.205s 0m0.215s
> = sys 0m14.873s 0m15.159s 0m12.911s
>
>
> Test #4 iozone -t 4 -R -r 32K -s 460M -I +Z
> Initial write 3841474.94 3859279.81 5309988.88*
> Rewrite 3905526.25 3917309.62 6814800.62*
> Read 16233054.50 14843560.25 16352283.75*
> Re-read 16335506.50 15529152.25 16352570.00*
> Reverse Read 15316394.50* 14225482.50 15004897.50
> Stride read 14799380.25* 14064034.25 14355184.25
> Random read 14683771.00 14206928.50 14814913.00*
> Mixed workload 9058851.50 9180650.75 10815917.50*
> Random write 3990585.94 4004757.00 6722088.50*
> Pwrite 3318836.12 3468977.69 4244747.69*
> Pread 5894538.16* 5588046.38 5847345.62
> Fwrite 2227353.75 2186688.62 2386974.88*
> Fread 12046094.00 12240004.75* 12073956.75
> = real 0m48.561s 0m48.839s 0m48.142s
> = user 0m0.155s 0m0.170s 0m0.133s
> = sys 0m13.650s 0m13.684s 0m10.790s
>
>
> Test #5 iozone -t 5 -R -r 40K -s 307M -I +Z
> Initial write 4034878.94 4026610.69 5775746.12*
> Rewrite 3898600.44 3901114.16 6923764.19*
> Read 14947360.88 16698824.25* 10155333.62
> Re-read 15844580.75* 15344057.00 9869874.38
> Reverse Read 7459156.95 9023317.86* 7648295.03
> Stride read 10823891.81 9615553.81 11231183.72*
> Random read 10391702.56* 9740935.75 10048038.28
> Mixed workload 8261830.94 10175925.00* 7535763.75
> Random write 3951423.31 3960984.62 6671441.38*
> Pwrite 4119023.12 4097204.56 5975659.12*
> Pread 6072076.73* 4338668.50 6020808.34
> Fwrite 2417235.47 2337875.88 2665450.62*
> Fread 13393630.25 13648332.00* 13395391.00
> = real 0m47.756s 0m47.939s 0m47.483s
> = user 0m0.128s 0m0.128s 0m0.119s
> = sys 0m10.361s 0m10.392s 0m8.717s
>
>
> Test #6 iozone -t 6 -R -r 48K -s 204M -I +Z
> Initial write 4134932.97 4137171.88 5983193.31*
> Rewrite 3928131.31 3950764.00 7124248.00*
> Read 10965005.75* 10152236.50 9856572.88
> Re-read 9386946.00 10776231.38 14303174.12*
> Reverse Read 6035244.89 7456152.38* 5999446.38
> Stride read 8041000.75 7995307.75 10182936.75*
> Random read 8565099.09 10487707.58* 8694877.25
> Mixed workload 5301593.06 7332589.09* 6802251.06
> Random write 4046482.56 3986854.94 6723824.56*
> Pwrite 4188226.41 4214513.34 6245278.44*
> Pread 3452596.86 3708694.69* 3486420.41
> Fwrite 2829500.22 3030742.72 3033792.28*
> Fread 13331387.75 13490416.50 14940410.25*
> = real 0m47.150s 0m47.050s 0m47.044s
> = user 0m0.106s 0m0.100s 0m0.094s
> = sys 0m9.238s 0m8.804s 0m6.930s
>
>
> Test #7 iozone -t 7 -R -r 56K -s 131M -I +Z
> Initial write 4169480.84 4116331.03 5946801.38*
> Rewrite 3993155.97 3986195.00 6928142.44*
> Read 18901600.25* 10088918.69 6699592.78
> Re-read 8738544.69 14881309.62* 13960026.06
> Reverse Read 5008919.08 7923949.95* 5495212.41
> Stride read 7029436.75 8747574.91* 6477087.25
> Random read 6994738.56* 5448687.81 6585235.53
> Mixed workload 5178632.44 5258914.92 5587421.81*
> Random write 4008977.78 3928116.88 6816453.12*
> Pwrite 4342852.09 4154319.09 6124520.06*
> Pread 3880318.99 2978587.56 4493903.14*
> Fwrite 5557990.03 2923556.59 6126649.94*
> Fread 14451722.00 15281179.62* 14675436.50
> = real 0m46.321s 0m46.458s 0m45.791s
> = user 0m0.093s 0m0.089s 0m0.095s
> = sys 0m6.961s 0m6.600s 0m5.499s
>
>
> Test #8 iozone -t 8 -R -r 64K -s 76M -I +Z
> Initial write 4354783.88 4392731.31 6337397.50*
> Rewrite 4070162.69 3974051.50 7587279.81*
> Read 10095324.56 17945227.88* 8359665.56
> Re-read 12316555.88 20468303.75* 7949999.34
> Reverse Read 4924659.84 8542573.33* 6388858.72
> Stride read 10895715.69 14828968.38* 6107484.81
> Random read 6838537.34 14352104.25* 5389174.97
> Mixed workload 5805646.75 8391745.53* 6052748.25
> Random write 4148973.38 3890847.38 7247214.19*
> Pwrite 4309372.41 4423800.34 6863604.69*
> Pread 4875766.02* 4042375.33 3692948.91
> Fwrite 6102404.31 6021884.41 6634112.09*
> Fread 15485971.12* 14900780.62 13981842.50
> = real 0m45.618s 0m45.753s 0m45.619s
> = user 0m0.071s 0m0.080s 0m0.060s
> = sys 0m4.702s 0m4.430s 0m3.555s
>
>
> Test #9 iozone -t 9 -R -r 72K -s 34M -I +Z
> Initial write 4202354.67 4208936.34 6300798.88*
> Rewrite 4046855.38 4294137.50 7623323.69*
> Read 10926571.88 13304801.81* 10895587.19
> Re-read 17725984.94* 7964431.25 12394078.50
> Reverse Read 5843121.72 5851846.66* 4075657.20
> Stride read 9688998.59 10306234.70* 5566376.62
> Random read 7656689.97 8660602.06* 5437182.36
> Mixed workload 6229215.62 11205238.73* 5575719.75
> Random write 4094822.22 4517401.86 6601624.94*
> Pwrite 4274497.50 4263936.64 6844453.11*
> Pread 6525075.62* 6043725.62 5745003.28
> Fwrite 5958798.56 8430354.78* 7636085.00
> Fread 18636725.12* 17268959.12 16618803.62
> = real 0m44.945s 0m44.816s 0m45.194s
> = user 0m0.062s 0m0.060s 0m0.060s
> = sys 0m2.187s 0m2.223s 0m1.888s
>
>
> Test #10 iozone -t 10 -R -r 80K -s 0M -I +Z
> Initial write 3213973.56 2731512.62 4416466.25*
> Rewrite 3066956.44* 2693819.50 332671.94
> Read 7769523.25* 2681473.75 462840.44
> Re-read 5244861.75 5473037.00* 382183.03
> Reverse Read 7479397.25* 4869597.75 374714.06
> Stride read 5403282.50* 5385083.75 382473.44
> Random read 5131997.25 5176799.75* 380593.56
> Mixed workload 3998043.25 4219049.00* 1645850.45
> Random write 3452832.88 3290861.69 3588531.75*
> Pwrite 3757435.81 2711756.47 4561807.88*
> Pread 2743595.25* 2635835.00 412947.98
> Fwrite 16076549.00 16741977.25* 14797209.38
> Fread 23581812.62* 21664184.25 5064296.97
> = real 0m44.490s 0m44.444s 0m44.609s
> = user 0m0.054s 0m0.049s 0m0.055s
> = sys 0m0.037s 0m0.046s 0m0.148s
>
>
> so when the number of active tasks becomes larger than the number
> of online CPUs, iozone reports data that is a bit hard to understand. I
> can assume that, since we now keep preemption disabled for longer in
> the write path, a concurrent operation (READ or WRITE) cannot preempt
> the current one anymore... slightly suspicious.
>
> the other hard-to-understand thing is why READ-only tests have
> such huge jitter. READ-only tests don't depend on streams, they
> don't even use them; we supply the compressed data directly to the
> decompression api.
>
> maybe it's better to retire iozone and never use it again.
>
>
> "118 insertions(+), 238 deletions(-)" the patches remove a big
> pile of code.

First of all, I appreciate it very much!
At a glance, it's a huge win on the write workloads, but it's worth
investigating how such fluctuation/regression happens on the read-related
tests (read and mixed workloads).

Could you send your patchset? I will test it.