Re: [RFC][PATCH v2 3/3] mm/zsmalloc: increase ZS_MAX_PAGES_PER_ZSPAGE

From: Sergey Senozhatsky
Date: Tue Feb 23 2016 - 05:34:20 EST


On (02/23/16 17:25), Minchan Kim wrote:
[..]
>
> That sounds like a plan but at a first glance, my worry is we might need
> some special handling related to objs_per_zspage and pages_per_zspage
> because currently, we have assumed all of zspages in a class has same
> number of subpages so it might make it ugly.

I did some further testing, and something has shown up that I want
to discuss before we go with ORDER4 (here and later ORDER4 stands for
`#define ZS_MAX_HUGE_ZSPAGE_ORDER 4' for simplicity).

/*
 * for testing purposes I have extended zsmalloc pool stats with the zs_can_compact() value.
 * see below
 */

And the thing is -- the internal class fragmentation is quite huge. These are the 'normal'
classes, not affected by the ORDER modification in any way:

class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage compact
107 1744 1 23 196 76 84 3 51
111 1808 0 0 63 63 28 4 0
126 2048 0 160 568 408 284 1 80
144 2336 52 620 8631 5747 4932 4 1648
151 2448 123 406 10090 8736 6054 3 810
168 2720 0 512 15738 14926 10492 2 540
190 3072 0 2 136 130 102 3 3


so I've been thinking about using some sort of watermarks (well, zsmalloc is an allocator
after all, and allocators love watermarks :-)). we can't defeat this fragmentation: we never
know in advance which pages will be modified, nor the size class those pages will land in
after compression. but we do know the stats for every class -- zs_can_compact(),
obj_allocated/obj_used, etc. so we can start class compaction if we detect that internal
fragmentation is too high (e.g. 30+% of the class pages can be compacted).

on the other hand, we can always wait for the shrinker to come in and do the job for us,
but that can take some time.
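
to illustrate the idea, a rough and purely hypothetical sketch (CLASS_FRAG_WATERMARK, the 30%
threshold and class_frag_watermark_exceeded() are made up for illustration; only zs_can_compact(),
zs_stat_get() and get_maxobj_per_zspage() are existing zsmalloc helpers):

/* illustration only, not part of any posted patch */
#define CLASS_FRAG_WATERMARK	30	/* percent of class pages */

static bool class_frag_watermark_exceeded(struct size_class *class)
{
	unsigned long pages_used, freeable;

	spin_lock(&class->lock);
	/* same math zs_stats_size_show() uses for pages_used */
	pages_used = zs_stat_get(class, OBJ_ALLOCATED) /
			get_maxobj_per_zspage(class->size,
					class->pages_per_zspage) *
			class->pages_per_zspage;
	freeable = zs_can_compact(class);
	spin_unlock(&class->lock);

	if (!pages_used)
		return false;

	/* compact the class once 30+% of its pages can be released */
	return freeable * 100 / pages_used >= CLASS_FRAG_WATERMARK;
}

zs_free() (or a deferred worker) could then compact that particular class only, instead of
waiting for the shrinker to walk all of the classes.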

what's your opinion on this?



The test.

1) create a 2G zram device, lzo compression, ext4 fs
2) create 1G of text files and 1G of binary files -- the latter part is tricky. binary files
in general already imply some sort of compression, so the chances that binary files
will just pressure the 4096 class are very high. in my test I use vmscan.c as the text file
and vmlinux as the binary file: it seems to fit perfectly, it warms up all of the "ex-huge"
classes on my system:

202 3264 1 0 17820 17819 14256 4 0
206 3328 0 1 10096 10087 8203 13 0
207 3344 0 1 3212 3206 2628 9 0
208 3360 0 1 1785 1779 1470 14 0
211 3408 0 0 10662 10662 8885 5 0
212 3424 0 1 1881 1876 1584 16 0
214 3456 0 1 5174 5170 4378 11 0
217 3504 0 0 6181 6181 5298 6 0
219 3536 0 1 4410 4406 3822 13 0
222 3584 0 1 5224 5220 4571 7 0
223 3600 0 1 952 946 840 15 0
225 3632 1 0 1638 1636 1456 8 0
228 3680 0 1 1410 1403 1269 9 0
230 3712 1 0 462 461 420 10 0
232 3744 0 1 528 519 484 11 0
234 3776 0 1 559 554 516 12 0
235 3792 0 1 70 57 65 13 0
236 3808 1 0 105 104 98 14 0
238 3840 0 1 176 166 165 15 0
254 4096 0 0 1944 1944 1944 1 0


3) MAIN-test:
for j in {2..10}; do
	create_test_files
	truncate_bin_files $j
	truncate_text_files $j
	remove_test_files
done

so it creates text and binary files, truncates them, removes them, and does the whole thing again.
the truncation is to 1/2, 1/3 ... 1/10 of the original file size.
the order of file modifications is preserved across all of the tests.

4) SUB-test (gzipped files mostly pressure the 4096 class, but I decided to keep it)
`gzip -9' all text files
create a copy of every gzipped file ("cp FOO.gz FOO"), so `gzip -d' later has to overwrite the FOO file content
`gzip -d' all text files

5) goto 1



I'll just post a shorter version of the results
(two columns from zram's mm_stat: total_used_mem / max_used_mem)

#1 BASE ORDER4
INITIAL STATE 1016832000 / 1016832000 968470528 / 968470528
TRUNCATE BIN 1/2 715878400 / 1017081856 744165376 / 968691712
TRUNCATE TEXT 1/2 388759552 / 1017081856 417140736 / 968691712
REMOVE FILES 6467584 / 1017081856 6754304 / 968691712

* see below


#2
INITIAL STATE 1021116416 / 1021116416 972718080 / 972718080
TRUNCATE BIN 1/3 683802624 / 1021378560 683589632 / 972955648
TRUNCATE TEXT 1/3 244162560 / 1021378560 244170752 / 972955648
REMOVE FILES 12943360 / 1021378560 11587584 / 972955648

#3
INITIAL STATE 1023041536 / 1023041536 974557184 / 974557184
TRUNCATE BIN 1/4 685211648 / 1023049728 685113344 / 974581760
TRUNCATE TEXT 1/4 189755392 / 1023049728 189194240 / 974581760
REMOVE FILES 14589952 / 1023049728 13537280 / 974581760

#4
INITIAL STATE 1023139840 / 1023139840 974815232 / 974815232
TRUNCATE BIN 1/5 685199360 / 1023143936 686104576 / 974823424
TRUNCATE TEXT 1/5 156557312 / 1023143936 156545024 / 974823424
REMOVE FILES 14704640 / 1023143936 14594048 / 974823424


#COMPRESS/DECOMPRESS test
INITIAL STATE 1022980096 / 1023135744 974516224 / 974749696
COMPRESS TEXT 1120362496 / 1124478976 1072607232 / 1076731904
DECOMPRESS TEXT 1024786432 / 1124478976 976502784 / 1076731904


Test #1 suffers from fragmentation; the pool stats for that test are:

100 1632 1 6 95 73 38 2 8
107 1744 0 18 154 60 66 3 39
111 1808 0 1 36 33 16 4 0
126 2048 0 41 208 167 104 1 20
144 2336 52 588 28637 26079 16364 4 1460
151 2448 113 396 37705 36391 22623 3 786
168 2720 0 525 69378 68561 46252 2 544
190 3072 0 123 1476 1222 1107 3 189
202 3264 25 97 1995 1685 1596 4 248
206 3328 11 119 2144 786 1742 13 1092
207 3344 0 91 1001 259 819 9 603
208 3360 0 69 1173 157 966 14 826
211 3408 20 114 1758 1320 1465 5 365
212 3424 0 63 1197 169 1008 16 864
214 3456 5 97 1326 506 1122 11 693
217 3504 27 109 1232 737 1056 6 420
219 3536 0 92 1380 383 1196 13 858
222 3584 4 131 1168 573 1022 7 518
223 3600 0 37 629 70 555 15 480
225 3632 0 99 891 377 792 8 456
228 3680 0 31 310 59 279 9 225
230 3712 0 0 0 0 0 10 0
232 3744 0 28 336 68 308 11 242
234 3776 0 14 182 28 168 12 132


Note that all of the classes (the leader, for example, is 2336) are significantly
fragmented. With ORDER4 we simply have more classes that join the "let's fragment"
party and add up to the numbers.



So, dynamic page allocation is good, but we would also need dynamic page
release. And it seems to me that a class watermark is a much simpler thing
to do.

Even if we abandon the idea of having ORDER4, the class fragmentation would
not go away.



> As well, please write down why order-4 for MAX_ZSPAGES is best
> if you resend it as formal patch.

sure, if it ever becomes a formal patch then I'll put more effort into documenting it.




** The stat patch:

we only have the numbers of ALMOST_FULL and ALMOST_EMPTY zspages in each class, but they
don't tell us how badly the class is fragmented internally.

so the /sys/kernel/debug/zsmalloc/zram0/classes output now looks as follows:

class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage compact
[..]
12 224 0 2 146 5 8 4 4
13 240 0 0 0 0 0 1 0
14 256 1 13 1840 1672 115 1 10
15 272 0 0 0 0 0 1 0
[..]
49 816 0 3 745 735 149 1 2
51 848 3 4 361 306 76 4 8
52 864 12 14 378 268 81 3 21
54 896 1 12 117 57 26 2 12
57 944 0 0 0 0 0 3 0
[..]
Total 26 131 12709 10994 1071 134


for example, class-896 is heavily fragmented -- it occupies 26 pages, 12 of which can be
freed by compaction.
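
(for reference, zs_can_compact() derives that number roughly as sketched below; for class-896
a 2-page zspage holds 8192/896 = 9 objects, so (117 - 57) unused objects / 9 objects per
zspage = 6 zspages * 2 pages_per_zspage = 12 freeable pages)

/* simplified sketch of the calculation done by zs_can_compact() */
static unsigned long zs_can_compact(struct size_class *class)
{
	unsigned long obj_wasted;

	/* objects that are allocated but hold no data */
	obj_wasted = zs_stat_get(class, OBJ_ALLOCATED) -
			zs_stat_get(class, OBJ_USED);

	/* how many whole zspages those wasted objects amount to... */
	obj_wasted /= get_maxobj_per_zspage(class->size,
			class->pages_per_zspage);

	/* ...times pages per zspage = pages compaction can release */
	return obj_wasted * class->pages_per_zspage;
}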


does it look good enough to you to be committed on its own (outside of the series)?

====8<====8<====

From: Sergey Senozhatsky <sergey.senozhatsky@xxxxxxxxx>
Subject: [PATCH] mm/zsmalloc: add can_compact to pool stat

---
mm/zsmalloc.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 43e4cbc..046d364 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -494,6 +494,8 @@ static void __exit zs_stat_exit(void)
 	debugfs_remove_recursive(zs_stat_root);
 }
 
+static unsigned long zs_can_compact(struct size_class *class);
+
 static int zs_stats_size_show(struct seq_file *s, void *v)
 {
 	int i;
@@ -501,14 +503,15 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
 	struct size_class *class;
 	int objs_per_zspage;
 	unsigned long class_almost_full, class_almost_empty;
-	unsigned long obj_allocated, obj_used, pages_used;
+	unsigned long obj_allocated, obj_used, pages_used, compact;
 	unsigned long total_class_almost_full = 0, total_class_almost_empty = 0;
 	unsigned long total_objs = 0, total_used_objs = 0, total_pages = 0;
+	unsigned long total_compact = 0;
 
-	seq_printf(s, " %5s %5s %11s %12s %13s %10s %10s %16s\n",
+	seq_printf(s, " %5s %5s %11s %12s %13s %10s %10s %16s %7s\n",
 			"class", "size", "almost_full", "almost_empty",
 			"obj_allocated", "obj_used", "pages_used",
-			"pages_per_zspage");
+			"pages_per_zspage", "compact");
 
 	for (i = 0; i < zs_size_classes; i++) {
 		class = pool->size_class[i];
@@ -521,6 +524,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
 		class_almost_empty = zs_stat_get(class, CLASS_ALMOST_EMPTY);
 		obj_allocated = zs_stat_get(class, OBJ_ALLOCATED);
 		obj_used = zs_stat_get(class, OBJ_USED);
+		compact = zs_can_compact(class);
 		spin_unlock(&class->lock);
 
 		objs_per_zspage = get_maxobj_per_zspage(class->size,
@@ -528,23 +532,25 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
 		pages_used = obj_allocated / objs_per_zspage *
 				class->pages_per_zspage;
 
-		seq_printf(s, " %5u %5u %11lu %12lu %13lu %10lu %10lu %16d\n",
+		seq_printf(s, " %5u %5u %11lu %12lu %13lu"
+				" %10lu %10lu %16d %7lu\n",
 			i, class->size, class_almost_full, class_almost_empty,
 			obj_allocated, obj_used, pages_used,
-			class->pages_per_zspage);
+			class->pages_per_zspage, compact);
 
 		total_class_almost_full += class_almost_full;
 		total_class_almost_empty += class_almost_empty;
 		total_objs += obj_allocated;
 		total_used_objs += obj_used;
 		total_pages += pages_used;
+		total_compact += compact;
 	}
 
 	seq_puts(s, "\n");
-	seq_printf(s, " %5s %5s %11lu %12lu %13lu %10lu %10lu\n",
+	seq_printf(s, " %5s %5s %11lu %12lu %13lu %10lu %10lu %16s %7lu\n",
 			"Total", "", total_class_almost_full,
 			total_class_almost_empty, total_objs,
-			total_used_objs, total_pages);
+			total_used_objs, total_pages, "", total_compact);
 
 	return 0;
 }
--
2.7.1