Re: [PATCHv4 02/17] zram: do not use per-CPU compression streams
From: Kairui Song
Date: Thu Feb 06 2025 - 01:55:47 EST
On Mon, Feb 3, 2025 at 11:49 AM Sergey Senozhatsky
<senozhatsky@xxxxxxxxxxxx> wrote:
>
> On (25/02/01 17:21), Kairui Song wrote:
> > This seems to cause a huge performance regression on multi-core
> > systems, and it gets significantly worse as the number of concurrent
> > tasks increases:
> >
> > Test build linux kernel using ZRAM as SWAP (1G memcg):
> >
> > Before:
> > + /usr/bin/time make -s -j48
> > 2495.77user 2604.77system 2:12.95elapsed 3836%CPU (0avgtext+0avgdata
> > 863304maxresident)k
> >
> > After:
> > + /usr/bin/time make -s -j48
> > 2403.60user 6676.09system 3:38.22elapsed 4160%CPU (0avgtext+0avgdata
> > 863276maxresident)k
>
> How many CPUs do you have? I assume preemption gets in the way, which
> is sort of expected, to be honest. Using per-CPU compression streams
> disables preemption and uses the CPU exclusively, at the price of other
> tasks not being able to run. I do tend to think that I made a mistake
> by switching zram to per-CPU compression streams.
>
> What preemption model do you use and to what extent do you overload
> your system?
>
> My tests don't show anything unusual (but I don't overload the system)
>
> CONFIG_PREEMPT
I'm using CONFIG_PREEMPT_VOLUNTARY=y, and there are 96 logical CPUs
(48c96t), so make -j48 shouldn't be considered an overload, I think.
make -j32 also showed an obvious slowdown.
>
> before
> 1371.96user 156.21system 1:30.91elapsed 1680%CPU (0avgtext+0avgdata 825636maxresident)k
> 32688inputs+1768416outputs (259major+51539861minor)pagefaults 0swaps
>
> after
> 1372.05user 155.79system 1:30.82elapsed 1682%CPU (0avgtext+0avgdata 825684maxresident)k
> 32680inputs+1768416outputs (273major+51541815minor)pagefaults 0swaps
>
> (I use zram as a block device with ext4 on it.)
I'm testing with ZRAM as SWAP and tmpfs as the storage for the kernel
source code, with memory pressure inside a 2G or smaller mem cgroup
(depending on whether it's make -j48 or -j32).
>
> > `perf lock contention -ab sleep 3` also indicates that the big spin
> > lock in zcomp_stream_get/put is seeing significant contention:
>
> Hmm it's just
>
> spin_lock()
> list first entry
> spin_unlock()
>
> That shouldn't be "a big spin lock"; very odd. I'm not familiar with
> perf lock contention, let me take a look.
I can debug this a bit more later to figure out why the contention is
so high, but my first thought is that, as Yosry also mentioned in
another reply, making it preemptible doesn't necessarily mean the
per-CPU streams have to go.