RE: [PATCH v6] mm/zswap: move to use crypto_acomp API for hardware acceleration

From: Song Bao Hua (Barry Song)
Date: Tue Sep 29 2020 - 07:06:40 EST




> -----Original Message-----
> From: Sebastian Andrzej Siewior [mailto:bigeasy@xxxxxxxxxxxxx]
> Sent: Tuesday, September 29, 2020 11:29 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> Cc: akpm@xxxxxxxxxxxxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxx;
> davem@xxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; Luis Claudio R . Goncalves
> <lgoncalv@xxxxxxxxxx>; Mahipal Challa <mahipalreddy2006@xxxxxxxxx>;
> Seth Jennings <sjenning@xxxxxxxxxx>; Dan Streetman <ddstreet@xxxxxxxx>;
> Vitaly Wool <vitaly.wool@xxxxxxxxxxxx>; Wangzhou (B)
> <wangzhou1@xxxxxxxxxxxxx>; fanghao (A) <fanghao11@xxxxxxxxxx>; Colin
> Ian King <colin.king@xxxxxxxxxxxxx>
> Subject: Re: [PATCH v6] mm/zswap: move to use crypto_acomp API for
> hardware acceleration
>
> On 2020-09-29 10:02:15 [+0000], Song Bao Hua (Barry Song) wrote:
> > > My point was that there will be a warning at run-time and you don't
> > > want that. There are raw_ accessors if you know what you are doing.
> > > But…
> >
> > I have only seen get_cpu_ptr/var() things will disable preemption. I
> > don't think we will have a warning as this_cpu_ptr() won't disable
> preemption.
>
> Good. Just enable CONFIG_DEBUG_PREEMPT and tell please what happens.
>

I probably see what you mean now: this_cpu_ptr() should usually be called in a
preemption-disabled context. I will test further and verify carefully with
CONFIG_DEBUG_PREEMPT enabled.

So your suggestion is actually the below?
a. get_cpu_ptr()                    // get acomp_ctx and disable preemption
b. raw_cpu_ptr() or this_cpu_ptr()  // get dstmem and mutex
c. put_cpu_ptr()                    // re-enable preemption

In this way, all of steps "a" to "c" run strictly in a preemption-disabled context.
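To make sure I read the suggestion right, here is a rough sketch of steps a-c.
The names (pool->acomp_ctx, zswap_dstmem, zswap_mutex) are hypothetical
stand-ins for the per-CPU state in this patch, not the exact code:

```c
/*
 * Sketch only -- hypothetical names, not the actual patch.
 *
 * a. get_cpu_ptr() disables preemption while we derive all of
 *    this CPU's per-CPU pointers.
 */
acomp_ctx = get_cpu_ptr(pool->acomp_ctx);

/*
 * b. With preemption off, this_cpu_ptr() is warning-free under
 *    CONFIG_DEBUG_PREEMPT.
 */
dst   = *this_cpu_ptr(&zswap_dstmem);
mutex =  this_cpu_ptr(&zswap_mutex);

/* c. Re-enable preemption before doing anything that may sleep. */
put_cpu_ptr(pool->acomp_ctx);

/*
 * The mutex is taken only after preemption is back on, since
 * mutex_lock() may sleep. Even if the task then migrates to
 * another CPU, the mutex still serializes access to the dstmem
 * buffer it was fetched together with.
 */
mutex_lock(mutex);
/* ... (de)compression via crypto_acomp, preemption enabled ... */
mutex_unlock(mutex);
```

If I got this right, the only preemption-disabled window is the pointer
derivation, while the actual (de)compression stays preemptible.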

> > > Earlier you had compression/decompression with disabled preemption
> > > and
> >
> > No. that is right now done in enabled preemption context with this
> > patch. The code before this patch was doing (de)compression in
> preemption-disabled context by using get_cpu_ptr and get_cpu_var.
>
> Exactly what I am saying. And within this get_cpu_ptr() section there was the
> compression/decompression sitting. So compression/decompression happened
> while preemption was off.
>
> > > strict per-CPU memory allocation. Now if you keep this per-CPU
> > > memory allocation then you gain a possible bottleneck.
> > > In the previous email you said that there may be a bottleneck in the
> > > upper layer where you can't utilize all that memory you allocate. So
> > > you may want to rethink that strategy before that rework.
> >
> > we are probably not talking about same thing :-) I was talking about
> > possible generic swap bottleneck. For example, LRU is global, while
> > swapping, multiple cores might have some locks on this LRU. for
> > example, if we have 8 inactive pages to swap out, I am not sure if mm
> > can use 8 cores to swap them out at the same time.
>
> In that case you probably don't need 8* per-CPU memory for this task.

Eventually I got what you mean: it seems we might be able to save some memory.
Since (de)compression has moved to a preemption-enabled context, we no longer
have to depend strictly on per-CPU pages.

I agree this could go on the todo list to investigate. But for this first patch,
which fixes the broken API connection, it does not seem to be the right time to
handle the memory-saving issue. It is actually quite complicated, as we would
need a per-NUMA pool for dstmem rather than a global pool.

>
> > >
> > > > 2. while allocating mutex, we can put the mutex into local memory
> > > > by using
> > > kmalloc_node().
> > > > If we move to "struct mutex lock" directly, most CPUs in a NUMA
> > > > server will
> > > have to access
> > > > remote memory to read/write the mutex, therefore, this will
> > > > increase the
> > > latency dramatically.
> > >
> > > If you need something per-CPU then DEFINE_PER_CPU() will give it to you.
> >
> > Yes. It is true.
> >
> > > It would be very bad for performance if this allocations were not
> > > from CPU-local memory, right? So what makes you think this is worse
> > > than
> > > kmalloc_node() based allocations?
> >
> > Yes. If your read zswap code, it has considered NUMA very carefully by
> > allocating various memory locally. And in crypto framework, I also added API
> to allocate local compression.
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/com
> > mit/?id=7bc13b5b60e94 this zswap patch has used the new node-aware
> > API.
> >
> > Memory access crossing NUMA node, practically crossing packages, can
> > dramatically increase, like double, triple or more.
>
> So you are telling me, DEFINE_PER_CPU() does not allocate the memory for
> each CPU to be local but kmalloc_node() does?

At the very beginning they are placed together, but after setup_per_cpu_areas()
they are copied into memory allocated by memblock with node information.
I missed that second part. You are right :-)
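So, if I understand it correctly, both approaches end up node-local; a rough
sketch of the two options, with hypothetical names for illustration:

```c
/*
 * Option 1: static per-CPU variable. It lives in the per-CPU area,
 * which setup_per_cpu_areas() replicates at boot into node-local
 * memory allocated from memblock -- so accesses stay on-node.
 */
static DEFINE_PER_CPU(u8 *, zswap_dstmem);

/*
 * Option 2: dynamic allocation, explicitly requesting memory on the
 * node that owns the CPU. (Hypothetical helper, not from the patch.)
 */
static int zswap_alloc_dstmem(int cpu)
{
	u8 *dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
			       cpu_to_node(cpu));
	if (!dst)
		return -ENOMEM;
	per_cpu(zswap_dstmem, cpu) = dst;
	return 0;
}
```

Either way the buffer itself is local to the CPU's node, so there should be no
NUMA penalty from using DEFINE_PER_CPU() here.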

Thanks
Barry