Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible

From: Yosry Ahmed
Date: Wed Feb 05 2025 - 14:06:36 EST


On Wed, Feb 05, 2025 at 11:43:16AM +0900, Sergey Senozhatsky wrote:
> On (25/02/04 17:19), Yosry Ahmed wrote:
> > > sizeof(struct zspage) change is one thing. Another thing is that
> > > zspage->lock is taken from atomic sections, pretty much everywhere.
> > > compaction/migration write-lock it under pool rwlock and class spinlock,
> > > but both compaction and migration now EAGAIN if the lock is locked
> > > already, so that is sorted out.
> > >
> > > The remaining problem is map(), which takes zspage read-lock under pool
> > > rwlock. RFC series (which you hated with passion :P) converted all
> > > zsmalloc locks into preemptible ones because of this - zspage->lock is a
> > > nested leaf-lock, so it cannot schedule unless locks it's nested under
> > > permit it (needless to say neither rwlock nor spinlock permit it).
> >
> > Hmm, so we want the lock to be preemptible, but we don't want to use an
> > existing preemptible lock because it may be held from atomic context.
> >
> > I think one problem here is that the lock you are introducing is a
> > spinning lock but the lock holder can be preempted. This is why spinning
> > locks do not allow preemption. Others waiting for the lock can spin
> > waiting for a process that is scheduled out.
> >
> > For example, the compaction/migration code could be sleeping holding the
> > write lock, and a map() call would spin waiting for that sleeping task.
>
> write-lock holders cannot sleep, that's the key part.
>
> So the rules are:
>
> 1) writer cannot sleep
>    - migration/compaction runs in atomic context and grabs
>      write-lock only from atomic context
>    - write-locking function disables preemption before lock(), just to be
>      safe, and enables it after unlock()
>
> 2) writer does not spin waiting
>    - that's why there is only write_try_lock function
>    - compaction and migration bail out when they cannot lock the
>      zspage
>
> 3) readers can sleep and can spin waiting for a lock
>    - other (even preempted) readers don't block new readers
>    - writers don't sleep, they always unlock

That's useful, thanks. If we go with custom locking we need to document
this clearly and add debug checks where possible.
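
To check my reading of the three rules, here is an untested userspace
sketch using C11 atomics. The zsl_* names and the counter encoding are
made up for illustration and are not taken from the patch. The kernel
version would presumably use atomic_t, cpu_relax() and
preempt_disable()/preempt_enable() where the comments say so.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical encoding: >0 = number of readers, 0 = free, -1 = writer. */
#define ZSL_UNLOCKED	0
#define ZSL_WRITE	(-1)

struct zsl {
	atomic_int state;
};

static void zsl_init(struct zsl *l)
{
	atomic_init(&l->state, ZSL_UNLOCKED);
}

/*
 * Rule 3: readers may spin, but only behind a writer, and writers never
 * sleep. Other readers (even preempted ones) never block a new reader.
 */
static void zsl_read_lock(struct zsl *l)
{
	int old = atomic_load_explicit(&l->state, memory_order_relaxed);

	for (;;) {
		if (old == ZSL_WRITE) {
			/* kernel: cpu_relax() would go here */
			old = atomic_load_explicit(&l->state,
						   memory_order_relaxed);
			continue;
		}
		/* On failure 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak_explicit(&l->state, &old,
							  old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return;
	}
}

static void zsl_read_unlock(struct zsl *l)
{
	int old = atomic_fetch_sub_explicit(&l->state, 1,
					    memory_order_release);

	/* Debug check: unlock without a matching read lock. */
	assert(old > 0);
	(void)old;
}

/*
 * Rules 1 and 2: the writer never waits. It either takes the lock at once
 * or fails, and the caller (compaction/migration) bails out.
 */
static bool zsl_write_trylock(struct zsl *l)
{
	int old = ZSL_UNLOCKED;

	/* kernel: preempt_disable() before the attempt */
	if (atomic_compare_exchange_strong_explicit(&l->state, &old,
						    ZSL_WRITE,
						    memory_order_acquire,
						    memory_order_relaxed))
		return true;
	/* kernel: preempt_enable() on failure */
	return false;
}

static void zsl_write_unlock(struct zsl *l)
{
	/* Debug check: only the writer may release the write lock. */
	assert(atomic_load_explicit(&l->state, memory_order_relaxed) ==
	       ZSL_WRITE);
	atomic_store_explicit(&l->state, ZSL_UNLOCKED, memory_order_release);
	/* kernel: preempt_enable() after unlock */
}
```

The asserts are the kind of debug checks I have in mind. Whether the
real thing can check "writer is in atomic context" cheaply is a separate
question.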

>
> > I wonder if there's a way to rework the locking instead to avoid the
> > nesting. It seems like sometimes we lock the zspage with the pool lock
> > held, sometimes with the class lock held, and sometimes with no lock
> > held.
> >
> > What are the rules here for acquiring the zspage lock?
>
> Most of that code is not written by me, but I think the rule is to disable
> "migration" be it via pool lock or class lock.

It seems like we're not holding either of these locks in
async_free_zspage() when we call lock_zspage(). Is it safe for a
different reason?

>
> > Do we need to hold another lock just to make sure the zspage does not go
> > away from under us?
>
> Yes, the page cannot go away via "normal" path:
> zs_free(last object) -> zspage becomes empty -> free zspage
>
> so when we have active mapping() it's only migration and compaction
> that can free zspage (its content is migrated and so it becomes empty).
>
> > Can we use RCU or something similar to do that instead?
>
> Hmm, I don't know... zsmalloc is not "read-mostly", it's whatever data
> patterns the clients have. I suspect we'd need to synchronize RCU every
> time a zspage is freed: zs_free() [this one is complicated], or migration,
> or compaction? Sounds like anti-pattern for RCU?

Can't we use kfree_rcu() instead of synchronizing? Not sure if this
would still be an antipattern tbh. It just seems like the current
locking scheme is really complicated :/