Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible

From: Sergey Senozhatsky
Date: Wed Feb 05 2025 - 22:06:09 EST

Next message: Joel Fernandes: "Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice"
Previous message: Damon Ding: "[PATCH v7 1/2] dt-bindings: display: rockchip: Fix label name of hdptxphy for RK3588 HDMI TX Controller"
In reply to: Yosry Ahmed: "Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible"
Next in thread: Sergey Senozhatsky: "Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On (25/02/05 19:06), Yosry Ahmed wrote:
> > > For example, the compaction/migration code could be sleeping holding the
> > > write lock, and a map() call would spin waiting for that sleeping task.
> >
> > write-lock holders cannot sleep, that's the key part.
> >
> > So the rules are:
> >
> > 1) writer cannot sleep
> > - migration/compaction runs in atomic context and grabs
> > write-lock only from atomic context
> > - write-locking function disables preemption before lock(), just to be
> > safe, and enables it after unlock()
> >
> > 2) writer does not spin waiting
> > - that's why there is only write_try_lock function
> > - compaction and migration bail out when they cannot lock the
> > zspage
> >
> > 3) readers can sleep and can spin waiting for a lock
> > - other (even preempted) readers don't block new readers
> > - writers don't sleep, they always unlock
>
> That's useful, thanks. If we go with custom locking we need to document
> this clearly and add debug checks where possible.

Sure. That's what it currently looks like (can always improve)

---
/*
 * zspage lock permits preemption on the reader-side (there can be multiple
 * readers). Writers (exclusive zspage ownership), on the other hand, are
 * always run in atomic context and cannot spin waiting for a (potentially
 * preempted) reader to unlock zspage. This, basically, means that writers
 * can only call write-try-lock and must bail out if it didn't succeed.
 *
 * At the same time, writers cannot reschedule under zspage write-lock,
 * so readers can spin waiting for the writer to unlock zspage.
 */
static void zspage_read_lock(struct zspage *zspage)
{
	atomic_t *lock = &zspage->lock;
	int old = atomic_read_acquire(lock);

	do {
		if (old == ZS_PAGE_WRLOCKED) {
			cpu_relax();
			old = atomic_read_acquire(lock);
			continue;
		}
	} while (!atomic_try_cmpxchg_acquire(lock, &old, old + 1));

#ifdef CONFIG_DEBUG_LOCK_ALLOC
	rwsem_acquire_read(&zspage->lockdep_map, 0, 0, _RET_IP_);
#endif
}

static void zspage_read_unlock(struct zspage *zspage)
{
	atomic_dec_return_release(&zspage->lock);

#ifdef CONFIG_DEBUG_LOCK_ALLOC
	rwsem_release(&zspage->lockdep_map, _RET_IP_);
#endif
}

static bool zspage_try_write_lock(struct zspage *zspage)
{
	atomic_t *lock = &zspage->lock;
	int old = ZS_PAGE_UNLOCKED;

	preempt_disable();
	if (atomic_try_cmpxchg_acquire(lock, &old, ZS_PAGE_WRLOCKED)) {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
		rwsem_acquire(&zspage->lockdep_map, 0, 0, _RET_IP_);
#endif
		return true;
	}

	preempt_enable();
	return false;
}

static void zspage_write_unlock(struct zspage *zspage)
{
	atomic_set_release(&zspage->lock, ZS_PAGE_UNLOCKED);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	rwsem_release(&zspage->lockdep_map, _RET_IP_);
#endif
	preempt_enable();
}
---

Maybe I'll just copy-paste the locking rules list, a list is always cleaner.
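
For experimenting with the rules outside the kernel, here is a rough userspace
sketch of the same state machine using C11 atomics. It is a model, not the
patch: the `model_*` names and `struct zspage_model` are made up for
illustration, the values 0 and -1 for ZS_PAGE_UNLOCKED and ZS_PAGE_WRLOCKED are
assumed (the definitions are not quoted above), and there is no lockdep or
preemption control. The reader loop is written as `for (;;)` so that the
compare-exchange is only attempted after observing a value that is not
write-locked; in C, `continue` inside a do/while jumps to the loop condition.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed values; the quoted code does not show the definitions. */
#define ZS_PAGE_UNLOCKED	0
#define ZS_PAGE_WRLOCKED	(-1)

/* Hypothetical stand-in for the lock word embedded in struct zspage. */
struct zspage_model {
	atomic_int lock;	/* >0: reader count, 0: unlocked, -1: writer */
};

/* Readers may spin: a writer never sleeps while holding the lock. */
static void model_read_lock(struct zspage_model *z)
{
	int old = atomic_load_explicit(&z->lock, memory_order_acquire);

	for (;;) {
		if (old == ZS_PAGE_WRLOCKED) {
			/* Writer active: reload and retry without a cmpxchg. */
			old = atomic_load_explicit(&z->lock,
						   memory_order_acquire);
			continue;
		}
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak_explicit(&z->lock, &old,
				old + 1, memory_order_acquire,
				memory_order_acquire))
			return;
	}
}

static void model_read_unlock(struct zspage_model *z)
{
	int old = atomic_fetch_sub_explicit(&z->lock, 1,
					    memory_order_release);

	/* Debug check: an unlock must pair with an earlier read lock. */
	assert(old > 0);
	(void)old;
}

/* Writers never wait: a single attempt, the caller bails out on failure. */
static bool model_try_write_lock(struct zspage_model *z)
{
	int old = ZS_PAGE_UNLOCKED;

	/* The kernel version additionally disables preemption here. */
	return atomic_compare_exchange_strong_explicit(&z->lock, &old,
			ZS_PAGE_WRLOCKED, memory_order_acquire,
			memory_order_relaxed);
}

static void model_write_unlock(struct zspage_model *z)
{
	/* Debug check: only the write-lock holder may call this. */
	assert(atomic_load_explicit(&z->lock, memory_order_relaxed) ==
	       ZS_PAGE_WRLOCKED);
	atomic_store_explicit(&z->lock, ZS_PAGE_UNLOCKED,
			      memory_order_release);
}
```

In this model a compaction-like caller would do
`if (!model_try_write_lock(&z)) return;` and simply skip the zspage, which is
rule 2) from the list above.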

> > > I wonder if there's a way to rework the locking instead to avoid the
> > > nesting. It seems like sometimes we lock the zspage with the pool lock
> > > held, sometimes with the class lock held, and sometimes with no lock
> > > held.
> > >
> > > What are the rules here for acquiring the zspage lock?
> >
> > Most of that code is not written by me, but I think the rule is to disable
> > "migration" be it via pool lock or class lock.
>
> It seems like we're not holding either of these locks in
> async_free_zspage() when we call lock_zspage(). Is it safe for a
> different reason?

I think we hold the size class lock there. async-free is only for pages that
reached 0 usage ratio (empty fullness group), so they don't hold any
objects any more, and from here such zspages either get freed or
find_get_zspage() recovers them from fullness 0 and allocates an object.
Both are synchronized by the size class lock.

> > Hmm, I don't know... zsmalloc is not "read-mostly", it's whatever data
> > patterns the clients have. I suspect we'd need to synchronize RCU every
> > time a zspage is freed: zs_free() [this one is complicated], or migration,
> > or compaction? Sounds like anti-pattern for RCU?
>
> Can't we use kfree_rcu() instead of synchronizing? Not sure if this
> would still be an antipattern tbh.

Yeah, I don't know. The last time I wrongly used kfree_rcu() it caused a
27% performance drop (some internal code). This zspage thingy may do
better, but it still has the potential to generate a high number of RCU
calls, depending on the clients. Probably the chances are too high. Apart
from that, kvfree_rcu() can sleep, as far as I understand, so zram might have
some extra things to deal with, namely slot-free notifications, which can
be called from softirq and are always called under a spinlock:

mm slot-free -> zram slot-free -> zs_free -> empty zspage -> kfree_rcu

> It just seems like the current locking scheme is really complicated :/

That's very true.
