Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super

From: Vladimir Davydov
Date: Mon Oct 09 2017 - 04:44:00 EST


On Sun, Oct 08, 2017 at 10:13:57PM +0100, Al Viro wrote:
> On Sun, Oct 08, 2017 at 06:47:46PM +0300, Vladimir Davydov wrote:
> > On Sun, Oct 08, 2017 at 03:03:32AM +0100, Al Viro wrote:
> > > On Sun, Oct 08, 2017 at 01:56:08AM +0100, Al Viro wrote:
> > >
> > > > What's more, we need to be careful about resize vs. drain. Right now it's
> > > > on list_lrus_mutex, but if we drop that around actual resize of an individual
> > > > list_lru, we'll need something else. Would there be any problem if we
> > > > took memcg_cache_ids_sem shared in memcg_offline_kmem()?
> > > >
> > > > The first problem is not fatal - we can e.g. use the sign of the field used
> > > > to store the number of ->memcg_lrus elements (i.e. stashed value of
> > > > memcg_nr_cache_ids at allocation or last resize) to indicate that actual
> > > > freeing is left for resizer...
> > >
> > > Ugh. That spinlock would have to be held over too much work, or bounced back
> > > and forth a lot on memcg shutdowns ;-/ Gets especially nasty if we want
> > > list_lru_destroy() callable from rcu callbacks. Oh, well...
> > >
> > > I still suspect that locking there is too heavy, but it looks like I don't have
> > > a better replacement.
> > >
> > > What are the realistic numbers of memcg on a big system?
> >
> > Several thousand. I guess we could turn list_lrus_mutex into a spin lock
> > by making resize/drain procedures handle list_lru destruction as you
> > suggested above, but list_lru_destroy() would still have to iterate over
> > all elements of list_lru_node->memcg_lrus array to free per-memcg
> > objects, which is too heavy to be performed under sb_lock IMHO.
>
> Hmm... Some observations:
> * struct list_lru_one is fairly small and it doesn't pack well - you
> get 25% overhead on it. Dedicated kmem_cache might've be worth doing.
> * in addition to list_lru_one (all allocated one by one), you have
> an array of pointers to them. And that array never shrinks. What's more,
> the objects pointed to are only 3 times bigger than pointers...
> * how hot is memcg destruction codepath? The only real argument
> in favour of list instead of hlist is list_splice(); if that's not that

This is an LRU list so we add objects to the tail and remove them from
the head (or vice versa). AFAICS hlist wouldn't be enough here as it
doesn't store a pointer to the list tail.

Anyway, no matter which list implementation we use, list or hlist,
we have to synchronize resizing of memcg_lrus against concurrent
list modifications. AFAIU we can achieve that either by holding
list_lru_node->lock while copying the array or using an array of
pointers, in which case we can copy the array without holding the
lock and only take the lock to swap pointers. Since the array may
grow quite big, I decided to use an array of pointers to avoid
holding the lock for too long.

> critical, hlist would do just fine. And that drives the size of
> struct list_lru_one down to two words, at which point getting rid of
> that array of pointers becomes rather attractive. Amount of cacheline
> bouncing might be an issue, but then it might be an overall win; can't
> tell without experiments. It certainly would've simplified the things
> a whole lot, including the rollback on allocation failure during resize,
> etc. And list_lru_destroy() would get several orders of magnitude cheaper...

IMHO at the same time it would complicate list_lru_walk(), because the
isolate() callback may release the list_lru_node->lock, which opens a
time window for the list_lru_one to be reallocated. Currently, we don't
have to care about this, because once we dereferenced a list_lru_one, we
can be sure it won't go away.

> * coallocating ->list with ->node[] is almost certain worth doing -
> AFAICS, it's a clear optimization, no matter whether we do anything else
> or not. Loops by list_lrus would be better off without fetching lru->node
> for every entry. _And_ the objects containing list_lru wouldn't be
> reachable via list_lrus.

By co-allocating ->node[] with list_lru we would gain nothing, because
then we wouldn't be able to embed list_lru in super_block and would have
to use a pointer there.

BTW, I think we need to remove alignment from super_block->s_dentry_lru
and s_inode_lru definitions. It seems to be left there from the time
when list_lru wasn't numa aware. Now, the alignment is guaranteed by the
definition of strcut list_lru_node.

>
> As for fs/super.c side... IMO destroy_super() ought to WARN_ON when
> it sees non-NULL ->node. Let alloc_super()/sget_userns() do those
> list_lru_destroy() directly for instances that get killed before
> becoming reachable via shared data structures; everything else must
> go through deactivate_locked_super(). The only reason for list_lru_destroy()
> in destroy_super() is the use of the latter for disposal of never-seen-by-anyone
> struct super_block instances. Come to think of that, something like this
> might be a good approach:

Makes sense to me.

>
> diff --git a/fs/super.c b/fs/super.c
> index 166c4ee0d0ed..8ca15415351a 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -154,21 +154,19 @@ static void destroy_super_rcu(struct rcu_head *head)
> schedule_work(&s->destroy_work);
> }
>
> -/**
> - * destroy_super - frees a superblock
> - * @s: superblock to free
> - *
> - * Frees a superblock.
> - */
> -static void destroy_super(struct super_block *s)
> +/* Free a superblock that has never been seen by anyone */
> +static void destroy_unused_super(struct super_block *s)
> {
> + if (!s)
> + return;
> + up_write(&s->s_umount);
> list_lru_destroy(&s->s_dentry_lru);
> list_lru_destroy(&s->s_inode_lru);
> security_sb_free(s);
> - WARN_ON(!list_empty(&s->s_mounts));
> put_user_ns(s->s_user_ns);
> kfree(s->s_subtype);
> - call_rcu(&s->rcu, destroy_super_rcu);
> + /* no delays needed */
> + destroy_super_work(&s->destroy_work);
> }
>
> /**
> @@ -256,7 +254,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
> return s;
>
> fail:
> - destroy_super(s);
> + destroy_unused_super(s);
> return NULL;
> }
>
> @@ -265,11 +263,17 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
> /*
> * Drop a superblock's refcount. The caller must hold sb_lock.
> */
> -static void __put_super(struct super_block *sb)
> +static void __put_super(struct super_block *s)
> {
> - if (!--sb->s_count) {
> - list_del_init(&sb->s_list);
> - destroy_super(sb);
> + if (!--s->s_count) {
> + list_del_init(&s->s_list);
> + WARN_ON(s->s_dentry_lru.node);
> + WARN_ON(s->s_inode_lru.node);
> + WARN_ON(!list_empty(&s->s_mounts));
> + security_sb_free(s);
> + put_user_ns(s->s_user_ns);
> + kfree(s->s_subtype);
> + call_rcu(&s->rcu, destroy_super_rcu);
> }
> }
>
> @@ -484,19 +488,12 @@ struct super_block *sget_userns(struct file_system_type *type,
> continue;
> if (user_ns != old->s_user_ns) {
> spin_unlock(&sb_lock);
> - if (s) {
> - up_write(&s->s_umount);
> - destroy_super(s);
> - }
> + destroy_unused_super(s);
> return ERR_PTR(-EBUSY);
> }
> if (!grab_super(old))
> goto retry;
> - if (s) {
> - up_write(&s->s_umount);
> - destroy_super(s);
> - s = NULL;
> - }
> + destroy_unused_super(s);
> return old;
> }
> }
> @@ -511,8 +508,7 @@ struct super_block *sget_userns(struct file_system_type *type,
> err = set(s, data);
> if (err) {
> spin_unlock(&sb_lock);
> - up_write(&s->s_umount);
> - destroy_super(s);
> + destroy_unused_super(s);
> return ERR_PTR(err);
> }
> s->s_type = type;
>