Re: [External] RE: kernel warning percpu ref in obj_cgroup_release

From: Muchun Song
Date: Tue Mar 30 2021 - 12:27:16 EST


On Tue, Mar 30, 2021 at 11:10 PM Christian Borntraeger
<borntraeger@xxxxxxxxxx> wrote:
>
>
> On 30.03.21 15:49, Muchun Song wrote:
> > On Tue, Mar 30, 2021 at 9:27 PM Christian Borntraeger
> > <borntraeger@xxxxxxxxxx> wrote:
> >>
> >> So bisect shows this for belows warning:
> >
> > Thanks for your effort on this. Can you share your config?
>
attached (but it's s390x) for next-20210330

Thanks. Can you apply the following patch and help me test it?
Thanks very much.

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7fdc92e1983e..579408e4d46f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -793,6 +793,12 @@ static inline void obj_cgroup_get(struct obj_cgroup *objcg)
 	percpu_ref_get(&objcg->refcnt);
 }
 
+static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
+				       unsigned long nr)
+{
+	percpu_ref_get_many(&objcg->refcnt, nr);
+}
+
 static inline void obj_cgroup_put(struct obj_cgroup *objcg)
 {
 	percpu_ref_put(&objcg->refcnt);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c0b83a396299..1634dba1044c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3133,7 +3133,10 @@ void split_page_memcg(struct page *head, unsigned int nr)

 	for (i = 1; i < nr; i++)
 		head[i].memcg_data = head->memcg_data;
-	css_get_many(&memcg->css, nr - 1);
+	if (PageMemcgKmem(head))
+		obj_cgroup_get_many(__page_objcg(head), nr - 1);
+	else
+		css_get_many(&memcg->css, nr - 1);
 }
 
 #ifdef CONFIG_MEMCG_SWAP
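
For context, here is what split_page_memcg() looks like with the patch
applied (a sketch reconstructed from the diff above, not a verbatim copy
of the tree). When a compound page is split, every tail page inherits
the head's memcg_data word, so one extra reference per tail page has to
be taken on whatever that word points to: an obj_cgroup for kmem pages,
the memcg css otherwise. Taking css references for kmem pages leaves the
objcg refcount under-counted, and the later underflow in
obj_cgroup_release() is exactly what the percpu ref warning reports.

void split_page_memcg(struct page *head, unsigned int nr)
{
	struct mem_cgroup *memcg = page_memcg(head);
	int i;

	if (mem_cgroup_disabled() || !memcg)
		return;

	/* Tail pages share the head's memcg_data (objcg or memcg). */
	for (i = 1; i < nr; i++)
		head[i].memcg_data = head->memcg_data;

	/* Each copy is an extra reference; bump the matching refcount. */
	if (PageMemcgKmem(head))
		obj_cgroup_get_many(__page_objcg(head), nr - 1);
	else
		css_get_many(&memcg->css, nr - 1);
}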

>
> The problem goes away when I add
> cgroup_controllers = [ ]
> to /etc/libvirt/qemu.conf
>
> The testcase that triggers the problem starts and stops multiple KVM guests with 248 CPUs.
> Do we happen to have maybe only a byte of refcount space?
>
>
> >
> >>
> >> 636c3ef8229ecb4e7d045e86f36505d24a8f019a is the first bad commit
> >> commit 636c3ef8229ecb4e7d045e86f36505d24a8f019a
> >> Author: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> >> Date: Mon Mar 29 11:12:06 2021 +1100
> >>
> >> mm: memcontrol: use obj_cgroup APIs to charge kmem pages
> >>
> >> Since Roman's series "The new cgroup slab memory controller" was
> >> applied, all slab objects have been charged via the new obj_cgroup
> >> APIs. The new APIs introduce a struct obj_cgroup to charge slab
> >> objects; it prevents long-living objects from pinning the original
> >> memory cgroup in memory. But there are still some corner objects
> >> (e.g. allocations larger than an order-1 page on SLUB) which are not
> >> charged via the new APIs. Those objects (including pages allocated
> >> directly from the buddy allocator) are charged as kmem pages, which
> >> still hold a reference to the memory cgroup.
> >>
> >> We want to reuse the obj_cgroup APIs to charge the kmem pages. If we do
> >> that, we should store an object cgroup pointer in page->memcg_data for
> >> the kmem pages.
> >>
> >> Finally, page->memcg_data will have 3 different meanings.
> >>
> >> 1) For the slab pages, page->memcg_data points to an object cgroups
> >> vector.
> >>
> >> 2) For the kmem pages (exclude the slab pages), page->memcg_data
> >> points to an object cgroup.
> >>
> >> 3) For the user pages (e.g. the LRU pages), page->memcg_data points
> >> to a memory cgroup.
> >>
> >> We do not change the behavior of page_memcg() and page_memcg_rcu(); they
> >> remain suitable for both LRU pages and kmem pages. Why?
> >>
> >> Because memory allocations that pin memcgs for a long time exist at a
> >> larger scale and are causing recurring problems in the real world: page
> >> cache doesn't get reclaimed for a long time, or is used by the second,
> >> third, fourth, ... instance of the same job that was restarted into a new
> >> cgroup every time. Unreclaimable dying cgroups pile up, waste memory, and
> >> make page reclaim very inefficient.
> >>
> >> We can convert LRU pages and most other raw memcg pins to the objcg
> >> direction to fix this problem, and then page->memcg_data will always
> >> point to an object cgroup. At that point, LRU pages and kmem pages will
> >> be treated the same, and the implementation of page_memcg() can drop the
> >> kmem page check.
> >>
> >> This patch aims to charge kmem pages via the new obj_cgroup APIs. As a
> >> result, the page->memcg_data of a kmem page points to an object cgroup.
> >> We can use __page_objcg() to get the object cgroup associated with a
> >> kmem page, or page_memcg() to get the memory cgroup associated with it,
> >> but in the latter case the caller must ensure that the returned memcg
> >> won't be released (e.g. by holding rcu_read_lock or css_set_lock).
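
The resulting lookup path, roughly (a sketch of the patch described
above; obj_cgroup_memcg() returns the memcg the objcg belongs to, which
is why the caller has to keep that memcg alive):

static inline struct mem_cgroup *page_memcg(struct page *page)
{
	VM_BUG_ON_PAGE(PageSlab(page), page);

	if (PageMemcgKmem(page))
		/* kmem page: memcg_data carries an obj_cgroup pointer */
		return obj_cgroup_memcg(__page_objcg(page));
	else
		/* LRU page: memcg_data carries the memcg pointer itself */
		return __page_memcg(page);
}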
> >>
> >> Link: https://lkml.kernel.org/r/20210319163821.20704-6-songmuchun@xxxxxxxxxxxxx
> >> Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> >> Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> >> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> >> Cc: Roman Gushchin <guro@xxxxxx>
> >> Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>
> >> Cc: Vladimir Davydov <vdavydov.dev@xxxxxxxxx>
> >> Cc: Xiongchun Duan <duanxiongchun@xxxxxxxxxxxxx>
> >> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> >> Signed-off-by: Stephen Rothwell <sfr@xxxxxxxxxxxxxxxx>
> >>
> >> include/linux/memcontrol.h | 116 +++++++++++++++++++++++++++++++++++----------
> >> mm/memcontrol.c | 110 +++++++++++++++++++++---------------------
> >> 2 files changed, 145 insertions(+), 81 deletions(-)
> >>
> >> On 30.03.21 13:32, Christian Borntraeger wrote:
> >> [...]
> >>>
> >>> This linux-next tree (next-20210328 is fine) triggers several bugs during our KVM CI run:
> >>>
> >>> [ 1506.494716] ------------[ cut here ]------------
> >>> [ 1506.494730] percpu ref (obj_cgroup_release) <= 0 (-1) after switching to atomic
> >>> [ 1506.494766] WARNING: CPU: 6 PID: 0 at lib/percpu-refcount.c:196 percpu_ref_switch_to_atomic_rcu+0x1ea/0x1f8
> >>> [ 1506.494774] Modules linked in: kvm vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct dm_service_time nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink zfcp scsi_transport_fc rpcrdma sunrpc dm_multipath rdma_ucm scsi_dh_rdac scsi_dh_emc rdma_cm scsi_dh_alua iw_cm ib_cm mlx5_ib ib_uverbs dm_mod ib_core s390_trng vfio_ccw vfio_mdev mdev vfio_iommu_type1 zcrypt_cex4 vfio eadm_sch sch_fq_codel configfs ip_tables x_tables ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 mlx5_core sha512_s390 sha256_s390 sha1_s390 sha_common nvme nvme_core pkey zcrypt rng_core autofs4 [last unloaded: vfio_ap]
> >>> [ 1506.494832] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 5.12.0-20210330.rc4.git0.9d49ed9ca93b.300.fc33.s390x+next #1
> >>> [ 1506.494834] Hardware name: IBM 8561 T01 703 (LPAR)
> >>> [ 1506.494836] Krnl PSW : 0704c00180000000 00000002d71dd21e (percpu_ref_switch_to_atomic_rcu+0x1ee/0x1f8)
> >>> [ 1506.494840] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> >>> [ 1506.494842] Krnl GPRS: c0000000fffeffff 00000002f7256818 0000000000000043 00000000fffeffff
> >>> [ 1506.494844] 00000000ffffffea 0000038000000001 0000000000000000 000003800000017c
> >>> [ 1506.494846] 00000002d7924988 0000000227eb97a0 000003ff5413c7e0 7fffffffffffffff
> >>> [ 1506.494848] 0000000080360000 00000002f726b570 00000002d71dd21a 00000380000bba28
> >>> [ 1506.494856] Krnl Code: 00000002d71dd20e: e3309fe8ff04 lg %r3,-24(%r9)
> >>>                            00000002d71dd214: c0e5001eb556        brasl   %r14,00000002d75b3cc0
> >>>                           #00000002d71dd21a: af000000            mc      0,0
> >>>                           >00000002d71dd21e: a7f4ffcc            brc     15,00000002d71dd1b6
> >>>                            00000002d71dd222: 0707                bcr     0,%r7
> >>>                            00000002d71dd224: 0707                bcr     0,%r7
> >>>                            00000002d71dd226: 0707                bcr     0,%r7
> >>>                            00000002d71dd228: eb6ff0480024        stmg    %r6,%r15,72(%r15)
> >>> [ 1506.494928] Call Trace:
> >>> [ 1506.494933] [<00000002d71dd21e>] percpu_ref_switch_to_atomic_rcu+0x1ee/0x1f8
> >>> [ 1506.494940] ([<00000002d71dd21a>] percpu_ref_switch_to_atomic_rcu+0x1ea/0x1f8)
> >>> [ 1506.494942] [<00000002d6b8a6c6>] rcu_do_batch+0x146/0x608
> >>> [ 1506.494946] [<00000002d6b8ec04>] rcu_core+0x124/0x1d0
> >>> [ 1506.494948] [<00000002d75d0222>] __do_softirq+0x13a/0x3c8
> >>> [ 1506.494952] [<00000002d6b05306>] irq_exit+0xce/0xf8
> >>> [ 1506.494955] [<00000002d75c1eb4>] do_ext_irq+0xdc/0x170
> >>> [ 1506.494957] [<00000002d75cdea4>] ext_int_handler+0xc4/0xf4
> >>> [ 1506.494959] [<0000000000000000>] 0x0
> >>> [ 1506.494963] [<00000002d75cd9c2>] default_idle_call+0x42/0x110
> >>> [ 1506.494965] [<00000002d6b411a0>] do_idle+0xd8/0x168
> >>> [ 1506.494968] [<00000002d6b413ee>] cpu_startup_entry+0x36/0x40
> >>> [ 1506.494971] [<00000002d6ac730a>] smp_start_secondary+0x82/0x88
> >>> [ 1506.494974] Last Breaking-Event-Address:
> >>> [ 1506.494975] [<00000002d6b71898>] vprintk_emit+0xa8/0x110
> >>> [ 1506.494978] Kernel panic - not syncing: panic_on_warn set ...
> >>>
> >>>
> >>>
> >>> I will try to bisect this; CCing some candidates in case anyone has an idea.
> >>