Re: [PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk_alloc()
From: Eric Dumazet
Date: Thu Feb 01 2018 - 18:27:27 EST
Well, this memcg stuff is so confusing.
My recollection is that we had :
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=d979a39d7242e0601bf9b60e89628fb8ac577179
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=75cb070960ade40fba5de32138390f3c85c90941
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=c0576e3975084d4699b7bfef578613fb8e1144f6
And commit a590b90d472f2c176c140576ee3ab44df7f67839 as well
Honestly bug was closed months ago for us, based on stack traces on the wild.
No C repro or whatever, but reproducing it would be a matter of
having a TCP listener constantly doing a
socket()/setsockopt(REUSEADDR)/bind()/listen()/close() in a loop,
while connections are attempted to the listening port.
On Thu, Feb 1, 2018 at 2:55 PM, Roman Gushchin <guro@xxxxxx> wrote:
> On Thu, Feb 01, 2018 at 01:17:56PM -0800, Eric Dumazet wrote:
>> On Thu, Feb 1, 2018 at 12:22 PM, Roman Gushchin <guro@xxxxxx> wrote:
>> > On Thu, Feb 01, 2018 at 10:16:55AM -0500, David Miller wrote:
>> >> From: Roman Gushchin <guro@xxxxxx>
>> >> Date: Wed, 31 Jan 2018 21:54:08 +0000
>> >>
>> >> > So I really start thinking that reverting 9f1c2674b328
>> >> > ("net: memcontrol: defer call to mem_cgroup_sk_alloc()")
>> >> > and fixing the original issue differently might be easier
>> >> > and a proper way to go. Does it makes sense?
>> >>
>> >> You'll need to work that out with Eric Dumazet who added the
>> >> change in question which you think we should revert.
>> >
>> > Eric,
>> >
>> > can you, please, provide some details about the use-after-free problem
>> > that you've fixed with commit 9f1c2674b328 ("net: memcontrol: defer call
>> > to mem_cgroup_sk_alloc()" ? Do you know how to reproduce it?
>> >
>> > Deferring mem_cgroup_sk_alloc() breaks socket memory accounting
>> > and makes it much more fragile in general. So, I wonder, if there are
>> > solutions for the use-after-free problem.
>> >
>> > Thank you!
>> >
>> > Roman
>>
>> Unfortunately bug is not public (Google-Bug-Id 67556600 for Googlers
>> following this thread )
>>
>> Our kernel has a debug feature on percpu_ref_get_many() which detects
>> the typical use-after-free problem of
>> doing atomic_long_add(nr, &ref->count); while ref->count is 0, or
>> memory already freed.
>>
>> Bug was serious because css_put() will release the css a second time.
>>
>> Stack trace looked like :
>>
>> Oct 8 00:23:14 lphh23 kernel: [27239.568098] <IRQ>
>> [<ffffffff909d2fb1>] dump_stack+0x4d/0x6c
>> Oct 8 00:23:14 lphh23 kernel: [27239.568108] [<ffffffff906df6e3>] ?
>> cgroup_get+0x43/0x50
>> Oct 8 00:23:14 lphh23 kernel: [27239.568114] [<ffffffff906f2f35>]
>> warn_slowpath_common+0xac/0xc8
>> Oct 8 00:23:14 lphh23 kernel: [27239.568117] [<ffffffff906f2f6b>]
>> warn_slowpath_null+0x1a/0x1c
>> Oct 8 00:23:14 lphh23 kernel: [27239.568120] [<ffffffff906df6e3>]
>> cgroup_get+0x43/0x50
>> Oct 8 00:23:14 lphh23 kernel: [27239.568123] [<ffffffff906e07a4>]
>> cgroup_sk_alloc+0x64/0x90
>
> Hm, that looks strange... It's cgroup_sk_alloc(),
> not mem_cgroup_sk_alloc(), which was removed by 9f1c2674b328.
>
> I thought, that it's css_get() in mem_cgroup_sk_alloc(), which
> you removed, but the stacktrace you've posted is different.
>
> void mem_cgroup_sk_alloc(struct sock *sk) {
> /*
> * Socket cloning can throw us here with sk_memcg already
> * filled. It won't however, necessarily happen from
> * process context. So the test for root memcg given
> * the current task's memcg won't help us in this case.
> *
> * Respecting the original socket's memcg is a better
> * decision in this case.
> */
> if (sk->sk_memcg) {
> BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
>>>> css_get(&sk->sk_memcg->css);
> return;
> }
>
> Is it possible to reproduce the issue on an upstream kernel?
> Any ideas of what can trigger it?
>
> Btw, with the following patch applied (below) and cgroup v2 enabled,
> the issue, which I'm talking about, can be reproduced in seconds after reboot
> by doing almost any network activity. Just sshing to a machine is enough.
> The corresponding warning will be printed to dmesg.
>
> What is a proper way to fix the socket memory accounting in this case,
> what do you think?
>
> Thank you!
>
> Roman
>
> --
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c51c589..55fb890 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -276,6 +276,8 @@ struct mem_cgroup {
> struct list_head event_list;
> spinlock_t event_list_lock;
>
> + atomic_t tcpcnt;
> +
> struct mem_cgroup_per_node *nodeinfo[0];
> /* WARNING: nodeinfo must be the last member here */
> };
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 19eea69..c69ff04 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5623,6 +5623,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
> seq_printf(m, "workingset_nodereclaim %lu\n",
> stat[WORKINGSET_NODERECLAIM]);
>
> + seq_printf(m, "tcpcnt %d\n", atomic_read(&memcg->tcpcnt));
> +
> return 0;
> }
>
> @@ -6139,6 +6141,8 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> {
> gfp_t gfp_mask = GFP_KERNEL;
>
> + atomic_add(nr_pages, &memcg->tcpcnt);
> +
> if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> struct page_counter *fail;
>
> @@ -6171,6 +6175,11 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> */
> void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> {
> + int v = atomic_sub_return(nr_pages, &memcg->tcpcnt);
> + if (v < 0) {
> + pr_info("@@@ %p %d \n", memcg, v);
> + }
> +
> if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> page_counter_uncharge(&memcg->tcpmem, nr_pages);
> return;