Re: Mysterious CFQ crash and RCU

From: Vivek Goyal
Date: Thu Aug 04 2011 - 15:52:03 EST


On Thu, Aug 04, 2011 at 09:43:27PM +0200, Jens Axboe wrote:
> On 2011-08-04 17:05, Vivek Goyal wrote:
> > On Sat, Jun 04, 2011 at 04:06:33PM -0700, Paul E. McKenney wrote:
> >> On Sun, Jun 05, 2011 at 12:48:05AM +0200, Paul Bolle wrote:
> >>> On Sat, 2011-06-04 at 09:03 -0700, Paul E. McKenney wrote:
> >>>> More like "based on these diagnostics, I see no evidence of the RCU
> >>>> implementation misbehaving." Which is of course different than "I can
> >>>> prove that the RCU implementation is not misbehaving". That said, the
> >>>> fact that you are running on a single CPU makes it hard for me to see
> >>>> any latitude for RCU-implementation misbehavior.
> >>>>
> >>>> Clearly something is wrong somewhere.
> >>>
> >>> Yes.
> >>>
> >>>> Given the fact that on a single-CPU
> >>>> system, synchronize_rcu() is a no-op, and given that you weren't able
> >>>> to reproduce with CONFIG_TREE_PREEMPT_RCU=y, my guess is that there is
> >>>> a synchronize_rcu() that occasionally (illegally) gets executed within
> >>>> an RCU read-side critical section.
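
(As an aside, the illegal pattern being described above would look roughly
like the sketch below. This is a made-up illustration, not CFQ code;
struct foo, reader() and helper_that_may_free() are hypothetical names.)

#include <linux/rculist.h>
#include <linux/rcupdate.h>

/* Hypothetical illustration only -- not CFQ code. */
struct foo {
        struct list_head node;
};

static void helper_that_may_free(struct foo *f);

static void reader(struct list_head *head)
{
        struct foo *f;

        rcu_read_lock();
        list_for_each_entry_rcu(f, head, node) {
                /*
                 * If this helper (directly or indirectly) ends up doing
                 *
                 *         synchronize_rcu();
                 *         kfree(f);
                 *
                 * then on a one-CPU !PREEMPT kernel the synchronize_rcu()
                 * is a no-op, so nothing holds off the kfree() even though
                 * we are still inside the read-side critical section and
                 * are about to step from f to its (now poisoned) ->next.
                 */
                helper_that_may_free(f);
        }
        rcu_read_unlock();
}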
> >>>
> >>> I think I finally found it!
> >>
> >> Very cool!!!
> >>
> >>> The culprit seems to be io_context.ioc_data (not the clearest of
> >>> names!). It seems to be a single-entry "last-hit cache" of an hlist
> >>> called cic_list. (There are three, subtly different, cic_lists in the
> >>> CFQ code!) It is not entirely clear, but that last-hit cache can get out
> >>> of sync with the hlist it is supposed to cache. My guess is that every
> >>> now and then a member of the hlist gets deleted while it's still in that
> >>> (single-entry) cache. If it then gets retrieved from that cache, it
> >>> already points to poisoned memory. For some strange reason this only
> >>> results in an Oops if one or more debugging options are set (as they are
> >>> in the Fedora Rawhide non-stable kernels in which I ran into this). I
> >>> have no clue whatsoever why that is ...
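
(Sketched roughly below; cic_lookup() and cic_unlink() are made-up
stand-ins for the real lookup/teardown paths in cfq-iosched.c, but the
ioc->ioc_data / cic_list relationship is the one described above.)

/* Paraphrased and simplified -- not the literal cfq-iosched.c code. */

/* Lookup side (caller holds rcu_read_lock()): the one-entry cache is
 * tried before walking cic_list. */
static struct cfq_io_context *cic_lookup(struct io_context *ioc, void *cfqd)
{
        struct cfq_io_context *cic;

        cic = rcu_dereference(ioc->ioc_data);
        if (cic && cic->key == cfqd)
                return cic;             /* fast-path hit */

        /* ... otherwise walk ioc->cic_list under RCU ... */
        return NULL;
}

/* Teardown side: unlink the cic and free it after a grace period. */
static void cic_unlink(struct io_context *ioc, struct cfq_io_context *cic)
{
        hlist_del_rcu(&cic->cic_list);
        /*
         * If ioc->ioc_data is not also invalidated here, e.g. with
         *
         *         rcu_assign_pointer(ioc->ioc_data, NULL);
         *
         * a later lookup can still return this cic from the one-entry
         * cache after the memory has been freed and poisoned.
         */
        call_rcu(&cic->rcu_head, cfq_cic_free_rcu);
}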
> >>
> >> Oooh... If I understand you correctly, this is exactly analogous to my
> >> first-ever RCU bug way back in the early 90s. This was a hashed version
> >> of a BSD-style routing table, where each hash bucket maintained a pointer
> >> to the last routing entry used. When I deleted a routing entry, I forgot
> >> to check the last-entry pointer.
> >>
> >> The amazing thing is that the guy who found and fixed this bug generated
> >> a correct fix despite having absolutely no idea what RCU was or how it
> >> worked. All he knew was that if he waited for the sun to reach Oregon,
> >> his users were going to be way beyond upset. ;-)
> >>
> >>> Anyhow, after ripping out ioc_data this bug seems to have disappeared!
> >>> Jens, Vivek, could you please have a look at this? In the meantime I
> >>> hope to pinpoint this issue and draft a small patch to really solve it
> >>> (i.e., not by simply ripping out ioc_data).
> >>
> >> Again, very cool!!!
> >
> > Hi Paul,
> >
> > Remember this CFQ issue where we suspected that we were freeing up an
> > object before the RCU grace period was over? We thought we had fixed it.
> > Guess what, it seems to be back.
> >
> > **************************************************************************
> > [21265.163144] BUG: unable to handle kernel paging request at 6b6b6b43
> > [21265.164022] IP: [<f7d55fe0>] cic_free_func+0x10/0x6c [cfq_iosched]
> > [21265.164022] *pde = 00000000
> > [21265.164022] Oops: 0000 [#1] SMP
> > [21265.164022] Modules linked in: cfq_iosched tcp_lp fuse arc4 ppdev
> > snd_intel8x0m snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device
> > microcode snd_pcm joydev snd_timer ath5k pcspkr snd_page_alloc
> > thinkpad_acpi i2c_i801 ath mac80211 snd soundcore cfg80211 rfkill
> > parport_pc parport e1000 iTCO_wdt iTCO_vendor_support binfmt_misc uinput
> > yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core
> > [last unloaded: scsi_wait_scan]
> > [21265.164022]
> > [21265.164022] Pid: 18514, comm: mandb Not tainted 3.0.0-1.local2.fc17.i686 #1 IBM /
> > [21265.164022] EIP: 0060:[<f7d55fe0>] EFLAGS: 00010286 CPU: 0
> > [21265.164022] EIP is at cic_free_func+0x10/0x6c [cfq_iosched]
> > [21265.164022] EAX: f242e400 EBX: 6b6b6b43 ECX: 00000246 EDX: 6b6b6b43
> > [21265.164022] ESI: f242e400 EDI: f7d55fd0 EBP: df92df04 ESP: df92def0
> > [21265.164022] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> > [21265.164022] Process mandb (pid: 18514, ti=df92c000 task=ed1554c0 task.ti=df92c000)
> > [21265.164022] Stack:
> > [21265.164022]  c0468df7 c0af0cb0 6b6b6b6b f242e400 f7d55fd0 df92df18 f7d54510 f242e400
> > [21265.164022]  ed1554c0 ed155898 df92df20 f7d54530 df92df2c c05cc8ba f242e400 df92df40
> > [21265.164022]  c05cc94a ed1554c0 00000012 df92df74 df92df8c c04423e2 eedf2b5c eedf2b00
> > [21265.164022] Call Trace:
> > [21265.164022] [<c0468df7>] ? lock_is_held+0x73/0x7b
> > [21265.164022] [<f7d55fd0>] ? cfqq_process_refs+0x18/0x18 [cfq_iosched]
> > [21265.164022] [<f7d54510>] call_for_each_cic+0x3e/0x4c [cfq_iosched]
> > [21265.164022] [<f7d54530>] cfq_free_io_context+0x12/0x14 [cfq_iosched]
> > [21265.164022] [<c05cc8ba>] put_io_context+0x34/0x5c
> > [21265.164022] [<c05cc94a>] exit_io_context+0x68/0x6d
> > [21265.164022] [<c04423e2>] do_exit+0x62f/0x66a
> > [21265.164022] [<c0442634>] do_group_exit+0x63/0x86
> > [21265.164022] [<c044266f>] sys_exit_group+0x18/0x18
> > [21265.164022] [<c083b0df>] sysenter_do_call+0x12/0x38
> > [21265.164022] Code: 5f 5d c3 55 89 e5 3e 8d 74 26 00 8b 50 3c 03 50 40 8b 00 29 d0 79 02
> > 0f 0b 5d c3 55 89 e5 57 56 53 83 ec 08 3e 8d 74 26 00 89 d3 <8b> 12 89 c6
> > f6 c2 01 75 02 0f 0b 8d 78 08 89 f8 89 55 ec e8 c8
> > [21265.164022] EIP: [<f7d55fe0>] cic_free_func+0x10/0x6c [cfq_iosched] SS:ESP 0068:df92def0
> > [21265.164022] CR2: 000000006b6b6b43
> > [21265.363749] ---[ end trace a779435da6d0a152 ]---
> >
> > **************************************************************************
> > Again, details have been added to the old bz.
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=577968
> >
> > Now two users have run into this issue. PaulB can again reliably
> > reproduce it, even with 3.0 kernels; he just needs to run "mandb -c" on
> > his machine.
> >
> > He has again run the CFQ debug V2 patch, which seems to be telling us
> > that a cfq_io_context object was freed while inside an RCU read-side
> > critical section.
> >
> > **************************************************************************
> > [ 562.598881] debug: Will exit io context cic=0xf0469230, Next pointer is (null)
> > [ 562.598881] debug: cic=0xf0469230 exit io context done. Next pointer is (null)
> > [ 562.598881] debug: Will decouple-free cic cic=0xf0469230, Next pointer is (null)
> > [ 562.633403] debug: cic=0xf0469230 decouple-free cic done. Next pointer is (null)
> > [ 562.644949] debug: freeing up cic=f0469230
> > [ 562.652971] cic=f0469230 freed while in RCU read lock
> >
> > *****************************************************************************
> >
> > I am hoping that I have written the debug patch correctly. As usual, I am
> > running out of ideas on how to debug this further. Is there any debug
> > information you would like to have a look at?
>
> Darn! Vivek, can you re-send the debug patch for inspection?

Sure, Jens. Here it is.

Thanks
Vivek


---
block/cfq-iosched.c | 43 ++++++++++++++++++++++++++++++++++++++++---
1 file changed, 40 insertions(+), 3 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c 2011-05-18 18:06:11.499007946 -0400
+++ linux-2.6/block/cfq-iosched.c 2011-05-18 18:54:12.896660278 -0400
@@ -2581,6 +2581,17 @@ static void cfq_put_queue(struct cfq_que
         cfq_put_cfqg(cfqg);
 }
 
+static void cfq_free_io_context(struct io_context *ioc);
+static void cfq_exit_io_context(struct io_context *ioc);
+static void cfq_ioc_set_ioprio(struct io_context *ioc);
+static void cfq_ioc_set_cgroup(struct io_context *ioc);
+
+static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic);
+static void cfq_exit_single_io_context(struct io_context *ioc,
+                                       struct cfq_io_context *cic);
+static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic);
+static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic);
+
 /*
  * Call func for each cic attached to this ioc.
  */
@@ -2589,13 +2600,37 @@ call_for_each_cic(struct io_context *ioc
                   void (*func)(struct io_context *, struct cfq_io_context *))
 {
         struct cfq_io_context *cic;
-        struct hlist_node *n;
+        struct hlist_node *n, *next;
+
+        char *action;
+
+
+        if ((void *) func == (void *) cic_free_func)
+                action = "decouple-free cic";
+        else if ((void *) func == (void *) cfq_exit_single_io_context)
+                action = "exit io context";
+        else if ((void *) func == (void *) changed_ioprio)
+                action = "change ioprio";
+        else if ((void *) func == (void *) changed_cgroup)
+                action = "change cgroup";
+        else
+                action = "unknown";
 
         rcu_read_lock();
 
-        hlist_for_each_entry_rcu(cic, n, &ioc->cic_list, cic_list)
+        hlist_for_each_entry_rcu(cic, n, &ioc->cic_list, cic_list) {
+                printk("debug: Will %s cic=0x%p, Next pointer is %p\n",
+                       action, cic, cic->cic_list.next);
                 func(ioc, cic);
-
+                printk("debug: cic=0x%p %s done. Next pointer is %p\n", cic,
+                       action, cic->cic_list.next);
+                next = rcu_dereference_raw(hlist_next_rcu(&cic->cic_list));
+
+                if (next == (void *) 0x6b6b6b6b) {
+                        printk("cic=%p freed while in RCU read lock\n", cic);
+                        BUG();
+                }
+        }
         rcu_read_unlock();
 }
 
@@ -2605,6 +2640,7 @@ static void cfq_cic_free_rcu(struct rcu_
 
         cic = container_of(head, struct cfq_io_context, rcu_head);
 
+        printk("debug: freeing up cic=%p\n", cic);
         kmem_cache_free(cfq_ioc_pool, cic);
         elv_ioc_count_dec(cfq_ioc_count);
 
@@ -3019,6 +3055,7 @@ cfq_drop_dead_cic(struct cfq_data *cfqd,
         hlist_del_rcu(&cic->cic_list);
         spin_unlock_irqrestore(&ioc->lock, flags);
 
+        printk("debug: Dropping dead cic=%p\n", cic);
         cfq_cic_free(cic);
 }

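A side note on the magic constant in the check above: 0x6b is the slab
POISON_FREE pattern (see include/linux/poison.h), which is only written
over freed objects when slab debugging/poisoning is enabled. That would
also explain why the oops only shows up on debug kernels: without
poisoning, the stale pointer keeps pointing at memory that often still
looks valid. If the check stays around, a (hypothetical, 32-bit-specific)
way to avoid hard-coding the value would be:

#include <linux/poison.h>

/*
 * POISON_FREE (0x6b) is the byte slab debugging writes into freed
 * objects; on the 32-bit kernels where this reproduces, a poisoned
 * pointer therefore reads back as 0x6b6b6b6b.
 */
unsigned long poison = POISON_FREE * 0x01010101UL;

if ((unsigned long) next == poison) {
        printk("cic=%p freed while in RCU read lock\n", cic);
        BUG();
}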