Re: Assertion failure in log_do_checkpoint(): "drop_count != 0 || cleanup_ret != 0"

From: Jan Kara
Date: Fri May 06 2005 - 05:27:31 EST


> ons 2005-05-04 klockan 21:43 +0400 skrev cron:
> > Got a 2.6.12-rc3 spontaneous reboot on i386 SMP after about 20h of
> > moderate web serving. 2.6.11-rc1-bk6 works well for a week or more but
> > leaks memory at about 16 Mb/day and needs reboot thereafter. :(
> >
> > Can you please give advice how to interpret this log:
>
> I'm quite certain I've hit this twice today. Can't be 100% sure but the
> bottom of the traces looked just like this and RIP was at
> log_do_checkpoint(). Once sporadically and another time trying to hit
> it. The workload both times was having two separate kernel trees, both
> were 2.6.12-rc2-mm something and then in both of the trees I did ketchup
> 2.6.12-rc3-mm3 in both trees more or less at the same time.
Could you please try the attached patch? It should fix the bug.
Stephen has thought off a better long-term fix but was busy with other
things last weeks so I didn't get to coding it yet...

> So then I ran two concurrent while loops ketchuping trees. Something
> like:
> #!/bin/sh
>
> while true
> do
> ~/bin/ketchup 2.6.11-rc3-mm2
> ~/bin/ketchup 2.6-mm
> ~/bin/ketchup 2.6.11
> ~/bin/ketchup 2.6.11-rc1
> done
>
>
> It took less than an hour to purposefully hit the bug with this here. No
> guarantees it will happen consistently however...
>
> This is a dual cpu x64 with preempt running 2.6.12-rc3. I'm
> attaching .config if it is of interest to anyone.

Honza
--
Jan Kara <jack@xxxxxxx>
SuSE CR Labs
Fix possible false assertion failure in JBD checkpointing code.

Signed-off-by: Jan Kara <jack@xxxxxxx>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.12-rc2/fs/jbd/checkpoint.c linux-2.6.12-rc2-checkpoint/fs/jbd/checkpoint.c
--- linux-2.6.12-rc2/fs/jbd/checkpoint.c 2005-03-03 18:58:29.000000000 +0100
+++ linux-2.6.12-rc2-checkpoint/fs/jbd/checkpoint.c 2005-04-05 13:26:42.000000000 +0200
@@ -339,8 +339,10 @@ int log_do_checkpoint(journal_t *journal
}
} while (jh != last_jh && !retry);

- if (batch_count)
+ if (batch_count) {
__flush_batch(journal, bhs, &batch_count);
+ retry = 1;
+ }

/*
* If someone cleaned up this transaction while we slept, we're