Re: [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks

From: Steven Rostedt
Date: Thu Mar 17 2005 - 04:27:58 EST

On Wed, 16 Mar 2005, Andrew Morton wrote:

> > I guess one way to solve this is to add a wait queue here (before
> > schedule()), and have the one holding the lock to wake up all on the
> > waitqueue when they release it.
>
> yup. A patch against mainline would be appropriate, please.
>

Hi Andrew,

Here's the patch against 2.6.11. I tested it by adding (after applying the
patch) global spinlocks for jbd_lock_bh_state and jbd_lock_bh_journal_head.
That way I have the same scenario as with Ingo's kernel, and I turned on
NEED_JOURNAL_STATE_WAIT. I'm still running that kernel, so it looks like
it works. Making those two locks global triggers the kjournald deadlock
much more quickly, and I don't need to run on an SMP machine (since my SMP
machines are currently being used for other tasks).
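
For reference, the test hack looked roughly like this (illustrative only,
not part of the patch below; the global lock name is made up):

static spinlock_t jbd_state_lock = SPIN_LOCK_UNLOCKED;	/* test-only */

static inline void jbd_lock_bh_state(struct buffer_head *bh)
{
	/* was: bit_spin_lock(BH_State, &bh->b_state); */
	spin_lock(&jbd_state_lock);
}

static inline int jbd_trylock_bh_state(struct buffer_head *bh)
{
	/* was: bit_spin_trylock(BH_State, &bh->b_state); */
	return spin_trylock(&jbd_state_lock);
}

static inline void jbd_unlock_bh_state(struct buffer_head *bh)
{
	/* was: bit_spin_unlock(BH_State, &bh->b_state); */
	spin_unlock(&jbd_state_lock);
}

(jbd_is_locked_bh_state() and the journal_head lock get the same treatment.)
With every buffer funneled through one lock, jbd_trylock_bh_state() fails
far more often, so the slow path in inverted_lock() gets exercised quickly.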

Some comments on my patch. I only implement the wait queue when
bit_spin_trylock is an actual lock (and thus can create the problem). I
didn't want to add this code where it isn't needed, i.e. when neither
CONFIG_SMP nor CONFIG_DEBUG_SPINLOCK is set. So in bit_spin_trylock, I
define NEED_JOURNAL_STATE_WAIT whenever bit_spin_trylock is really a lock.
When NEED_JOURNAL_STATE_WAIT is defined, the wait queue is set up in the
journal code.
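
For the record: without CONFIG_SMP and CONFIG_DEBUG_SPINLOCK,
bit_spin_trylock() reduces to roughly the following, so the wait queue
really would be dead code there:

static inline int bit_spin_trylock(int bitnum, unsigned long *addr)
{
	preempt_disable();
	/* the test_and_set_bit() branch is compiled out, so the
	 * "lock" always succeeds */
	return 1;
}

Since jbd_trylock_bh_state() can then never fail, inverted_lock() never
reaches the new code, and NEED_JOURNAL_STATE_WAIT stays undefined.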

Now the question is: should we make those two locks global? It would help
Ingo's cause (and mine as well), but I don't know the impact on a large
SMP configuration. Andrew, since you have a 16-way SMP machine, could you
(if you have time) try out the effect of that? If you do have time, I'll
send you a patch on top of this one that changes the two locks into
global spinlocks.

Ingo, where do you want to go from here? I guess we need to wait on what
Andrew decides.

-- Steve


diff -ur linux-2.6.11.orig/fs/jbd/commit.c linux-2.6.11/fs/jbd/commit.c
--- linux-2.6.11.orig/fs/jbd/commit.c 2005-03-02 02:38:25.000000000 -0500
+++ linux-2.6.11/fs/jbd/commit.c 2005-03-17 03:40:06.000000000 -0500
@@ -80,15 +80,33 @@
 
 /*
  * Try to acquire jbd_lock_bh_state() against the buffer, when j_list_lock is
- * held. For ranking reasons we must trylock. If we lose, schedule away and
- * return 0. j_list_lock is dropped in this case.
+ * held. For ranking reasons we must trylock. If we lose, put ourselves on a
+ * state wait queue and we'll be woken up when it is unlocked. Then we return
+ * 0 to try again. j_list_lock is dropped in this case.
  */
 static int inverted_lock(journal_t *journal, struct buffer_head *bh)
 {
 	if (!jbd_trylock_bh_state(bh)) {
+		/*
+		 * jbd_trylock_bh_state always returns true unless CONFIG_SMP or
+		 * CONFIG_DEBUG_SPINLOCK is set, so the wait queue is not needed there.
+		 * The bit_spin_locks in jbd_lock_bh_state need to be removed anyway.
+		 */
+#ifdef NEED_JOURNAL_STATE_WAIT
+		DECLARE_WAITQUEUE(wait, current);
 		spin_unlock(&journal->j_list_lock);
-		schedule();
+		add_wait_queue_exclusive(&journal_state_wait, &wait);
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		/* Check whether the lock was released in the meantime */
+		if (jbd_is_locked_bh_state(bh))
+			schedule();
+		set_current_state(TASK_RUNNING);
+		remove_wait_queue(&journal_state_wait, &wait);
 		return 0;
+#else
+		/* This should never be hit */
+		BUG();
+#endif
 	}
 	return 1;
 }
diff -ur linux-2.6.11.orig/fs/jbd/journal.c linux-2.6.11/fs/jbd/journal.c
--- linux-2.6.11.orig/fs/jbd/journal.c 2005-03-02 02:37:49.000000000 -0500
+++ linux-2.6.11/fs/jbd/journal.c 2005-03-17 03:47:40.000000000 -0500
@@ -80,6 +80,11 @@
 EXPORT_SYMBOL(journal_try_to_free_buffers);
 EXPORT_SYMBOL(journal_force_commit);
 
+#ifdef NEED_JOURNAL_STATE_WAIT
+EXPORT_SYMBOL(journal_state_wait);
+DECLARE_WAIT_QUEUE_HEAD(journal_state_wait);
+#endif
+
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 
 /*
diff -ur linux-2.6.11.orig/include/linux/jbd.h linux-2.6.11/include/linux/jbd.h
--- linux-2.6.11.orig/include/linux/jbd.h 2005-03-02 02:38:19.000000000 -0500
+++ linux-2.6.11/include/linux/jbd.h 2005-03-17 03:48:18.000000000 -0500
@@ -324,6 +324,20 @@
 	return bh->b_private;
 }
 
+#ifdef NEED_JOURNAL_STATE_WAIT
+/*
+ * journal_state_wait is a wait queue that tasks wait on if they fail
+ * to get jbd_lock_bh_state while holding j_list_lock. Instead of
+ * spinning via schedule(), a task now adds itself to this wait queue
+ * and is woken up when jbd_lock_bh_state is released.
+ *
+ * Since the bit_spin_locks are only real locks under CONFIG_SMP or
+ * CONFIG_DEBUG_SPINLOCK, this wait queue is only needed in those
+ * cases.
+ */
+extern wait_queue_head_t journal_state_wait;
+#endif
+
 static inline void jbd_lock_bh_state(struct buffer_head *bh)
 {
 	bit_spin_lock(BH_State, &bh->b_state);
@@ -342,6 +356,13 @@
 static inline void jbd_unlock_bh_state(struct buffer_head *bh)
 {
 	bit_spin_unlock(BH_State, &bh->b_state);
+#ifdef NEED_JOURNAL_STATE_WAIT
+	/*
+	 * There may be a task sleeping, waiting to be woken up
+	 * when this lock is released.
+	 */
+	wake_up(&journal_state_wait);
+#endif
 }
 
 static inline void jbd_lock_bh_journal_head(struct buffer_head *bh)
diff -ur linux-2.6.11.orig/include/linux/spinlock.h linux-2.6.11/include/linux/spinlock.h
--- linux-2.6.11.orig/include/linux/spinlock.h 2005-03-02 02:38:09.000000000 -0500
+++ linux-2.6.11/include/linux/spinlock.h 2005-03-17 03:39:13.024466071 -0500
@@ -527,6 +527,9 @@
  *
  * Don't use this unless you really need to: spin_lock() and spin_unlock()
  * are significantly faster.
+ *
+ * FIXME: These are evil and need to be removed. They are currently only
+ * used by the journal code of ext3.
  */
 static inline void bit_spin_lock(int bitnum, unsigned long *addr)
 {
@@ -557,6 +560,13 @@
 {
 	preempt_disable();
 #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+	/*
+	 * These bit locks are only used by ext3's journal code; when this
+	 * branch is compiled in they are real locks, so tell the journal
+	 * code it needs a wait queue to keep kjournald from spinning.
+	 */
+#define NEED_JOURNAL_STATE_WAIT
+
 	if (test_and_set_bit(bitnum, addr)) {
 		preempt_enable();
 		return 0;