[tree] latest kill-the-BKL tree, v12

From: Ingo Molnar
Date: Tue Apr 14 2009 - 05:02:28 EST



* Alexander Beregalov <a.beregalov@xxxxxxxxx> wrote:

> On Tue, Apr 14, 2009 at 05:34:22AM +0200, Frederic Weisbecker wrote:
> > Ingo,
> >
> > This small patchset fixes some deadlocks I hit while stress-testing
> > a reiserfs partition with dbench.
> >
> > Some work is still pending, such as adding checks to ensure we
> > _always_ release the lock before sleeping, as you suggested.
> > I also have to fix a lockdep warning reported by Alessio Igor Bogani,
> > and there are some optimizations left to do.
> >
> > Thanks,
> > Frederic.
> >
> > Frederic Weisbecker (3):
> > kill-the-BKL/reiserfs: provide a tool to lock only once the write lock
> > kill-the-BKL/reiserfs: lock only once in reiserfs_truncate_file
> > kill-the-BKL/reiserfs: only acquire the write lock once in
> > reiserfs_dirty_inode
> >
> > fs/reiserfs/inode.c | 10 +++++++---
> > fs/reiserfs/lock.c | 26 ++++++++++++++++++++++++++
> > fs/reiserfs/super.c | 15 +++++++++------
> > include/linux/reiserfs_fs.h | 2 ++
> > 4 files changed, 44 insertions(+), 9 deletions(-)
> >
>
> Hi
>
> The same test - dbench on reiserfs on loop on sparc64.
>
> [ INFO: possible circular locking dependency detected ]
> 2.6.30-rc1-00457-gb21597d-dirty #2

I'm wondering ... your version hash suggests you used vanilla
upstream as the base for your test. There's a string of other fixes
from Frederic in the tip:core/kill-the-BKL branch. Did you pick them
all up when you did your testing?

The most coherent way to test this would be to pick up the latest
core/kill-the-BKL git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git core/kill-the-BKL

Alternatively, try the combo patch below (against latest mainline).
The tree already includes the latest 3 fixes from Frederic as well,
so it should be a one-stop shop.

Thanks,

Ingo
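
For readers skimming the combo patch: one recurring idea in the reiserfs
part is reiserfs_mutex_lock_safe(), which takes a mutex that nests inside
the write lock by trylock-and-back-off instead of blocking. Below is a
rough userspace sketch of that idea using pthreads. It is not kernel code:
write_lock, dep_mutex, mutex_lock_safe(), worker() and run_demo() are
made-up names for illustration, and sched_yield() merely stands in for
the kernel's schedule().

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/*
 * Hypothetical stand-ins: write_lock plays the role of the per-superblock
 * write lock, dep_mutex a mutex that is only ever taken under it.
 */
static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t dep_mutex = PTHREAD_MUTEX_INITIALIZER;
static int shared_counter;

/*
 * Userspace analogue of the patch's helper. The caller holds 'outer'.
 * If 'm' is busy, drop 'outer' so that the holder of 'm' can retake it
 * and make progress, then retry.
 */
static void mutex_lock_safe(pthread_mutex_t *m, pthread_mutex_t *outer)
{
	while (pthread_mutex_trylock(m) != 0) {
		pthread_mutex_unlock(outer);
		sched_yield();
		pthread_mutex_lock(outer);
	}
}

static void *worker(void *arg)
{
	int rounds = *(int *)arg;
	int i;

	for (i = 0; i < rounds; i++) {
		pthread_mutex_lock(&write_lock);
		mutex_lock_safe(&dep_mutex, &write_lock);

		/*
		 * Relax the outer lock while still holding dep_mutex, the
		 * way the patch does around calls that may sleep. With a
		 * plain blocking lock on dep_mutex above, another thread
		 * could now hold write_lock and block on dep_mutex forever.
		 */
		pthread_mutex_unlock(&write_lock);
		sched_yield();
		pthread_mutex_lock(&write_lock);

		shared_counter++;	/* both locks are held here */

		pthread_mutex_unlock(&dep_mutex);
		pthread_mutex_unlock(&write_lock);
	}
	return NULL;
}

/* Run 'nthreads' workers (at most 8) and return the final counter. */
int run_demo(int nthreads, int rounds)
{
	pthread_t tid[8];
	int i, rc;

	assert(nthreads > 0 && nthreads <= 8);
	shared_counter = 0;

	for (i = 0; i < nthreads; i++) {
		rc = pthread_create(&tid[i], NULL, worker, &rounds);
		assert(rc == 0);
		(void)rc;
	}
	for (i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);

	return shared_counter;
}
```

Calling run_demo(4, 1000) should return 4000 once all workers finish.
Swapping mutex_lock_safe() for a plain pthread_mutex_lock() in worker()
reintroduces the A/B ordering problem described in the journal.c comment
further down in the patch.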

------------------>
Alessio Igor Bogani (17):
remove the BKL: Remove BKL from tracer registration
drivers/char/generic_nvram.c: Replace the BKL with a mutex
isofs: Remove BKL
kernel/sys.c: Replace the BKL with a mutex
sound/oss/au1550_ac97.c: Remove BKL
sound/oss/soundcard.c: Use &inode->i_mutex instead of the BKL
sound/sound_core.c: Use &inode->i_mutex instead of the BKL
drivers/bluetooth/hci_vhci.c: Use &inode->i_mutex instead of the BKL
sound/oss/vwsnd.c: Remove BKL
sound/core/sound.c: Use &inode->i_mutex instead of the BKL
drivers/char/nvram.c: Remove BKL
sound/oss/msnd_pinnacle.c: Use &inode->i_mutex instead of the BKL
drivers/char/nvram.c: Use &inode->i_mutex instead of the BKL
sound/core/info.c: Use &inode->i_mutex instead of the BKL
sound/oss/dmasound/dmasound_core.c: Use &inode->i_mutex instead of the BKL
remove the BKL: remove "BKL auto-drop" assumption from svc_recv()
remove the BKL: remove "BKL auto-drop" assumption from nfs3_rpc_wrapper()

Frederic Weisbecker (6):
reiserfs: kill-the-BKL
kill-the-BKL: fix missing #include smp_lock.h
reiserfs, kill-the-BKL: fix unsafe j_flush_mutex lock
kill-the-BKL/reiserfs: provide a tool to lock only once the write lock
kill-the-BKL/reiserfs: lock only once in reiserfs_truncate_file
kill-the-BKL/reiserfs: only acquire the write lock once in reiserfs_dirty_inode

Ingo Molnar (21):
revert ("BKL: revert back to the old spinlock implementation")
remove the BKL: change get_fs_type() BKL dependency
remove the BKL: reduce BKL locking during bootup
remove the BKL: restruct ->bd_mutex and BKL dependency
remove the BKL: change ext3 BKL assumption
remove the BKL: reduce misc_open() BKL dependency
remove the BKL: remove "BKL auto-drop" assumption from vt_waitactive()
remove the BKL: remove it from the core kernel!
softlockup helper: print BKL owner
remove the BKL: flush_workqueue() debug helper & fix
remove the BKL: tty updates
remove the BKL: lockdep self-test fix
remove the BKL: request_module() debug helper
remove the BKL: procfs debug helper and BKL elimination
remove the BKL: do not take the BKL in init code
remove the BKL: restructure NFS code
tty: fix BKL related leak and crash
remove the BKL: fix UP build
remove the BKL: use the BKL mutex on !SMP too
remove the BKL: merge fix
remove the BKL: fix build in fs/proc/generic.c


arch/mn10300/Kconfig | 11 +++
drivers/bluetooth/hci_vhci.c | 15 ++--
drivers/char/generic_nvram.c | 10 ++-
drivers/char/misc.c | 8 ++
drivers/char/nvram.c | 11 +--
drivers/char/tty_ldisc.c | 14 +++-
drivers/char/vt_ioctl.c | 8 ++
fs/block_dev.c | 4 +-
fs/ext3/super.c | 4 -
fs/filesystems.c | 14 ++++
fs/isofs/dir.c | 3 -
fs/isofs/inode.c | 4 -
fs/isofs/namei.c | 3 -
fs/isofs/rock.c | 3 -
fs/nfs/nfs3proc.c | 7 ++
fs/proc/generic.c | 7 ++-
fs/proc/root.c | 2 +
fs/reiserfs/Makefile | 2 +-
fs/reiserfs/bitmap.c | 2 +
fs/reiserfs/dir.c | 8 ++
fs/reiserfs/fix_node.c | 10 +++
fs/reiserfs/inode.c | 33 ++++++--
fs/reiserfs/ioctl.c | 6 +-
fs/reiserfs/journal.c | 136 +++++++++++++++++++++++++++--------
fs/reiserfs/lock.c | 89 ++++++++++++++++++++++
fs/reiserfs/resize.c | 2 +
fs/reiserfs/stree.c | 2 +
fs/reiserfs/super.c | 56 ++++++++++++--
include/linux/hardirq.h | 18 ++---
include/linux/reiserfs_fs.h | 14 ++-
include/linux/reiserfs_fs_sb.h | 9 ++
include/linux/smp_lock.h | 36 ++-------
init/Kconfig | 5 -
init/main.c | 7 +-
kernel/fork.c | 4 +
kernel/hung_task.c | 3 +
kernel/kmod.c | 22 ++++++
kernel/sched.c | 16 +----
kernel/softlockup.c | 1 +
kernel/sys.c | 15 ++--
kernel/trace/trace.c | 8 --
kernel/workqueue.c | 13 +++
lib/Makefile | 3 +-
lib/kernel_lock.c | 142 ++++++++++--------------------------
net/sunrpc/sched.c | 6 ++
net/sunrpc/svc_xprt.c | 13 +++
sound/core/info.c | 6 +-
sound/core/sound.c | 5 +-
sound/oss/au1550_ac97.c | 7 --
sound/oss/dmasound/dmasound_core.c | 14 ++--
sound/oss/msnd_pinnacle.c | 6 +-
sound/oss/soundcard.c | 33 +++++----
sound/oss/vwsnd.c | 3 -
sound/sound_core.c | 6 +-
54 files changed, 571 insertions(+), 318 deletions(-)
create mode 100644 fs/reiserfs/lock.c

diff --git a/arch/mn10300/Kconfig b/arch/mn10300/Kconfig
index 3559267..adeae17 100644
--- a/arch/mn10300/Kconfig
+++ b/arch/mn10300/Kconfig
@@ -186,6 +186,17 @@ config PREEMPT
Say Y here if you are building a kernel for a desktop, embedded
or real-time system. Say N if you are unsure.

+config PREEMPT_BKL
+ bool "Preempt The Big Kernel Lock"
+ depends on PREEMPT
+ default y
+ help
+ This option reduces the latency of the kernel by making the
+ big kernel lock preemptible.
+
+ Say Y here if you are building a kernel for a desktop system.
+ Say N if you are unsure.
+
config MN10300_CURRENT_IN_E2
bool "Hold current task address in E2 register"
default y
diff --git a/drivers/bluetooth/hci_vhci.c b/drivers/bluetooth/hci_vhci.c
index 0bbefba..28b0cb9 100644
--- a/drivers/bluetooth/hci_vhci.c
+++ b/drivers/bluetooth/hci_vhci.c
@@ -28,7 +28,7 @@
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/slab.h>
-#include <linux/smp_lock.h>
+#include <linux/mutex.h>
#include <linux/types.h>
#include <linux/errno.h>
#include <linux/sched.h>
@@ -259,11 +259,11 @@ static int vhci_open(struct inode *inode, struct file *file)
skb_queue_head_init(&data->readq);
init_waitqueue_head(&data->read_wait);

- lock_kernel();
+ mutex_lock(&inode->i_mutex);
hdev = hci_alloc_dev();
if (!hdev) {
kfree(data);
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return -ENOMEM;
}

@@ -284,12 +284,12 @@ static int vhci_open(struct inode *inode, struct file *file)
BT_ERR("Can't register HCI device");
kfree(data);
hci_free_dev(hdev);
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return -EBUSY;
}

file->private_data = data;
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);

return nonseekable_open(inode, file);
}
@@ -312,10 +312,11 @@ static int vhci_release(struct inode *inode, struct file *file)

static int vhci_fasync(int fd, struct file *file, int on)
{
+ struct inode *inode = file->f_path.dentry->d_inode;
struct vhci_data *data = file->private_data;
int err = 0;

- lock_kernel();
+ mutex_lock(&inode->i_mutex);
err = fasync_helper(fd, file, on, &data->fasync);
if (err < 0)
goto out;
@@ -326,7 +327,7 @@ static int vhci_fasync(int fd, struct file *file, int on)
data->flags &= ~VHCI_FASYNC;

out:
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return err;
}

diff --git a/drivers/char/generic_nvram.c b/drivers/char/generic_nvram.c
index a00869c..95d2653 100644
--- a/drivers/char/generic_nvram.c
+++ b/drivers/char/generic_nvram.c
@@ -19,7 +19,7 @@
#include <linux/miscdevice.h>
#include <linux/fcntl.h>
#include <linux/init.h>
-#include <linux/smp_lock.h>
+#include <linux/mutex.h>
#include <asm/uaccess.h>
#include <asm/nvram.h>
#ifdef CONFIG_PPC_PMAC
@@ -28,9 +28,11 @@

#define NVRAM_SIZE 8192

+static DEFINE_MUTEX(nvram_lock);
+
static loff_t nvram_llseek(struct file *file, loff_t offset, int origin)
{
- lock_kernel();
+ mutex_lock(&nvram_lock);
switch (origin) {
case 1:
offset += file->f_pos;
@@ -40,11 +42,11 @@ static loff_t nvram_llseek(struct file *file, loff_t offset, int origin)
break;
}
if (offset < 0) {
- unlock_kernel();
+ mutex_unlock(&nvram_lock);
return -EINVAL;
}
file->f_pos = offset;
- unlock_kernel();
+ mutex_unlock(&nvram_lock);
return file->f_pos;
}

diff --git a/drivers/char/misc.c b/drivers/char/misc.c
index a5e0db9..8194880 100644
--- a/drivers/char/misc.c
+++ b/drivers/char/misc.c
@@ -36,6 +36,7 @@
#include <linux/module.h>

#include <linux/fs.h>
+#include <linux/smp_lock.h>
#include <linux/errno.h>
#include <linux/miscdevice.h>
#include <linux/kernel.h>
@@ -130,8 +131,15 @@ static int misc_open(struct inode * inode, struct file * file)
}

if (!new_fops) {
+ int bkl = kernel_locked();
+
mutex_unlock(&misc_mtx);
+ if (bkl)
+ unlock_kernel();
request_module("char-major-%d-%d", MISC_MAJOR, minor);
+ if (bkl)
+ lock_kernel();
+
mutex_lock(&misc_mtx);

list_for_each_entry(c, &misc_list, list) {
diff --git a/drivers/char/nvram.c b/drivers/char/nvram.c
index 88cee40..bc6220b 100644
--- a/drivers/char/nvram.c
+++ b/drivers/char/nvram.c
@@ -38,7 +38,7 @@
#define NVRAM_VERSION "1.3"

#include <linux/module.h>
-#include <linux/smp_lock.h>
+#include <linux/mutex.h>
#include <linux/nvram.h>

#define PC 1
@@ -214,7 +214,9 @@ void nvram_set_checksum(void)

static loff_t nvram_llseek(struct file *file, loff_t offset, int origin)
{
- lock_kernel();
+ struct inode *inode = file->f_path.dentry->d_inode;
+
+ mutex_lock(&inode->i_mutex);
switch (origin) {
case 0:
/* nothing to do */
@@ -226,7 +228,7 @@ static loff_t nvram_llseek(struct file *file, loff_t offset, int origin)
offset += NVRAM_BYTES;
break;
}
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return (offset >= 0) ? (file->f_pos = offset) : -EINVAL;
}

@@ -331,14 +333,12 @@ static int nvram_ioctl(struct inode *inode, struct file *file,

static int nvram_open(struct inode *inode, struct file *file)
{
- lock_kernel();
spin_lock(&nvram_state_lock);

if ((nvram_open_cnt && (file->f_flags & O_EXCL)) ||
(nvram_open_mode & NVRAM_EXCL) ||
((file->f_mode & FMODE_WRITE) && (nvram_open_mode & NVRAM_WRITE))) {
spin_unlock(&nvram_state_lock);
- unlock_kernel();
return -EBUSY;
}

@@ -349,7 +349,6 @@ static int nvram_open(struct inode *inode, struct file *file)
nvram_open_cnt++;

spin_unlock(&nvram_state_lock);
- unlock_kernel();

return 0;
}
diff --git a/drivers/char/tty_ldisc.c b/drivers/char/tty_ldisc.c
index f78f5b0..1e20212 100644
--- a/drivers/char/tty_ldisc.c
+++ b/drivers/char/tty_ldisc.c
@@ -659,9 +659,19 @@ void tty_ldisc_release(struct tty_struct *tty, struct tty_struct *o_tty)

/*
* Wait for ->hangup_work and ->buf.work handlers to terminate
+ *
+ * It's safe to drop/reacquire the BKL here as
+ * flush_scheduled_work() can sleep anyway:
*/
-
- flush_scheduled_work();
+ {
+ int bkl = kernel_locked();
+
+ if (bkl)
+ unlock_kernel();
+ flush_scheduled_work();
+ if (bkl)
+ lock_kernel();
+ }

/*
* Wait for any short term users (we know they are just driver
diff --git a/drivers/char/vt_ioctl.c b/drivers/char/vt_ioctl.c
index a2dee0e..181ff38 100644
--- a/drivers/char/vt_ioctl.c
+++ b/drivers/char/vt_ioctl.c
@@ -1178,8 +1178,12 @@ static DECLARE_WAIT_QUEUE_HEAD(vt_activate_queue);
int vt_waitactive(int vt)
{
int retval;
+ int bkl = kernel_locked();
DECLARE_WAITQUEUE(wait, current);

+ if (bkl)
+ unlock_kernel();
+
add_wait_queue(&vt_activate_queue, &wait);
for (;;) {
retval = 0;
@@ -1205,6 +1209,10 @@ int vt_waitactive(int vt)
}
remove_wait_queue(&vt_activate_queue, &wait);
__set_current_state(TASK_RUNNING);
+
+ if (bkl)
+ lock_kernel();
+
return retval;
}

diff --git a/fs/block_dev.c b/fs/block_dev.c
index f45dbc1..e262527 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1318,8 +1318,8 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
struct gendisk *disk = bdev->bd_disk;
struct block_device *victim = NULL;

- mutex_lock_nested(&bdev->bd_mutex, for_part);
lock_kernel();
+ mutex_lock_nested(&bdev->bd_mutex, for_part);
if (for_part)
bdev->bd_part_count--;

@@ -1344,8 +1344,8 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
victim = bdev->bd_contains;
bdev->bd_contains = NULL;
}
- unlock_kernel();
mutex_unlock(&bdev->bd_mutex);
+ unlock_kernel();
bdput(bdev);
if (victim)
__blkdev_put(victim, mode, 1);
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 599dbfe..dc905f9 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1585,8 +1585,6 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
sbi->s_resgid = EXT3_DEF_RESGID;
sbi->s_sb_block = sb_block;

- unlock_kernel();
-
blocksize = sb_min_blocksize(sb, EXT3_MIN_BLOCK_SIZE);
if (!blocksize) {
printk(KERN_ERR "EXT3-fs: unable to set blocksize\n");
@@ -1993,7 +1991,6 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
"writeback");

- lock_kernel();
return 0;

cantfind_ext3:
@@ -2022,7 +2019,6 @@ failed_mount:
out_fail:
sb->s_fs_info = NULL;
kfree(sbi);
- lock_kernel();
return ret;
}

diff --git a/fs/filesystems.c b/fs/filesystems.c
index 1aa7026..1e8b492 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -13,7 +13,9 @@
#include <linux/slab.h>
#include <linux/kmod.h>
#include <linux/init.h>
+#include <linux/smp_lock.h>
#include <linux/module.h>
+
#include <asm/uaccess.h>

/*
@@ -256,12 +258,24 @@ module_init(proc_filesystems_init);
static struct file_system_type *__get_fs_type(const char *name, int len)
{
struct file_system_type *fs;
+ int bkl = kernel_locked();
+
+ /*
+ * We request a module that might trigger user-space
+ * tasks. So explicitly drop the BKL here:
+ */
+ if (bkl)
+ unlock_kernel();

read_lock(&file_systems_lock);
fs = *(find_filesystem(name, len));
if (fs && !try_module_get(fs->owner))
fs = NULL;
read_unlock(&file_systems_lock);
+
+ if (bkl)
+ lock_kernel();
+
return fs;
}

diff --git a/fs/isofs/dir.c b/fs/isofs/dir.c
index 2f0dc5a..263a697 100644
--- a/fs/isofs/dir.c
+++ b/fs/isofs/dir.c
@@ -10,7 +10,6 @@
*
* isofs directory handling functions
*/
-#include <linux/smp_lock.h>
#include "isofs.h"

int isofs_name_translate(struct iso_directory_record *de, char *new, struct inode *inode)
@@ -260,13 +259,11 @@ static int isofs_readdir(struct file *filp,
if (tmpname == NULL)
return -ENOMEM;

- lock_kernel();
tmpde = (struct iso_directory_record *) (tmpname+1024);

result = do_isofs_readdir(inode, filp, dirent, filldir, tmpname, tmpde);

free_page((unsigned long) tmpname);
- unlock_kernel();
return result;
}

diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index b4cbe96..708bbc7 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -17,7 +17,6 @@
#include <linux/slab.h>
#include <linux/nls.h>
#include <linux/ctype.h>
-#include <linux/smp_lock.h>
#include <linux/statfs.h>
#include <linux/cdrom.h>
#include <linux/parser.h>
@@ -955,8 +954,6 @@ int isofs_get_blocks(struct inode *inode, sector_t iblock_s,
int section, rv, error;
struct iso_inode_info *ei = ISOFS_I(inode);

- lock_kernel();
-
error = -EIO;
rv = 0;
if (iblock < 0 || iblock != iblock_s) {
@@ -1032,7 +1029,6 @@ int isofs_get_blocks(struct inode *inode, sector_t iblock_s,

error = 0;
abort:
- unlock_kernel();
return rv != 0 ? rv : error;
}

diff --git a/fs/isofs/namei.c b/fs/isofs/namei.c
index 8299889..36d6545 100644
--- a/fs/isofs/namei.c
+++ b/fs/isofs/namei.c
@@ -176,7 +176,6 @@ struct dentry *isofs_lookup(struct inode *dir, struct dentry *dentry, struct nam
if (!page)
return ERR_PTR(-ENOMEM);

- lock_kernel();
found = isofs_find_entry(dir, dentry,
&block, &offset,
page_address(page),
@@ -187,10 +186,8 @@ struct dentry *isofs_lookup(struct inode *dir, struct dentry *dentry, struct nam
if (found) {
inode = isofs_iget(dir->i_sb, block, offset);
if (IS_ERR(inode)) {
- unlock_kernel();
return ERR_CAST(inode);
}
}
- unlock_kernel();
return d_splice_alias(inode, dentry);
}
diff --git a/fs/isofs/rock.c b/fs/isofs/rock.c
index c2fb2dd..c3a883b 100644
--- a/fs/isofs/rock.c
+++ b/fs/isofs/rock.c
@@ -679,7 +679,6 @@ static int rock_ridge_symlink_readpage(struct file *file, struct page *page)

init_rock_state(&rs, inode);
block = ei->i_iget5_block;
- lock_kernel();
bh = sb_bread(inode->i_sb, block);
if (!bh)
goto out_noread;
@@ -749,7 +748,6 @@ repeat:
goto fail;
brelse(bh);
*rpnt = '\0';
- unlock_kernel();
SetPageUptodate(page);
kunmap(page);
unlock_page(page);
@@ -766,7 +764,6 @@ out_bad_span:
printk("symlink spans iso9660 blocks\n");
fail:
brelse(bh);
- unlock_kernel();
error:
SetPageError(page);
kunmap(page);
diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index d0cc5ce..d91047c 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -17,6 +17,7 @@
#include <linux/nfs_page.h>
#include <linux/lockd/bind.h>
#include <linux/nfs_mount.h>
+#include <linux/smp_lock.h>

#include "iostat.h"
#include "internal.h"
@@ -28,11 +29,17 @@ static int
nfs3_rpc_wrapper(struct rpc_clnt *clnt, struct rpc_message *msg, int flags)
{
int res;
+ int bkl = kernel_locked();
+
do {
res = rpc_call_sync(clnt, msg, flags);
if (res != -EJUKEBOX)
break;
+ if (bkl)
+ unlock_kernel();
schedule_timeout_killable(NFS_JUKEBOX_RETRY_TIME);
+ if (bkl)
+ lock_kernel();
res = -ERESTARTSYS;
} while (!fatal_signal_pending(current));
return res;
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index fa678ab..d472853 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -20,6 +20,7 @@
#include <linux/bitops.h>
#include <linux/spinlock.h>
#include <linux/completion.h>
+#include <linux/smp_lock.h>
#include <asm/uaccess.h>

#include "internal.h"
@@ -526,7 +527,7 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *filp, void *dirent,
}
ret = 1;
out:
- return ret;
+ return ret;
}

int proc_readdir(struct file *filp, void *dirent, filldir_t filldir)
@@ -707,6 +708,8 @@ struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode,
struct proc_dir_entry *ent;
nlink_t nlink;

+ WARN_ON_ONCE(kernel_locked());
+
if (S_ISDIR(mode)) {
if ((mode & S_IALLUGO) == 0)
mode |= S_IRUGO | S_IXUGO;
@@ -737,6 +740,8 @@ struct proc_dir_entry *proc_create_data(const char *name, mode_t mode,
struct proc_dir_entry *pde;
nlink_t nlink;

+ WARN_ON_ONCE(kernel_locked());
+
if (S_ISDIR(mode)) {
if ((mode & S_IALLUGO) == 0)
mode |= S_IRUGO | S_IXUGO;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 1e15a2b..702d32d 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -164,8 +164,10 @@ static int proc_root_readdir(struct file * filp,

if (nr < FIRST_PROCESS_ENTRY) {
int error = proc_readdir(filp, dirent, filldir);
+
if (error <= 0)
return error;
+
filp->f_pos = FIRST_PROCESS_ENTRY;
}

diff --git a/fs/reiserfs/Makefile b/fs/reiserfs/Makefile
index 7c5ab63..6a9e30c 100644
--- a/fs/reiserfs/Makefile
+++ b/fs/reiserfs/Makefile
@@ -7,7 +7,7 @@ obj-$(CONFIG_REISERFS_FS) += reiserfs.o
reiserfs-objs := bitmap.o do_balan.o namei.o inode.o file.o dir.o fix_node.o \
super.o prints.o objectid.o lbalance.o ibalance.o stree.o \
hashes.o tail_conversion.o journal.o resize.o \
- item_ops.o ioctl.o procfs.o xattr.o
+ item_ops.o ioctl.o procfs.o xattr.o lock.o

ifeq ($(CONFIG_REISERFS_FS_XATTR),y)
reiserfs-objs += xattr_user.o xattr_trusted.o
diff --git a/fs/reiserfs/bitmap.c b/fs/reiserfs/bitmap.c
index e716161..1470334 100644
--- a/fs/reiserfs/bitmap.c
+++ b/fs/reiserfs/bitmap.c
@@ -1256,7 +1256,9 @@ struct buffer_head *reiserfs_read_bitmap_block(struct super_block *sb,
else {
if (buffer_locked(bh)) {
PROC_INFO_INC(sb, scan_bitmap.wait);
+ reiserfs_write_unlock(sb);
__wait_on_buffer(bh);
+ reiserfs_write_lock(sb);
}
BUG_ON(!buffer_uptodate(bh));
BUG_ON(atomic_read(&bh->b_count) == 0);
diff --git a/fs/reiserfs/dir.c b/fs/reiserfs/dir.c
index 67a80d7..6d71aa0 100644
--- a/fs/reiserfs/dir.c
+++ b/fs/reiserfs/dir.c
@@ -174,14 +174,22 @@ int reiserfs_readdir_dentry(struct dentry *dentry, void *dirent,
// user space buffer is swapped out. At that time
// entry can move to somewhere else
memcpy(local_buf, d_name, d_reclen);
+
+ /*
+ * Since filldir might sleep, we can release
+ * the write lock here for other waiters
+ */
+ reiserfs_write_unlock(inode->i_sb);
if (filldir
(dirent, local_buf, d_reclen, d_off, d_ino,
DT_UNKNOWN) < 0) {
+ reiserfs_write_lock(inode->i_sb);
if (local_buf != small_buf) {
kfree(local_buf);
}
goto end;
}
+ reiserfs_write_lock(inode->i_sb);
if (local_buf != small_buf) {
kfree(local_buf);
}
diff --git a/fs/reiserfs/fix_node.c b/fs/reiserfs/fix_node.c
index 5e5a4e6..bf5f2cb 100644
--- a/fs/reiserfs/fix_node.c
+++ b/fs/reiserfs/fix_node.c
@@ -1022,7 +1022,11 @@ static int get_far_parent(struct tree_balance *tb,
/* Check whether the common parent is locked. */

if (buffer_locked(*pcom_father)) {
+
+ /* Release the write lock while the buffer is busy */
+ reiserfs_write_unlock(tb->tb_sb);
__wait_on_buffer(*pcom_father);
+ reiserfs_write_lock(tb->tb_sb);
if (FILESYSTEM_CHANGED_TB(tb)) {
brelse(*pcom_father);
return REPEAT_SEARCH;
@@ -1927,7 +1931,9 @@ static int get_direct_parent(struct tree_balance *tb, int h)
return REPEAT_SEARCH;

if (buffer_locked(bh)) {
+ reiserfs_write_unlock(tb->tb_sb);
__wait_on_buffer(bh);
+ reiserfs_write_lock(tb->tb_sb);
if (FILESYSTEM_CHANGED_TB(tb))
return REPEAT_SEARCH;
}
@@ -2278,7 +2284,9 @@ static int wait_tb_buffers_until_unlocked(struct tree_balance *tb)
REPEAT_SEARCH : CARRY_ON;
}
#endif
+ reiserfs_write_unlock(tb->tb_sb);
__wait_on_buffer(locked);
+ reiserfs_write_lock(tb->tb_sb);
if (FILESYSTEM_CHANGED_TB(tb))
return REPEAT_SEARCH;
}
@@ -2349,7 +2357,9 @@ int fix_nodes(int op_mode, struct tree_balance *tb,

/* if it possible in indirect_to_direct conversion */
if (buffer_locked(tbS0)) {
+ reiserfs_write_unlock(tb->tb_sb);
__wait_on_buffer(tbS0);
+ reiserfs_write_lock(tb->tb_sb);
if (FILESYSTEM_CHANGED_TB(tb))
return REPEAT_SEARCH;
}
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index 6fd0f47..153668e 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -489,10 +489,14 @@ static int reiserfs_get_blocks_direct_io(struct inode *inode,
disappeared */
if (REISERFS_I(inode)->i_flags & i_pack_on_close_mask) {
int err;
- lock_kernel();
+
+ reiserfs_write_lock(inode->i_sb);
+
err = reiserfs_commit_for_inode(inode);
REISERFS_I(inode)->i_flags &= ~i_pack_on_close_mask;
- unlock_kernel();
+
+ reiserfs_write_unlock(inode->i_sb);
+
if (err < 0)
ret = err;
}
@@ -616,7 +620,6 @@ int reiserfs_get_block(struct inode *inode, sector_t block,
loff_t new_offset =
(((loff_t) block) << inode->i_sb->s_blocksize_bits) + 1;

- /* bad.... */
reiserfs_write_lock(inode->i_sb);
version = get_inode_item_key_version(inode);

@@ -997,10 +1000,14 @@ int reiserfs_get_block(struct inode *inode, sector_t block,
if (retval)
goto failure;
}
- /* inserting indirect pointers for a hole can take a
- ** long time. reschedule if needed
+ /*
+ * inserting indirect pointers for a hole can take a
+ * long time. reschedule if needed and also release the write
+ * lock for others.
*/
+ reiserfs_write_unlock(inode->i_sb);
cond_resched();
+ reiserfs_write_lock(inode->i_sb);

retval = search_for_position_by_key(inode->i_sb, &key, &path);
if (retval == IO_ERROR) {
@@ -2076,8 +2083,9 @@ int reiserfs_truncate_file(struct inode *inode, int update_timestamps)
int error;
struct buffer_head *bh = NULL;
int err2;
+ int lock_depth;

- reiserfs_write_lock(inode->i_sb);
+ lock_depth = reiserfs_write_lock_once(inode->i_sb);

if (inode->i_size > 0) {
error = grab_tail_page(inode, &page, &bh);
@@ -2146,14 +2154,17 @@ int reiserfs_truncate_file(struct inode *inode, int update_timestamps)
page_cache_release(page);
}

- reiserfs_write_unlock(inode->i_sb);
+ reiserfs_write_unlock_once(inode->i_sb, lock_depth);
+
return 0;
out:
if (page) {
unlock_page(page);
page_cache_release(page);
}
- reiserfs_write_unlock(inode->i_sb);
+
+ reiserfs_write_unlock_once(inode->i_sb, lock_depth);
+
return error;
}

@@ -2612,7 +2623,10 @@ int reiserfs_prepare_write(struct file *f, struct page *page,
int ret;
int old_ref = 0;

+ reiserfs_write_unlock(inode->i_sb);
reiserfs_wait_on_write_block(inode->i_sb);
+ reiserfs_write_lock(inode->i_sb);
+
fix_tail_page_for_writing(page);
if (reiserfs_transaction_running(inode->i_sb)) {
struct reiserfs_transaction_handle *th;
@@ -2762,7 +2776,10 @@ int reiserfs_commit_write(struct file *f, struct page *page,
int update_sd = 0;
struct reiserfs_transaction_handle *th = NULL;

+ reiserfs_write_unlock(inode->i_sb);
reiserfs_wait_on_write_block(inode->i_sb);
+ reiserfs_write_lock(inode->i_sb);
+
if (reiserfs_transaction_running(inode->i_sb)) {
th = current->journal_info;
}
diff --git a/fs/reiserfs/ioctl.c b/fs/reiserfs/ioctl.c
index 0ccc3fd..5e40b0c 100644
--- a/fs/reiserfs/ioctl.c
+++ b/fs/reiserfs/ioctl.c
@@ -141,9 +141,11 @@ long reiserfs_compat_ioctl(struct file *file, unsigned int cmd,
default:
return -ENOIOCTLCMD;
}
- lock_kernel();
+
+ reiserfs_write_lock(inode->i_sb);
ret = reiserfs_ioctl(inode, file, cmd, (unsigned long) compat_ptr(arg));
- unlock_kernel();
+ reiserfs_write_unlock(inode->i_sb);
+
return ret;
}
#endif
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 77f5bb7..7976d7d 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -429,21 +429,6 @@ static void clear_prepared_bits(struct buffer_head *bh)
clear_buffer_journal_restore_dirty(bh);
}

-/* utility function to force a BUG if it is called without the big
-** kernel lock held. caller is the string printed just before calling BUG()
-*/
-void reiserfs_check_lock_depth(struct super_block *sb, char *caller)
-{
-#ifdef CONFIG_SMP
- if (current->lock_depth < 0) {
- reiserfs_panic(sb, "journal-1", "%s called without kernel "
- "lock held", caller);
- }
-#else
- ;
-#endif
-}
-
/* return a cnode with same dev, block number and size in table, or null if not found */
static inline struct reiserfs_journal_cnode *get_journal_hash_dev(struct
super_block
@@ -552,11 +537,48 @@ static inline void insert_journal_hash(struct reiserfs_journal_cnode **table,
journal_hash(table, cn->sb, cn->blocknr) = cn;
}

+/*
+ * Several mutexes depend on the write lock.
+ * However, sometimes we want to relax the write lock while we hold
+ * these mutexes, mirroring the release/reacquire on schedule()
+ * behaviour of the BKL that was used before.
+ * Reiserfs performance and locking were based on this scheme.
+ * Now that the write lock is a mutex and not the BKL anymore, doing so
+ * may result in a deadlock:
+ *
+ * A acquires write_lock
+ * A acquires j_commit_mutex
+ * A releases write_lock and waits for something
+ * B acquires write_lock
+ * B can't acquire j_commit_mutex and sleeps
+ * A can't acquire write lock anymore
+ * deadlock
+ *
+ * What we do here is avoid such a deadlock by playing the same game
+ * as the BKL: if we can't acquire a mutex that depends on the write lock,
+ * we release the write lock, wait a bit and then retry.
+ *
+ * The mutexes concerned by this hack are:
+ * - The commit mutex of a journal list
+ * - The flush mutex
+ * - The journal lock
+ */
+static inline void reiserfs_mutex_lock_safe(struct mutex *m,
+ struct super_block *s)
+{
+ while (!mutex_trylock(m)) {
+ reiserfs_write_unlock(s);
+ schedule();
+ reiserfs_write_lock(s);
+ }
+}
+
/* lock the current transaction */
static inline void lock_journal(struct super_block *sb)
{
PROC_INFO_INC(sb, journal.lock_journal);
- mutex_lock(&SB_JOURNAL(sb)->j_mutex);
+
+ reiserfs_mutex_lock_safe(&SB_JOURNAL(sb)->j_mutex, sb);
}

/* unlock the current transaction */
@@ -708,7 +730,9 @@ static void check_barrier_completion(struct super_block *s,
disable_barrier(s);
set_buffer_uptodate(bh);
set_buffer_dirty(bh);
+ reiserfs_write_unlock(s);
sync_dirty_buffer(bh);
+ reiserfs_write_lock(s);
}
}

@@ -996,8 +1020,13 @@ static int reiserfs_async_progress_wait(struct super_block *s)
{
DEFINE_WAIT(wait);
struct reiserfs_journal *j = SB_JOURNAL(s);
- if (atomic_read(&j->j_async_throttle))
+
+ if (atomic_read(&j->j_async_throttle)) {
+ reiserfs_write_unlock(s);
congestion_wait(WRITE, HZ / 10);
+ reiserfs_write_lock(s);
+ }
+
return 0;
}

@@ -1043,7 +1072,8 @@ static int flush_commit_list(struct super_block *s,
}

/* make sure nobody is trying to flush this one at the same time */
- mutex_lock(&jl->j_commit_mutex);
+ reiserfs_mutex_lock_safe(&jl->j_commit_mutex, s);
+
if (!journal_list_still_alive(s, trans_id)) {
mutex_unlock(&jl->j_commit_mutex);
goto put_jl;
@@ -1061,12 +1091,17 @@ static int flush_commit_list(struct super_block *s,

if (!list_empty(&jl->j_bh_list)) {
int ret;
- unlock_kernel();
+
+ /*
+ * We might sleep in numerous places inside
+ * write_ordered_buffers. Relax the write lock.
+ */
+ reiserfs_write_unlock(s);
ret = write_ordered_buffers(&journal->j_dirty_buffers_lock,
journal, jl, &jl->j_bh_list);
if (ret < 0 && retval == 0)
retval = ret;
- lock_kernel();
+ reiserfs_write_lock(s);
}
BUG_ON(!list_empty(&jl->j_bh_list));
/*
@@ -1114,12 +1149,19 @@ static int flush_commit_list(struct super_block *s,
bn = SB_ONDISK_JOURNAL_1st_BLOCK(s) +
(jl->j_start + i) % SB_ONDISK_JOURNAL_SIZE(s);
tbh = journal_find_get_block(s, bn);
+
+ reiserfs_write_unlock(s);
wait_on_buffer(tbh);
+ reiserfs_write_lock(s);
// since we're using ll_rw_blk above, it might have skipped over
// a locked buffer. Double check here
//
- if (buffer_dirty(tbh)) /* redundant, sync_dirty_buffer() checks */
+ /* redundant, sync_dirty_buffer() checks */
+ if (buffer_dirty(tbh)) {
+ reiserfs_write_unlock(s);
sync_dirty_buffer(tbh);
+ reiserfs_write_lock(s);
+ }
if (unlikely(!buffer_uptodate(tbh))) {
#ifdef CONFIG_REISERFS_CHECK
reiserfs_warning(s, "journal-601",
@@ -1143,10 +1185,15 @@ static int flush_commit_list(struct super_block *s,
if (buffer_dirty(jl->j_commit_bh))
BUG();
mark_buffer_dirty(jl->j_commit_bh) ;
+ reiserfs_write_unlock(s);
sync_dirty_buffer(jl->j_commit_bh) ;
+ reiserfs_write_lock(s);
}
- } else
+ } else {
+ reiserfs_write_unlock(s);
wait_on_buffer(jl->j_commit_bh);
+ reiserfs_write_lock(s);
+ }

check_barrier_completion(s, jl->j_commit_bh);

@@ -1286,7 +1333,9 @@ static int _update_journal_header_block(struct super_block *sb,

if (trans_id >= journal->j_last_flush_trans_id) {
if (buffer_locked((journal->j_header_bh))) {
+ reiserfs_write_unlock(sb);
wait_on_buffer((journal->j_header_bh));
+ reiserfs_write_lock(sb);
if (unlikely(!buffer_uptodate(journal->j_header_bh))) {
#ifdef CONFIG_REISERFS_CHECK
reiserfs_warning(sb, "journal-699",
@@ -1312,12 +1361,16 @@ static int _update_journal_header_block(struct super_block *sb,
disable_barrier(sb);
goto sync;
}
+ reiserfs_write_unlock(sb);
wait_on_buffer(journal->j_header_bh);
+ reiserfs_write_lock(sb);
check_barrier_completion(sb, journal->j_header_bh);
} else {
sync:
set_buffer_dirty(journal->j_header_bh);
+ reiserfs_write_unlock(sb);
sync_dirty_buffer(journal->j_header_bh);
+ reiserfs_write_lock(sb);
}
if (!buffer_uptodate(journal->j_header_bh)) {
reiserfs_warning(sb, "journal-837",
@@ -1409,7 +1462,7 @@ static int flush_journal_list(struct super_block *s,

/* if flushall == 0, the lock is already held */
if (flushall) {
- mutex_lock(&journal->j_flush_mutex);
+ reiserfs_mutex_lock_safe(&journal->j_flush_mutex, s);
} else if (mutex_trylock(&journal->j_flush_mutex)) {
BUG();
}
@@ -1553,7 +1606,11 @@ static int flush_journal_list(struct super_block *s,
reiserfs_panic(s, "journal-1011",
"cn->bh is NULL");
}
+
+ reiserfs_write_unlock(s);
wait_on_buffer(cn->bh);
+ reiserfs_write_lock(s);
+
if (!cn->bh) {
reiserfs_panic(s, "journal-1012",
"cn->bh is NULL");
@@ -1769,7 +1826,7 @@ static int kupdate_transactions(struct super_block *s,
struct reiserfs_journal *journal = SB_JOURNAL(s);
chunk.nr = 0;

- mutex_lock(&journal->j_flush_mutex);
+ reiserfs_mutex_lock_safe(&journal->j_flush_mutex, s);
if (!journal_list_still_alive(s, orig_trans_id)) {
goto done;
}
@@ -1973,11 +2030,19 @@ static int do_journal_release(struct reiserfs_transaction_handle *th,
reiserfs_mounted_fs_count--;
/* wait for all commits to finish */
cancel_delayed_work(&SB_JOURNAL(sb)->j_work);
+
+ /*
+ * We must release the write lock here because
+ * the workqueue job (flush_async_commits) needs this lock
+ */
+ reiserfs_write_unlock(sb);
flush_workqueue(commit_wq);
+
if (!reiserfs_mounted_fs_count) {
destroy_workqueue(commit_wq);
commit_wq = NULL;
}
+ reiserfs_write_lock(sb);

free_journal_ram(sb);

@@ -2243,7 +2308,11 @@ static int journal_read_transaction(struct super_block *sb,
/* read in the log blocks, memcpy to the corresponding real block */
ll_rw_block(READ, get_desc_trans_len(desc), log_blocks);
for (i = 0; i < get_desc_trans_len(desc); i++) {
+
+ reiserfs_write_unlock(sb);
wait_on_buffer(log_blocks[i]);
+ reiserfs_write_lock(sb);
+
if (!buffer_uptodate(log_blocks[i])) {
reiserfs_warning(sb, "journal-1212",
"REPLAY FAILURE fsck required! "
@@ -2964,8 +3033,11 @@ static void queue_log_writer(struct super_block *s)
init_waitqueue_entry(&wait, current);
add_wait_queue(&journal->j_join_wait, &wait);
set_current_state(TASK_UNINTERRUPTIBLE);
- if (test_bit(J_WRITERS_QUEUED, &journal->j_state))
+ if (test_bit(J_WRITERS_QUEUED, &journal->j_state)) {
+ reiserfs_write_unlock(s);
schedule();
+ reiserfs_write_lock(s);
+ }
__set_current_state(TASK_RUNNING);
remove_wait_queue(&journal->j_join_wait, &wait);
}
@@ -2982,7 +3054,9 @@ static void let_transaction_grow(struct super_block *sb, unsigned int trans_id)
struct reiserfs_journal *journal = SB_JOURNAL(sb);
unsigned long bcount = journal->j_bcount;
while (1) {
+ reiserfs_write_unlock(sb);
schedule_timeout_uninterruptible(1);
+ reiserfs_write_lock(sb);
journal->j_current_jl->j_state |= LIST_COMMIT_PENDING;
while ((atomic_read(&journal->j_wcount) > 0 ||
atomic_read(&journal->j_jlock)) &&
@@ -3033,7 +3107,9 @@ static int do_journal_begin_r(struct reiserfs_transaction_handle *th,

if (test_bit(J_WRITERS_BLOCKED, &journal->j_state)) {
unlock_journal(sb);
+ reiserfs_write_unlock(sb);
reiserfs_wait_on_write_block(sb);
+ reiserfs_write_lock(sb);
PROC_INFO_INC(sb, journal.journal_relock_writers);
goto relock;
}
@@ -3506,14 +3582,14 @@ static void flush_async_commits(struct work_struct *work)
struct reiserfs_journal_list *jl;
struct list_head *entry;

- lock_kernel();
+ reiserfs_write_lock(sb);
if (!list_empty(&journal->j_journal_list)) {
/* last entry is the youngest, commit it and you get everything */
entry = journal->j_journal_list.prev;
jl = JOURNAL_LIST_ENTRY(entry);
flush_commit_list(sb, jl, 1);
}
- unlock_kernel();
+ reiserfs_write_unlock(sb);
}

/*
@@ -4041,7 +4117,7 @@ static int do_journal_end(struct reiserfs_transaction_handle *th,
* the new transaction is fully setup, and we've already flushed the
* ordered bh list
*/
- mutex_lock(&jl->j_commit_mutex);
+ reiserfs_mutex_lock_safe(&jl->j_commit_mutex, sb);

/* save the transaction id in case we need to commit it later */
commit_trans_id = jl->j_trans_id;
@@ -4203,10 +4279,10 @@ static int do_journal_end(struct reiserfs_transaction_handle *th,
* is lost.
*/
if (!list_empty(&jl->j_tail_bh_list)) {
- unlock_kernel();
+ reiserfs_write_unlock(sb);
write_ordered_buffers(&journal->j_dirty_buffers_lock,
journal, jl, &jl->j_tail_bh_list);
- lock_kernel();
+ reiserfs_write_lock(sb);
}
BUG_ON(!list_empty(&jl->j_tail_bh_list));
mutex_unlock(&jl->j_commit_mutex);
diff --git a/fs/reiserfs/lock.c b/fs/reiserfs/lock.c
new file mode 100644
index 0000000..cb1bba3
--- /dev/null
+++ b/fs/reiserfs/lock.c
@@ -0,0 +1,89 @@
+#include <linux/reiserfs_fs.h>
+#include <linux/mutex.h>
+
+/*
+ * The previous reiserfs locking scheme was heavily based on
+ * the tricky properties of the BKL:
+ *
+ * - it could be acquired recursively by the same task
+ * - performance relied on the release-while-schedule() property
+ *
+ * Now that we replace it with a mutex, we still want to keep the same
+ * recursive property to avoid big changes in the code structure.
+ * We use our own lock_owner here because the owner field on a mutex
+ * is only available with SMP or mutex debugging. Also, we only need this
+ * field for this mutex, so there is no need for a system-wide facility.
+ *
+ * Also, this lock is often released before a call that could block because
+ * reiserfs performance was partially based on the release-while-schedule()
+ * property of the BKL.
+ */
+void reiserfs_write_lock(struct super_block *s)
+{
+ struct reiserfs_sb_info *sb_i = REISERFS_SB(s);
+
+ if (sb_i->lock_owner != current) {
+ mutex_lock(&sb_i->lock);
+ sb_i->lock_owner = current;
+ }
+
+ /* No need to protect it, only the current task touches it */
+ sb_i->lock_depth++;
+}
+
+void reiserfs_write_unlock(struct super_block *s)
+{
+ struct reiserfs_sb_info *sb_i = REISERFS_SB(s);
+
+ /*
+ * Are we unlocking without even holding the lock?
+ * Such a situation could even warrant a BUG() if we don't
+ * want the data to become corrupted
+ */
+ WARN_ONCE(sb_i->lock_owner != current,
+ "Superblock write lock imbalance");
+
+ if (--sb_i->lock_depth == -1) {
+ sb_i->lock_owner = NULL;
+ mutex_unlock(&sb_i->lock);
+ }
+}
+
+/*
+ * If we already own the lock, just exit and don't increase the depth.
+ * Useful when we don't want to lock more than once.
+ *
+ * We always return the lock_depth we had before calling
+ * this function.
+ */
+int reiserfs_write_lock_once(struct super_block *s)
+{
+ struct reiserfs_sb_info *sb_i = REISERFS_SB(s);
+
+ if (sb_i->lock_owner != current) {
+ mutex_lock(&sb_i->lock);
+ sb_i->lock_owner = current;
+ return sb_i->lock_depth++;
+ }
+
+ return sb_i->lock_depth;
+}
+
+void reiserfs_write_unlock_once(struct super_block *s, int lock_depth)
+{
+ if (lock_depth == -1)
+ reiserfs_write_unlock(s);
+}
+
+/*
+ * Utility function to force a BUG if it is called without the superblock
+ * write lock held. caller is the string printed just before calling BUG()
+ */
+void reiserfs_check_lock_depth(struct super_block *sb, char *caller)
+{
+ struct reiserfs_sb_info *sb_i = REISERFS_SB(sb);
+
+ if (sb_i->lock_depth < 0)
+ reiserfs_panic(sb, "lock-depth", "%s called without write lock held %d",
+ caller, sb_i->lock_depth);
+}
diff --git a/fs/reiserfs/resize.c b/fs/reiserfs/resize.c
index 238e9d9..6a7bfb3 100644
--- a/fs/reiserfs/resize.c
+++ b/fs/reiserfs/resize.c
@@ -142,7 +142,9 @@ int reiserfs_resize(struct super_block *s, unsigned long block_count_new)

set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
+ reiserfs_write_unlock(s);
sync_dirty_buffer(bh);
+ reiserfs_write_lock(s);
// update bitmap_info stuff
bitmap[i].free_count = sb_blocksize(sb) * 8 - 1;
brelse(bh);
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index d036ee5..6bd99a9 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -629,7 +629,9 @@ int search_by_key(struct super_block *sb, const struct cpu_key *key, /* Key to s
search_by_key_reada(sb, reada_bh,
reada_blocks, reada_count);
ll_rw_block(READ, 1, &bh);
+ reiserfs_write_unlock(sb);
wait_on_buffer(bh);
+ reiserfs_write_lock(sb);
if (!buffer_uptodate(bh))
goto io_error;
} else {
diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
index 0ae6486..f6c5606 100644
--- a/fs/reiserfs/super.c
+++ b/fs/reiserfs/super.c
@@ -470,6 +470,13 @@ static void reiserfs_put_super(struct super_block *s)
struct reiserfs_transaction_handle th;
th.t_trans_id = 0;

+ /*
+ * We didn't need to explicitly lock here before, because put_super
+ * is called with the bkl held.
+ * Now that we have our own lock, we must explicitly lock.
+ */
+ reiserfs_write_lock(s);
+
/* change file system state to current state if it was mounted with read-write permissions */
if (!(s->s_flags & MS_RDONLY)) {
if (!journal_begin(&th, s, 10)) {
@@ -499,6 +506,8 @@ static void reiserfs_put_super(struct super_block *s)

reiserfs_proc_info_done(s);

+ reiserfs_write_unlock(s);
+ mutex_destroy(&REISERFS_SB(s)->lock);
kfree(s->s_fs_info);
s->s_fs_info = NULL;

@@ -558,25 +567,28 @@ static void reiserfs_dirty_inode(struct inode *inode)
struct reiserfs_transaction_handle th;

int err = 0;
+ int lock_depth;
+
if (inode->i_sb->s_flags & MS_RDONLY) {
reiserfs_warning(inode->i_sb, "clm-6006",
"writing inode %lu on readonly FS",
inode->i_ino);
return;
}
- reiserfs_write_lock(inode->i_sb);
+ lock_depth = reiserfs_write_lock_once(inode->i_sb);

/* this is really only used for atime updates, so they don't have
** to be included in O_SYNC or fsync
*/
err = journal_begin(&th, inode->i_sb, 1);
- if (err) {
- reiserfs_write_unlock(inode->i_sb);
- return;
- }
+ if (err)
+ goto out;
+
reiserfs_update_sd(&th, inode);
journal_end(&th, inode->i_sb, 1);
- reiserfs_write_unlock(inode->i_sb);
+
+out:
+ reiserfs_write_unlock_once(inode->i_sb, lock_depth);
}

#ifdef CONFIG_REISERFS_FS_POSIX_ACL
@@ -1191,7 +1203,15 @@ static int reiserfs_remount(struct super_block *s, int *mount_flags, char *arg)
unsigned int qfmt = 0;
#ifdef CONFIG_QUOTA
int i;
+#endif
+
+ /*
+ * We used to protect using the implicitly acquired bkl here.
+ * Now we must explicitly acquire our own lock
+ */
+ reiserfs_write_lock(s);

+#ifdef CONFIG_QUOTA
memcpy(qf_names, REISERFS_SB(s)->s_qf_names, sizeof(qf_names));
#endif

@@ -1316,11 +1336,13 @@ static int reiserfs_remount(struct super_block *s, int *mount_flags, char *arg)
}

out_ok:
+ reiserfs_write_unlock(s);
kfree(s->s_options);
s->s_options = new_opts;
return 0;

out_err:
+ reiserfs_write_unlock(s);
kfree(new_opts);
return err;
}
@@ -1425,7 +1447,9 @@ static int read_super_block(struct super_block *s, int offset)
static int reread_meta_blocks(struct super_block *s)
{
ll_rw_block(READ, 1, &(SB_BUFFER_WITH_SB(s)));
+ reiserfs_write_unlock(s);
wait_on_buffer(SB_BUFFER_WITH_SB(s));
+ reiserfs_write_lock(s);
if (!buffer_uptodate(SB_BUFFER_WITH_SB(s))) {
reiserfs_warning(s, "reiserfs-2504", "error reading the super");
return 1;
@@ -1634,7 +1658,7 @@ static int reiserfs_fill_super(struct super_block *s, void *data, int silent)
sbi = kzalloc(sizeof(struct reiserfs_sb_info), GFP_KERNEL);
if (!sbi) {
errval = -ENOMEM;
- goto error;
+ goto error_alloc;
}
s->s_fs_info = sbi;
/* Set default values for options: non-aggressive tails, RO on errors */
@@ -1648,6 +1672,20 @@ static int reiserfs_fill_super(struct super_block *s, void *data, int silent)
/* setup default block allocator options */
reiserfs_init_alloc_options(s);

+ mutex_init(&REISERFS_SB(s)->lock);
+ REISERFS_SB(s)->lock_depth = -1;
+
+ /*
+ * This function is called with the bkl held, which was also the old
+ * locking used here.
+ * do_journal_begin() will soon check whether we hold the lock (ie: what
+ * was the bkl). This check is likely there because do_journal_begin()
+ * has several other callers; at this point, it doesn't seem necessary
+ * to protect against anything.
+ * Anyway, let's be conservative and lock for now.
+ */
+ reiserfs_write_lock(s);
+
jdev_name = NULL;
if (reiserfs_parse_options
(s, (char *)data, &(sbi->s_mount_opt), &blocks, &jdev_name,
@@ -1871,9 +1909,13 @@ static int reiserfs_fill_super(struct super_block *s, void *data, int silent)
init_waitqueue_head(&(sbi->s_wait));
spin_lock_init(&sbi->bitmap_lock);

+ reiserfs_write_unlock(s);
+
return (0);

error:
+ reiserfs_write_unlock(s);
+error_alloc:
if (jinit_done) { /* kill the commit thread, free journal ram */
journal_release_error(NULL, s);
}
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 4525747..dc4b327 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -84,14 +84,6 @@
*/
#define in_nmi() (preempt_count() & NMI_MASK)

-#if defined(CONFIG_PREEMPT)
-# define PREEMPT_INATOMIC_BASE kernel_locked()
-# define PREEMPT_CHECK_OFFSET 1
-#else
-# define PREEMPT_INATOMIC_BASE 0
-# define PREEMPT_CHECK_OFFSET 0
-#endif
-
/*
* Are we running in atomic context? WARNING: this macro cannot
* always detect atomic context; in particular, it cannot know about
@@ -99,11 +91,17 @@
* used in the general case to determine whether sleeping is possible.
* Do not use in_atomic() in driver code.
*/
-#define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_INATOMIC_BASE)
+#define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != 0)
+
+#ifdef CONFIG_PREEMPT
+# define PREEMPT_CHECK_OFFSET 1
+#else
+# define PREEMPT_CHECK_OFFSET 0
+#endif

/*
* Check whether we were atomic before we did preempt_disable():
- * (used by the scheduler, *after* releasing the kernel lock)
+ * (used by the scheduler)
*/
#define in_atomic_preempt_off() \
((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET)
diff --git a/include/linux/reiserfs_fs.h b/include/linux/reiserfs_fs.h
index 2245c78..6587b4e 100644
--- a/include/linux/reiserfs_fs.h
+++ b/include/linux/reiserfs_fs.h
@@ -52,11 +52,15 @@
#define REISERFS_IOC32_GETVERSION FS_IOC32_GETVERSION
#define REISERFS_IOC32_SETVERSION FS_IOC32_SETVERSION

-/* Locking primitives */
-/* Right now we are still falling back to (un)lock_kernel, but eventually that
- would evolve into real per-fs locks */
-#define reiserfs_write_lock( sb ) lock_kernel()
-#define reiserfs_write_unlock( sb ) unlock_kernel()
+/*
+ * Locking primitives. The write lock is a special per-superblock
+ * mutex whose properties are close to those of the Big Kernel Lock,
+ * which was used in the previous locking scheme.
+ */
+void reiserfs_write_lock(struct super_block *s);
+void reiserfs_write_unlock(struct super_block *s);
+int reiserfs_write_lock_once(struct super_block *s);
+void reiserfs_write_unlock_once(struct super_block *s, int lock_depth);

struct fid;

diff --git a/include/linux/reiserfs_fs_sb.h b/include/linux/reiserfs_fs_sb.h
index 5621d87..cec8319 100644
--- a/include/linux/reiserfs_fs_sb.h
+++ b/include/linux/reiserfs_fs_sb.h
@@ -7,6 +7,8 @@
#ifdef __KERNEL__
#include <linux/workqueue.h>
#include <linux/rwsem.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
#endif

typedef enum {
@@ -355,6 +357,13 @@ struct reiserfs_sb_info {
struct reiserfs_journal *s_journal; /* pointer to journal information */
unsigned short s_mount_state; /* reiserfs state (valid, invalid) */

+ /* Serializes writers' access, replaces the old bkl */
+ struct mutex lock;
+ /* Owner of the lock (can be recursive) */
+ struct task_struct *lock_owner;
+ /* Depth of the lock, starts from -1 like the bkl */
+ int lock_depth;
+
/* Comment? -Hans */
void (*end_io_handler) (struct buffer_head *, int);
hashf_t s_hash_function; /* pointer to function which is used
diff --git a/include/linux/smp_lock.h b/include/linux/smp_lock.h
index 813be59..c80ad37 100644
--- a/include/linux/smp_lock.h
+++ b/include/linux/smp_lock.h
@@ -1,29 +1,9 @@
#ifndef __LINUX_SMPLOCK_H
#define __LINUX_SMPLOCK_H

-#ifdef CONFIG_LOCK_KERNEL
+#include <linux/compiler.h>
#include <linux/sched.h>

-#define kernel_locked() (current->lock_depth >= 0)
-
-extern int __lockfunc __reacquire_kernel_lock(void);
-extern void __lockfunc __release_kernel_lock(void);
-
-/*
- * Release/re-acquire global kernel lock for the scheduler
- */
-#define release_kernel_lock(tsk) do { \
- if (unlikely((tsk)->lock_depth >= 0)) \
- __release_kernel_lock(); \
-} while (0)
-
-static inline int reacquire_kernel_lock(struct task_struct *task)
-{
- if (unlikely(task->lock_depth >= 0))
- return __reacquire_kernel_lock();
- return 0;
-}
-
extern void __lockfunc lock_kernel(void) __acquires(kernel_lock);
extern void __lockfunc unlock_kernel(void) __releases(kernel_lock);

@@ -39,14 +19,12 @@ static inline void cycle_kernel_lock(void)
unlock_kernel();
}

-#else
+static inline int kernel_locked(void)
+{
+ return current->lock_depth >= 0;
+}

-#define lock_kernel() do { } while(0)
-#define unlock_kernel() do { } while(0)
-#define release_kernel_lock(task) do { } while(0)
#define cycle_kernel_lock() do { } while(0)
-#define reacquire_kernel_lock(task) 0
-#define kernel_locked() 1
+extern void debug_print_bkl(void);

-#endif /* CONFIG_LOCK_KERNEL */
-#endif /* __LINUX_SMPLOCK_H */
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index 7be4d38..51d9ae7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -57,11 +57,6 @@ config BROKEN_ON_SMP
depends on BROKEN || !SMP
default y

-config LOCK_KERNEL
- bool
- depends on SMP || PREEMPT
- default y
-
config INIT_ENV_ARG_LIMIT
int
default 32 if !UML
diff --git a/init/main.c b/init/main.c
index 3585f07..ab13ebb 100644
--- a/init/main.c
+++ b/init/main.c
@@ -457,7 +457,6 @@ static noinline void __init_refok rest_init(void)
numa_default_policy();
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
- unlock_kernel();

/*
* The boot idle thread must execute schedule()
@@ -557,7 +556,6 @@ asmlinkage void __init start_kernel(void)
* Interrupts are still disabled. Do necessary setups, then
* enable them
*/
- lock_kernel();
tick_init();
boot_cpu_init();
page_address_init();
@@ -631,6 +629,8 @@ asmlinkage void __init start_kernel(void)
*/
locking_selftest();

+ lock_kernel();
+
#ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
@@ -677,6 +677,7 @@ asmlinkage void __init start_kernel(void)
signals_init();
/* rootfs populating might need page-writeback */
page_writeback_init();
+ unlock_kernel();
#ifdef CONFIG_PROC_FS
proc_root_init();
#endif
@@ -801,7 +802,6 @@ static noinline int init_post(void)
/* need to finish all async __init code before freeing the memory */
async_synchronize_full();
free_initmem();
- unlock_kernel();
mark_rodata_ro();
system_state = SYSTEM_RUNNING;
numa_default_policy();
@@ -841,7 +841,6 @@ static noinline int init_post(void)

static int __init kernel_init(void * unused)
{
- lock_kernel();
/*
* init can run on any cpu.
*/
diff --git a/kernel/fork.c b/kernel/fork.c
index b9e2edd..b5c5089 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -63,6 +63,7 @@
#include <linux/fs_struct.h>
#include <trace/sched.h>
#include <linux/magic.h>
+#include <linux/smp_lock.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -955,6 +956,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
struct task_struct *p;
int cgroup_callbacks_done = 0;

+ if (system_state == SYSTEM_RUNNING && kernel_locked())
+ debug_check_no_locks_held(current);
+
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 022a492..c790a59 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -13,6 +13,7 @@
#include <linux/freezer.h>
#include <linux/kthread.h>
#include <linux/lockdep.h>
+#include <linux/smp_lock.h>
#include <linux/module.h>
#include <linux/sysctl.h>

@@ -100,6 +101,8 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
sched_show_task(t);
__debug_show_held_locks(t);

+ debug_print_bkl();
+
touch_nmi_watchdog();

if (sysctl_hung_task_panic)
diff --git a/kernel/kmod.c b/kernel/kmod.c
index b750675..de0fe01 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -36,6 +36,8 @@
#include <linux/resource.h>
#include <linux/notifier.h>
#include <linux/suspend.h>
+#include <linux/smp_lock.h>
+
#include <asm/uaccess.h>

extern int max_threads;
@@ -78,6 +80,7 @@ int __request_module(bool wait, const char *fmt, ...)
static atomic_t kmod_concurrent = ATOMIC_INIT(0);
#define MAX_KMOD_CONCURRENT 50 /* Completely arbitrary value - KAO */
static int kmod_loop_msg;
+ int bkl = kernel_locked();

va_start(args, fmt);
ret = vsnprintf(module_name, MODULE_NAME_LEN, fmt, args);
@@ -109,9 +112,28 @@ int __request_module(bool wait, const char *fmt, ...)
return -ENOMEM;
}

+ /*
+ * usermodehelper blocks waiting for modprobe. We cannot
+ * do that with the BKL held. Also emit a (one time)
+ * warning about callsites that do this:
+ */
+ if (bkl) {
+ if (debug_locks) {
+ WARN_ON_ONCE(1);
+ debug_show_held_locks(current);
+ debug_locks_off();
+ }
+ unlock_kernel();
+ }
+
ret = call_usermodehelper(modprobe_path, argv, envp,
wait ? UMH_WAIT_PROC : UMH_WAIT_EXEC);
+
atomic_dec(&kmod_concurrent);
+
+ if (bkl)
+ lock_kernel();
+
return ret;
}
EXPORT_SYMBOL(__request_module);
diff --git a/kernel/sched.c b/kernel/sched.c
index 5724508..84155c6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5020,9 +5020,6 @@ asmlinkage void __sched __schedule(void)
prev = rq->curr;
switch_count = &prev->nivcsw;

- release_kernel_lock(prev);
-need_resched_nonpreemptible:
-
schedule_debug(prev);

if (sched_feat(HRTICK))
@@ -5068,10 +5065,7 @@ need_resched_nonpreemptible:
} else
spin_unlock_irq(&rq->lock);

- if (unlikely(reacquire_kernel_lock(current) < 0))
- goto need_resched_nonpreemptible;
}
-
asmlinkage void __sched schedule(void)
{
need_resched:
@@ -6253,11 +6247,6 @@ static void __cond_resched(void)
#ifdef CONFIG_DEBUG_SPINLOCK_SLEEP
__might_sleep(__FILE__, __LINE__);
#endif
- /*
- * The BKS might be reacquired before we have dropped
- * PREEMPT_ACTIVE, which could trigger a second
- * cond_resched() call.
- */
do {
add_preempt_count(PREEMPT_ACTIVE);
schedule();
@@ -6565,11 +6554,8 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
spin_unlock_irqrestore(&rq->lock, flags);

/* Set the preempt count _outside_ the spinlocks! */
-#if defined(CONFIG_PREEMPT)
- task_thread_info(idle)->preempt_count = (idle->lock_depth >= 0);
-#else
task_thread_info(idle)->preempt_count = 0;
-#endif
+
/*
* The idle tasks have their own, simple scheduling class:
*/
diff --git a/kernel/softlockup.c b/kernel/softlockup.c
index 88796c3..6c18577 100644
--- a/kernel/softlockup.c
+++ b/kernel/softlockup.c
@@ -17,6 +17,7 @@
#include <linux/notifier.h>
#include <linux/module.h>
#include <linux/sysctl.h>
+#include <linux/smp_lock.h>

#include <asm/irq_regs.h>

diff --git a/kernel/sys.c b/kernel/sys.c
index e7998cf..b740a21 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -8,7 +8,7 @@
#include <linux/mm.h>
#include <linux/utsname.h>
#include <linux/mman.h>
-#include <linux/smp_lock.h>
+#include <linux/mutex.h>
#include <linux/notifier.h>
#include <linux/reboot.h>
#include <linux/prctl.h>
@@ -356,6 +356,8 @@ EXPORT_SYMBOL_GPL(kernel_power_off);
*
* reboot doesn't sync: do that yourself before calling this.
*/
+DEFINE_MUTEX(reboot_lock);
+
SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
void __user *, arg)
{
@@ -380,7 +382,7 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
if ((cmd == LINUX_REBOOT_CMD_POWER_OFF) && !pm_power_off)
cmd = LINUX_REBOOT_CMD_HALT;

- lock_kernel();
+ mutex_lock(&reboot_lock);
switch (cmd) {
case LINUX_REBOOT_CMD_RESTART:
kernel_restart(NULL);
@@ -396,19 +398,19 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,

case LINUX_REBOOT_CMD_HALT:
kernel_halt();
- unlock_kernel();
+ mutex_unlock(&reboot_lock);
do_exit(0);
panic("cannot halt");

case LINUX_REBOOT_CMD_POWER_OFF:
kernel_power_off();
- unlock_kernel();
+ mutex_unlock(&reboot_lock);
do_exit(0);
break;

case LINUX_REBOOT_CMD_RESTART2:
if (strncpy_from_user(&buffer[0], arg, sizeof(buffer) - 1) < 0) {
- unlock_kernel();
+ mutex_unlock(&reboot_lock);
return -EFAULT;
}
buffer[sizeof(buffer) - 1] = '\0';
@@ -432,7 +434,8 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
ret = -EINVAL;
break;
}
- unlock_kernel();
+ mutex_unlock(&reboot_lock);
+
return ret;
}

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 1ce5dc6..18d9e86 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -489,13 +489,6 @@ __acquires(kernel_lock)
return -1;
}

- /*
- * When this gets called we hold the BKL which means that
- * preemption is disabled. Various trace selftests however
- * need to disable and enable preemption for successful tests.
- * So we drop the BKL here and grab it after the tests again.
- */
- unlock_kernel();
mutex_lock(&trace_types_lock);

tracing_selftest_running = true;
@@ -583,7 +576,6 @@ __acquires(kernel_lock)
#endif

out_unlock:
- lock_kernel();
return ret;
}

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f71fb2a..d0868e8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -399,13 +399,26 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
void flush_workqueue(struct workqueue_struct *wq)
{
const struct cpumask *cpu_map = wq_cpu_map(wq);
+ int bkl = kernel_locked();
int cpu;

might_sleep();
+ if (bkl) {
+ if (debug_locks) {
+ WARN_ON_ONCE(1);
+ debug_show_held_locks(current);
+ debug_locks_off();
+ }
+ unlock_kernel();
+ }
+
lock_map_acquire(&wq->lockdep_map);
lock_map_release(&wq->lockdep_map);
for_each_cpu(cpu, cpu_map)
flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
+
+ if (bkl)
+ lock_kernel();
}
EXPORT_SYMBOL_GPL(flush_workqueue);

diff --git a/lib/Makefile b/lib/Makefile
index d6edd67..9894a52 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -21,7 +21,7 @@ lib-y += kobject.o kref.o klist.o

obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
- string_helpers.o
+ kernel_lock.o string_helpers.o

ifeq ($(CONFIG_DEBUG_KOBJECT),y)
CFLAGS_kobject.o += -DDEBUG
@@ -40,7 +40,6 @@ lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
-obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o
diff --git a/lib/kernel_lock.c b/lib/kernel_lock.c
index 39f1029..ca03ae8 100644
--- a/lib/kernel_lock.c
+++ b/lib/kernel_lock.c
@@ -1,131 +1,67 @@
/*
- * lib/kernel_lock.c
+ * This is the Big Kernel Lock - the traditional lock that we
+ * inherited from the uniprocessor Linux kernel a decade ago.
*
- * This is the traditional BKL - big kernel lock. Largely
- * relegated to obsolescence, but used by various less
+ * Largely relegated to obsolescence, but used by various less
* important (or lazy) subsystems.
- */
-#include <linux/smp_lock.h>
-#include <linux/module.h>
-#include <linux/kallsyms.h>
-#include <linux/semaphore.h>
-
-/*
- * The 'big kernel lock'
- *
- * This spinlock is taken and released recursively by lock_kernel()
- * and unlock_kernel(). It is transparently dropped and reacquired
- * over schedule(). It is used to protect legacy code that hasn't
- * been migrated to a proper locking design yet.
*
* Don't use in new code.
- */
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(kernel_flag);
-
-
-/*
- * Acquire/release the underlying lock from the scheduler.
*
- * This is called with preemption disabled, and should
- * return an error value if it cannot get the lock and
- * TIF_NEED_RESCHED gets set.
+ * It now has plain mutex semantics (i.e. no auto-drop on
+ * schedule() anymore), combined with a very simple self-recursion
+ * layer that allows the traditional nested use:
*
- * If it successfully gets the lock, it should increment
- * the preemption count like any spinlock does.
+ * lock_kernel();
+ * lock_kernel();
+ * unlock_kernel();
+ * unlock_kernel();
*
- * (This works on UP too - _raw_spin_trylock will never
- * return false in that case)
+ * Please migrate all BKL using code to a plain mutex.
*/
-int __lockfunc __reacquire_kernel_lock(void)
-{
- while (!_raw_spin_trylock(&kernel_flag)) {
- if (need_resched())
- return -EAGAIN;
- cpu_relax();
- }
- preempt_disable();
- return 0;
-}
+#include <linux/smp_lock.h>
+#include <linux/kallsyms.h>
+#include <linux/module.h>
+#include <linux/mutex.h>

-void __lockfunc __release_kernel_lock(void)
-{
- _raw_spin_unlock(&kernel_flag);
- preempt_enable_no_resched();
-}
+static DEFINE_MUTEX(kernel_mutex);

/*
- * These are the BKL spinlocks - we try to be polite about preemption.
- * If SMP is not on (ie UP preemption), this all goes away because the
- * _raw_spin_trylock() will always succeed.
+ * Get the big kernel lock:
*/
-#ifdef CONFIG_PREEMPT
-static inline void __lock_kernel(void)
+void __lockfunc lock_kernel(void)
{
- preempt_disable();
- if (unlikely(!_raw_spin_trylock(&kernel_flag))) {
- /*
- * If preemption was disabled even before this
- * was called, there's nothing we can be polite
- * about - just spin.
- */
- if (preempt_count() > 1) {
- _raw_spin_lock(&kernel_flag);
- return;
- }
+ struct task_struct *task = current;
+ int depth = task->lock_depth + 1;

+ if (likely(!depth))
/*
- * Otherwise, let's wait for the kernel lock
- * with preemption enabled..
+ * No recursion worries - we set up lock_depth _after_
*/
- do {
- preempt_enable();
- while (spin_is_locked(&kernel_flag))
- cpu_relax();
- preempt_disable();
- } while (!_raw_spin_trylock(&kernel_flag));
- }
-}
-
-#else
+ mutex_lock(&kernel_mutex);

-/*
- * Non-preemption case - just get the spinlock
- */
-static inline void __lock_kernel(void)
-{
- _raw_spin_lock(&kernel_flag);
+ task->lock_depth = depth;
}
-#endif

-static inline void __unlock_kernel(void)
+void __lockfunc unlock_kernel(void)
{
- /*
- * the BKL is not covered by lockdep, so we open-code the
- * unlocking sequence (and thus avoid the dep-chain ops):
- */
- _raw_spin_unlock(&kernel_flag);
- preempt_enable();
-}
+ struct task_struct *task = current;

-/*
- * Getting the big kernel lock.
- *
- * This cannot happen asynchronously, so we only need to
- * worry about other CPU's.
- */
-void __lockfunc lock_kernel(void)
-{
- int depth = current->lock_depth+1;
- if (likely(!depth))
- __lock_kernel();
- current->lock_depth = depth;
+ if (WARN_ON_ONCE(task->lock_depth < 0))
+ return;
+
+ if (likely(--task->lock_depth < 0))
+ mutex_unlock(&kernel_mutex);
}

-void __lockfunc unlock_kernel(void)
+void debug_print_bkl(void)
{
- BUG_ON(current->lock_depth < 0);
- if (likely(--current->lock_depth < 0))
- __unlock_kernel();
+#ifdef CONFIG_DEBUG_MUTEXES
+ if (mutex_is_locked(&kernel_mutex)) {
+ printk(KERN_EMERG "BUG: **** BKL held by: %d:%s\n",
+ kernel_mutex.owner->task->pid,
+ kernel_mutex.owner->task->comm);
+ }
+#endif
}

EXPORT_SYMBOL(lock_kernel);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index ff50a05..e28d0fd 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -224,9 +224,15 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue);

static int rpc_wait_bit_killable(void *word)
{
+ int bkl = kernel_locked();
+
if (fatal_signal_pending(current))
return -ERESTARTSYS;
+ if (bkl)
+ unlock_kernel();
schedule();
+ if (bkl)
+ lock_kernel();
return 0;
}

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index c200d92..acfb60c 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -600,6 +600,7 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
struct xdr_buf *arg;
DECLARE_WAITQUEUE(wait, current);
long time_left;
+ int bkl = kernel_locked();

dprintk("svc: server %p waiting for data (to = %ld)\n",
rqstp, timeout);
@@ -624,7 +625,11 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
set_current_state(TASK_RUNNING);
return -EINTR;
}
+ if (bkl)
+ unlock_kernel();
schedule_timeout(msecs_to_jiffies(500));
+ if (bkl)
+ lock_kernel();
}
rqstp->rq_pages[i] = p;
}
@@ -643,7 +648,11 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
arg->tail[0].iov_len = 0;

try_to_freeze();
+ if (bkl)
+ unlock_kernel();
cond_resched();
+ if (bkl)
+ lock_kernel();
if (signalled() || kthread_should_stop())
return -EINTR;

@@ -685,7 +694,11 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
add_wait_queue(&rqstp->rq_wait, &wait);
spin_unlock_bh(&pool->sp_lock);

+ if (bkl)
+ unlock_kernel();
time_left = schedule_timeout(timeout);
+ if (bkl)
+ lock_kernel();

try_to_freeze();

diff --git a/sound/core/info.c b/sound/core/info.c
index 35df614..eb81d55 100644
--- a/sound/core/info.c
+++ b/sound/core/info.c
@@ -22,7 +22,6 @@
#include <linux/init.h>
#include <linux/time.h>
#include <linux/mm.h>
-#include <linux/smp_lock.h>
#include <linux/string.h>
#include <sound/core.h>
#include <sound/minors.h>
@@ -163,13 +162,14 @@ static void snd_remove_proc_entry(struct proc_dir_entry *parent,

static loff_t snd_info_entry_llseek(struct file *file, loff_t offset, int orig)
{
+ struct inode *inode = file->f_path.dentry->d_inode;
struct snd_info_private_data *data;
struct snd_info_entry *entry;
loff_t ret;

data = file->private_data;
entry = data->entry;
- lock_kernel();
+ mutex_lock(&inode->i_mutex);
switch (entry->content) {
case SNDRV_INFO_CONTENT_TEXT:
switch (orig) {
@@ -198,7 +198,7 @@ static loff_t snd_info_entry_llseek(struct file *file, loff_t offset, int orig)
}
ret = -ENXIO;
out:
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return ret;
}

diff --git a/sound/core/sound.c b/sound/core/sound.c
index 7872a02..b4ba31d 100644
--- a/sound/core/sound.c
+++ b/sound/core/sound.c
@@ -21,7 +21,6 @@

#include <linux/init.h>
#include <linux/slab.h>
-#include <linux/smp_lock.h>
#include <linux/time.h>
#include <linux/device.h>
#include <linux/moduleparam.h>
@@ -172,9 +171,9 @@ static int snd_open(struct inode *inode, struct file *file)
{
int ret;

- lock_kernel();
+ mutex_lock(&inode->i_mutex);
ret = __snd_open(inode, file);
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return ret;
}

diff --git a/sound/oss/au1550_ac97.c b/sound/oss/au1550_ac97.c
index 4191acc..98318b0 100644
--- a/sound/oss/au1550_ac97.c
+++ b/sound/oss/au1550_ac97.c
@@ -49,7 +49,6 @@
#include <linux/poll.h>
#include <linux/bitops.h>
#include <linux/spinlock.h>
-#include <linux/smp_lock.h>
#include <linux/ac97_codec.h>
#include <linux/mutex.h>

@@ -1254,7 +1253,6 @@ au1550_mmap(struct file *file, struct vm_area_struct *vma)
unsigned long size;
int ret = 0;

- lock_kernel();
mutex_lock(&s->sem);
if (vma->vm_flags & VM_WRITE)
db = &s->dma_dac;
@@ -1282,7 +1280,6 @@ au1550_mmap(struct file *file, struct vm_area_struct *vma)
db->mapped = 1;
out:
mutex_unlock(&s->sem);
- unlock_kernel();
return ret;
}

@@ -1854,12 +1851,9 @@ au1550_release(struct inode *inode, struct file *file)
{
struct au1550_state *s = (struct au1550_state *)file->private_data;

- lock_kernel();

if (file->f_mode & FMODE_WRITE) {
- unlock_kernel();
drain_dac(s, file->f_flags & O_NONBLOCK);
- lock_kernel();
}

mutex_lock(&s->open_mutex);
@@ -1876,7 +1870,6 @@ au1550_release(struct inode *inode, struct file *file)
s->open_mode &= ((~file->f_mode) & (FMODE_READ|FMODE_WRITE));
mutex_unlock(&s->open_mutex);
wake_up(&s->open_wait);
- unlock_kernel();
return 0;
}

diff --git a/sound/oss/dmasound/dmasound_core.c b/sound/oss/dmasound/dmasound_core.c
index 793b7f4..86d7b9f 100644
--- a/sound/oss/dmasound/dmasound_core.c
+++ b/sound/oss/dmasound/dmasound_core.c
@@ -181,7 +181,7 @@
#include <linux/init.h>
#include <linux/soundcard.h>
#include <linux/poll.h>
-#include <linux/smp_lock.h>
+#include <linux/mutex.h>

#include <asm/uaccess.h>

@@ -329,10 +329,10 @@ static int mixer_open(struct inode *inode, struct file *file)

static int mixer_release(struct inode *inode, struct file *file)
{
- lock_kernel();
+ mutex_lock(&inode->i_mutex);
mixer.busy = 0;
module_put(dmasound.mach.owner);
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return 0;
}
static int mixer_ioctl(struct inode *inode, struct file *file, u_int cmd,
@@ -848,7 +848,7 @@ static int sq_release(struct inode *inode, struct file *file)
{
int rc = 0;

- lock_kernel();
+ mutex_lock(&inode->i_mutex);

if (file->f_mode & FMODE_WRITE) {
if (write_sq.busy)
@@ -879,7 +879,7 @@ static int sq_release(struct inode *inode, struct file *file)
write_sq_wake_up(file); /* checks f_mode */
#endif /* blocking open() */

- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);

return rc;
}
@@ -1296,10 +1296,10 @@ printk("dmasound: stat buffer used %d bytes\n", len) ;

static int state_release(struct inode *inode, struct file *file)
{
- lock_kernel();
+ mutex_lock(&inode->i_mutex);
state.busy = 0;
module_put(dmasound.mach.owner);
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return 0;
}

diff --git a/sound/oss/msnd_pinnacle.c b/sound/oss/msnd_pinnacle.c
index bf27e00..039f57d 100644
--- a/sound/oss/msnd_pinnacle.c
+++ b/sound/oss/msnd_pinnacle.c
@@ -40,7 +40,7 @@
#include <linux/delay.h>
#include <linux/init.h>
#include <linux/interrupt.h>
-#include <linux/smp_lock.h>
+#include <linux/mutex.h>
#include <asm/irq.h>
#include <asm/io.h>
#include "sound_config.h"
@@ -791,14 +791,14 @@ static int dev_release(struct inode *inode, struct file *file)
int minor = iminor(inode);
int err = 0;

- lock_kernel();
+ mutex_lock(&inode->i_mutex);
if (minor == dev.dsp_minor)
err = dsp_release(file);
else if (minor == dev.mixer_minor) {
/* nothing */
} else
err = -EINVAL;
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return err;
}

diff --git a/sound/oss/soundcard.c b/sound/oss/soundcard.c
index 61aaeda..5376d7e 100644
--- a/sound/oss/soundcard.c
+++ b/sound/oss/soundcard.c
@@ -41,7 +41,7 @@
#include <linux/major.h>
#include <linux/delay.h>
#include <linux/proc_fs.h>
-#include <linux/smp_lock.h>
+#include <linux/mutex.h>
#include <linux/module.h>
#include <linux/mm.h>
#include <linux/device.h>
@@ -143,6 +143,7 @@ static int get_mixer_levels(void __user * arg)

static ssize_t sound_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
{
+ struct inode *inode = file->f_path.dentry->d_inode;
int dev = iminor(file->f_path.dentry->d_inode);
int ret = -EINVAL;

@@ -152,7 +153,7 @@ static ssize_t sound_read(struct file *file, char __user *buf, size_t count, lof
* big one anyway, we might as well bandage here..
*/

- lock_kernel();
+ mutex_lock(&inode->i_mutex);

DEB(printk("sound_read(dev=%d, count=%d)\n", dev, count));
switch (dev & 0x0f) {
@@ -170,16 +171,17 @@ static ssize_t sound_read(struct file *file, char __user *buf, size_t count, lof
case SND_DEV_MIDIN:
ret = MIDIbuf_read(dev, file, buf, count);
}
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return ret;
}

static ssize_t sound_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
{
+ struct inode *inode = file->f_path.dentry->d_inode;
int dev = iminor(file->f_path.dentry->d_inode);
int ret = -EINVAL;

- lock_kernel();
+ mutex_lock(&inode->i_mutex);
DEB(printk("sound_write(dev=%d, count=%d)\n", dev, count));
switch (dev & 0x0f) {
case SND_DEV_SEQ:
@@ -197,7 +199,7 @@ static ssize_t sound_write(struct file *file, const char __user *buf, size_t cou
ret = MIDIbuf_write(dev, file, buf, count);
break;
}
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return ret;
}

@@ -254,7 +256,7 @@ static int sound_release(struct inode *inode, struct file *file)
{
int dev = iminor(inode);

- lock_kernel();
+ mutex_lock(&inode->i_mutex);
DEB(printk("sound_release(dev=%d)\n", dev));
switch (dev & 0x0f) {
case SND_DEV_CTL:
@@ -279,7 +281,7 @@ static int sound_release(struct inode *inode, struct file *file)
default:
printk(KERN_ERR "Sound error: Releasing unknown device 0x%02x\n", dev);
}
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);

return 0;
}
@@ -417,6 +419,7 @@ static unsigned int sound_poll(struct file *file, poll_table * wait)

static int sound_mmap(struct file *file, struct vm_area_struct *vma)
{
+ struct inode *inode = file->f_path.dentry->d_inode;
int dev_class;
unsigned long size;
struct dma_buffparms *dmap = NULL;
@@ -429,35 +432,35 @@ static int sound_mmap(struct file *file, struct vm_area_struct *vma)
printk(KERN_ERR "Sound: mmap() not supported for other than audio devices\n");
return -EINVAL;
}
- lock_kernel();
+ mutex_lock(&inode->i_mutex);
if (vma->vm_flags & VM_WRITE) /* Map write and read/write to the output buf */
dmap = audio_devs[dev]->dmap_out;
else if (vma->vm_flags & VM_READ)
dmap = audio_devs[dev]->dmap_in;
else {
printk(KERN_ERR "Sound: Undefined mmap() access\n");
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return -EINVAL;
}

if (dmap == NULL) {
printk(KERN_ERR "Sound: mmap() error. dmap == NULL\n");
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return -EIO;
}
if (dmap->raw_buf == NULL) {
printk(KERN_ERR "Sound: mmap() called when raw_buf == NULL\n");
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return -EIO;
}
if (dmap->mapping_flags) {
printk(KERN_ERR "Sound: mmap() called twice for the same DMA buffer\n");
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return -EIO;
}
if (vma->vm_pgoff != 0) {
printk(KERN_ERR "Sound: mmap() offset must be 0.\n");
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return -EINVAL;
}
size = vma->vm_end - vma->vm_start;
@@ -468,7 +471,7 @@ static int sound_mmap(struct file *file, struct vm_area_struct *vma)
if (remap_pfn_range(vma, vma->vm_start,
virt_to_phys(dmap->raw_buf) >> PAGE_SHIFT,
vma->vm_end - vma->vm_start, vma->vm_page_prot)) {
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return -EAGAIN;
}

@@ -480,7 +483,7 @@ static int sound_mmap(struct file *file, struct vm_area_struct *vma)
memset(dmap->raw_buf,
dmap->neutral_byte,
dmap->bytes_in_use);
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return 0;
}

diff --git a/sound/oss/vwsnd.c b/sound/oss/vwsnd.c
index 187f727..f14e81d 100644
--- a/sound/oss/vwsnd.c
+++ b/sound/oss/vwsnd.c
@@ -145,7 +145,6 @@
#include <linux/init.h>

#include <linux/spinlock.h>
-#include <linux/smp_lock.h>
#include <linux/wait.h>
#include <linux/interrupt.h>
#include <linux/mutex.h>
@@ -3005,7 +3004,6 @@ static int vwsnd_audio_release(struct inode *inode, struct file *file)
vwsnd_port_t *wport = NULL, *rport = NULL;
int err = 0;

- lock_kernel();
mutex_lock(&devc->io_mutex);
{
DBGEV("(inode=0x%p, file=0x%p)\n", inode, file);
@@ -3033,7 +3031,6 @@ static int vwsnd_audio_release(struct inode *inode, struct file *file)
wake_up(&devc->open_wait);
DEC_USE_COUNT;
DBGR();
- unlock_kernel();
return err;
}

diff --git a/sound/sound_core.c b/sound/sound_core.c
index 2b302bb..76691a0 100644
--- a/sound/sound_core.c
+++ b/sound/sound_core.c
@@ -515,7 +515,7 @@ static int soundcore_open(struct inode *inode, struct file *file)
struct sound_unit *s;
const struct file_operations *new_fops = NULL;

- lock_kernel ();
+ mutex_lock(&inode->i_mutex);

chain=unit&0x0F;
if(chain==4 || chain==5) /* dsp/audio/dsp16 */
@@ -564,11 +564,11 @@ static int soundcore_open(struct inode *inode, struct file *file)
file->f_op = fops_get(old_fops);
}
fops_put(old_fops);
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return err;
}
spin_unlock(&sound_loader_lock);
- unlock_kernel();
+ mutex_unlock(&inode->i_mutex);
return -ENODEV;
}
