[PATCH 00/08] dm: serious race conditions need addressing

From: Jeff Mahoney
Date: Thu Apr 13 2006 - 16:36:38 EST



Hi all -

While recently hunting down a series of different Oopses observed
in the dm code on multiprocessor systems, I discovered the cause:
There are serious race conditions in the dm subsystem.

The biggest is a result of a patch submitted in Aug 2004 that converted
the minor tracking code from using a dynamically allocated bitmap to
an IDR. With a bitmap, it was enough to mark a minor as allocated, but
the IDR patch used the mapped_device itself to mark a minor off.

I'm surprised we haven't seen many more dm bug reports that we have,
since the approach has a key problem: It places an incompletely
initialized structure on a subsystem-wide IDR, and then drops the lock
protecting the IDR before it completes the initialization. The result is
that the rest of the subsystem is free to access the mapped_device, and
the results can vary depending on where in the initialization process the
mapped_device is.

cpu1 (dmsetup create) cpu2 (dmsetup anything else)
alloc_dev
->next_free_minor
(now is in IDR) find_device
-> dm_get_md (looks in IDR)
* does something with
uninitialized data *
.. finishes initializing,
overwriting anything the
other path has done

I've observed a number of different faults, including generic bad
pointer derefs, and BUGs from mempool due to reference count
corruption resulting in an early deallocation.

There are other races where dm_get() occurs on pointers obtained from
the IDR after the _minor_lock has been dropped. The pointer in this case
could be stale, since the last reference may have been dropped when the
lock was released.

The block layer gets a referenceless pointer to the device. In order to
ensure dm_blk_open isn't occuring while the last reference is being
dropped, it needs to ensure that the pointer is still valid.

The final, more minor, race is that the device continues intializing after
being registered with the block subsystem. A race exists where the event
generated in register_disk could be handled before the device completes
initializing.

In the spirit of submitting patches with small obvious changes, the following
8 patches fix up reference counting in the dm subsystem.
01 - adds an idr_replace IDR library function, so a pointer can be replaced
without needing to do a remove/add pair when a simple traverse-once-
and-assign function will do the same work.
02 - uses a placeholder value to indicate a minor is allocated but
unavailable for use, and then replaces it with the pointer after the
device is completely initialized.
03 - idr_pre_get() is supposed to be called outside of locks and handles
the locking of internal structures itself. This patch moves
idr_pre_get outside of the _minor_lock, in preparation for patch 04.
04 - replaces the _minor_lock mutex with a spinlock so that
atomic_dec_and_lock can be used to properly serialize reference counting
and deallocation.
05 - Due to a chicken/egg problem between dm and the block layer,
gendisk->private_data gets a referenceless pointer to the mapped_device.
This patch adds a DMF_FREEING flag so that the pointer can be validated
before use in dm_blk_open().
06 - In paths where we're searching the IDR, or otherwise holding the
_minor_lock, take the reference while still holding the lock, so
we don't race against dm_put.
07 - Currently, it is possible to unload dm-mod while devices are still in
use. This patch bumps the reference count on device creation and
drops it on removal. Taking a self reference is safe, alloc_dev is
only called from the ioctl path, which must already hold a dm-mod
reference from opening the control file.
08 - Move the last bit of mapped_device initialization above the registration
with the block layer, so that we can ensure the device is completely
initialized before another path gets to it. add_disk()->register_disk()
will cause an event to be generated, so it is possible for a race to
occur given the right conditions.

-Jeff
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/