Re: [PATCH 05/10] block: remove per-queue plugging

From: NeilBrown
Date: Mon Apr 11 2011 - 00:50:43 EST

On Tue, 5 Apr 2011 13:05:41 +1000 NeilBrown <neilb@xxxxxxx> wrote:

> On Wed, 9 Mar 2011 19:58:10 -0500 Mike Snitzer <snitzer@xxxxxxxxxx> wrote:
> > Also, in your MD changes, you removed all calls to md_unplug() but
> > didn't remove md_unplug(). Seems it should be removed along with the
> > 'plug' member of 'struct mddev_t'? Neil?
> I've been distracted by other things and only just managed to have a look at
> this.
> The new plugging code seems to completely ignore the needs of stacked devices
> - or at least my needs in md.
> For RAID1 with a write-intent-bitmap, I queue all write requests and then on
> an unplug I update the write-intent-bitmap to mark all the relevant blocks
> and then release the writes.
> With the new code there is no way for an unplug event to wake up the raid1d
> thread to start the writeout - I haven't tested it but I suspect it will just
> hang.
> Similarly for RAID5 I gather write bios (long before they become 'struct
> request' which is what the plugging code understands) and on an unplug event
> I release the writes - hopefully with enough bios per stripe so that we don't
> need to pre-read.
> Possibly the simplest fix would be to have a second list_head in 'struct
> blk_plug' which contained callbacks (a function pointer and a list_head in a
> struct which is passed as an arg to the function!).
> blk_finish_plug could then walk the list and call the call-backs.
> It would be quite easy to hook into that.

I've implemented this and it seems to work.
Jens: could you please review and hopefully ack the patch below, and let
me know if you will submit it or should I?
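
For readers following along, here is a minimal userspace sketch of the callback idea quoted above. It is not the patch under review: the names plug_cb, sketch_plug, sketch_start_plug, sketch_finish_plug and md_sketch are hypothetical, and the list helpers are simplified stand-ins for the kernel's <linux/list.h>.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h; h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

/* Insert 'entry' just before 'head', i.e. at the tail of the list. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/* Hypothetical: a function pointer and a list_head in one struct,
 * which is itself passed as the argument to the function. */
struct plug_cb {
	struct list_head list;
	void (*callback)(struct plug_cb *cb);
};

/* Hypothetical plug carrying only the callback list. */
struct sketch_plug {
	struct list_head cb_list;
};

static void sketch_start_plug(struct sketch_plug *plug)
{
	INIT_LIST_HEAD(&plug->cb_list);
}

/* Walk the callback list, unlinking each entry before calling it. */
static void sketch_finish_plug(struct sketch_plug *plug)
{
	while (!list_empty(&plug->cb_list)) {
		struct plug_cb *cb =
			container_of(plug->cb_list.next, struct plug_cb, list);
		list_del_init(&cb->list);
		cb->callback(cb);
	}
}

/* Hypothetical stacked driver embedding the callback in its own state. */
struct md_sketch {
	struct plug_cb cb;
	int wakeups;
};

static void md_sketch_unplug(struct plug_cb *cb)
{
	struct md_sketch *md = container_of(cb, struct md_sketch, cb);
	md->wakeups++;	/* stands in for waking the raid1d thread */
}

/* Registers one callback, finishes the plug, returns the wakeup count. */
static int sketch_demo(void)
{
	struct sketch_plug plug;
	struct md_sketch md = { .cb = { .callback = md_sketch_unplug } };

	sketch_start_plug(&plug);
	list_add_tail(&md.cb.list, &plug.cb_list);
	sketch_finish_plug(&plug);
	assert(list_empty(&plug.cb_list));
	return md.wakeups;
}
```

Embedding the callback struct in the driver's own state means the callback can recover its context with container_of and needs no separate data pointer.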

My testing of this combined with some other patches which cause various md
personalities to use it shows up a bug somewhere.

The symptoms are crashes in various places in blk-core. list_sort appears
in the stack trace fairly often, but not always.

This patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 273d60b..903ce8d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug)
 	struct request_queue *q;
 	unsigned long flags;
 	struct request *rq;
+	struct list_head head;

 	BUG_ON(plug->magic != PLUG_MAGIC);

 	if (list_empty(&plug->list))
 		return;
+	list_add(&head, &plug->list);
+	list_del_init(&plug->list);

 	if (plug->should_sort)
-		list_sort(NULL, &plug->list, plug_rq_cmp);
+		list_sort(NULL, &head, plug_rq_cmp);
+	plug->should_sort = 0;

 	q = NULL;
-	while (!list_empty(&plug->list)) {
-		rq = list_entry_rq(plug->list.next);
+	while (!list_empty(&head)) {
+		rq = list_entry_rq(head.next);
 		BUG_ON(!(rq->cmd_flags & REQ_ON_PLUG));

makes the symptom go away. It simply moves the plug list onto a separate
list head before sorting and processing it.
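
The two added lines, list_add(&head, &plug->list) followed by list_del_init(&plug->list), are a compact way of moving every entry onto the local head. The userspace sketch below illustrates that idiom only; the list helpers are simplified stand-ins for the kernel's <linux/list.h>, and list_count and move_to_local_head_demo are hypothetical names used for illustration.

```c
#include <assert.h>

/* Simplified userspace stand-in for the kernel's circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h; h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

/* Insert 'entry' immediately after 'head'. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink 'entry' from its ring and leave it as an empty list. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/* Hypothetical helper: count the entries hanging off 'head'. */
static int list_count(const struct list_head *head)
{
	int n = 0;
	const struct list_head *p;

	for (p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/* Builds a three-entry "plug list", moves it to a local head, and
 * returns how many entries ended up on the local head. */
static int move_to_local_head_demo(void)
{
	struct list_head plug_list, head, a, b, c;

	INIT_LIST_HEAD(&plug_list);
	list_add(&a, &plug_list);
	list_add(&b, &plug_list);
	list_add(&c, &plug_list);

	/* Link the local head into the ring right after the old head;
	 * list_add sets both of its pointers, so no prior init is needed. */
	list_add(&head, &plug_list);
	/* Drop the old head out of the ring; it becomes an empty list and
	 * 'head' now owns every entry. */
	list_del_init(&plug_list);

	assert(list_empty(&plug_list));
	return list_count(&head);
}
```

After the move, anything that looks at the original list head sees an empty list, so a second flush of the same plug would return early instead of walking entries that are already being processed.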
My test was simply writing to a RAID1 with dd:
while true; do dd if=/dev/zero of=/dev/md0 bs=4k; done

Obviously all writes go to two devices so the plug list will always need
sorting.

The only explanation I can come up with is that very occasionally schedule()
on 2 separate CPUs calls blk_flush_plug for the same task. I don't understand
the scheduler nearly well enough to know if or how that can happen.
However with this patch in place I can write to a RAID1 constantly for half
an hour, and without it, the write rarely lasts for 3 minutes.

If you want to reproduce my experiment, you can pull from
git:// plug-test
to get my patches for plugging in md (which are not quite ready for submission
but seem to work), create a RAID1 using e.g.
mdadm -C /dev/md0 --level=1 --raid-disks=2 /dev/device1 /dev/device2
and then write to it in a loop:
while true; do dd if=/dev/zero of=/dev/md0 bs=4K ; done