Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.

From: NeilBrown
Date: Thu Jun 25 2009 - 07:08:10 EST


On Thu, June 25, 2009 6:00 pm, Martin K. Petersen wrote:
>>>>>> "Neil" == Neil Brown <neilb@xxxxxxx> writes:
>
> Neil,
>
> Neil> Of these:
>
> Neil> max_hw_sectors_kb, nr_requests, rq_affinity, iosched/,
> Neil> max_sectors_kb scheduler nomerges rotational
>
> Neil> are really only relevant to the elevator code and those devices
> Neil> that used that code (ide, scsi, etc).
>
> I'm not sure I completely agree with putting rotational in that bucket.
> It affects the choice of allocation policy in btrfs, for instance.
>

I confess to have some uncertainty about this. There is, as far as
I can tell, no documentation for it.

So I asked git why it as added, and it pointed to
commit 1308835ffffe6d61ad1f48c5c381c9cc47f683ec

which suggests that it was added so that user space could tell the
kernel whether the device was rotational, rather than the other way
around.

But yes, maybe 'rotational' doesn't belong with the others.

>
> Neil> Adding a number of extra fields such as minimum_io_size,
> Neil> optimal_io_size etc to '/queue' seems to increase the number of
> Neil> aberrations and enforces md and dm device to have a /queue which
> Neil> is largely irrelevant.
>
> You seem to be hung up on the fact that you don't queue things. I think
> that's beside the point. You *do* have a request_queue thanks to
> calling blk_queue_make_request() in md.c. And there is more to
> request_queue than the values you brought up. Like the callback
> functions. I'm not saying that all the values in request_queue apply to
> MD, but I really don't understand what all the fuss is about. Other
> than the presence of the string "queue" in the choice of naming.
>

Well names are very important. And as I said later we could possibly keep
them in 'queue' and make 'queue' a more generic directory. I don't like
that but it is probably better than the current situation.

As you say, I do currently have a request_queue, but that is an internal
detail, not part of the externally visible interface, and it is something
that is very much in my sights as something I want to change. I'm
still working out the details so I'm a fair way from a concrete proposal
and a long way from some code. That change certainly doesn't have
to happen in any rush. But we should get the externally visible
names "right" if we can.

> Anyway. If you look at the request_queue in the current tree you'll see
> that the very limits we are discussing are contained in a separate
> struct. We can easily move that somewhere else at a later date if that
> is deemed the right thing to do.

Agreed. But not relevant at the moment. The view from user-space is what
is important.

>
>
> Neil> I have suggested to Martin that 2 are enough.
>
> I think I have covered this in a separate mail. You are mixing up
> hardware limitations and I/O hints on the grounds that they went in as
> part of the same patch set and live in the same place.

I think I'm actually mixing them up because they look very much the
same. Both say "try to make write requests at least this big" and
I cannot see the difference being very big to the filesystem.

I tend to mix them up a bit with read_ahead_kb as they are in some
ways just the 'write' version of that. But it didn't arrive in the
same patch set :-)

Also, I think you seem to be treating the read-modify-write behaviour
of a 4K-sector hard drive as different-in-kind to the read-modify-write
behaviour of raid5. I cannot see that. In both cases an error can cause
unexpected corruption and in both cases getting the alignment right helps
throughput a lot. So the only difference between these two values
is the size.
If one is 4K and one is 40Meg and you have 512bytes of data that you
want to write as safely as possibly, you might pad it to 4K, but you
wont pad it to 40Meg.
If you have 32Meg of data that you want to write as safely as you can,
you may well pad it to 40Meg, rather than say "it is a multiple of
4K, that is enough for me".
So: the difference is only in the size.


>
> fdisk/mdadm/dmsetup need to use physical_block_size and alignment_offset
> to prevent us from misaligning when setting up partitions and virtual
> block devices. Also, when stacking devices I need to know these values
> to ensure that the I/O hints set by MD/DM don't conflict with the
> underlying hardware limitations. There are also special cases like
> shared disk setups and filesystem journal padding that may need to know
> details of the hardware atomicity.

"... of the *device* atomicity." It hardly matters whether the device is
hardware or software, it can still have atomicity issues.

>
> mkfs.* can leverage minimum_io_size and optimal_io_size hints to choose
> block sizes and to lay out data structures on stripe boundaries. Just
> like we're doing today except using a common interface for all block
> devices instead of poking at MD and LVM internals.

As we are using generic terms like "minimum" and "optimal", let's keep
with those terms when describing filesystem behaviour and not mention
stripes.
"mkfs.* can leverage these values to choose the most appropriate sizes
and alignments for various data strutures".
What is the most appropriate size? It depends on how much data we have
to store, and how much wastage we can afford. So if there are a range
of reasonably optimal sizes, the fs can choose the best fit, without
needing to pretend to understand why one is more optimal than another.

And the "Just like we're doing today ..." is actually a bit sad.
If you look at mkfs.ext3, it requires 3 values:
- block size
- stride-size
- stripe-width

block size needs to be smallish and a power of 2 and at most
PAGE_SIZE (I think). It would be ideal if this could be stripe-size on
a raid5, but the stripe is usually too large so the next best is
physical_block_size (so, a size based decision).

stripe-width is only really needed on raid4/5/6 as it is aimed at
avoiding read-modify-write, so it would be the stripe size, which would
be minimum_io_size. Though on a SCSI device it should probably
be "OPTIMAL TRANSFER LENGTH GRANULARITY" which I thought would be
optimal_io_size.. so maybe we use that and make minimum_io_size and
optimal_io_size the same on raid5 ?? Or just provide a list of useful
sizes and let the fs choose whatever is in the right ball-park.

stride-size is used for raid0 or raid4, or the "Asymmetric" raid5 layouts.
It allows ext3 to stagger which drive certain 'hot' data structures are
on. The current metrics don't allow for that at all.
I'm not saying they should, and I'm not sure how they could. But it
is sad.

>
> logical_block_size, physical_block_size and alignment_offset are
> hardware limits that need to be honored when creating a (virtual) block
> device or partition.

logical_block_size is a firmware limit. The others are just hints.
Strong hints I agree. Hints we should follow if at all possible.
But still hints.

>
> The minimum/optimal write sizes are hints to the *user* of the block
> device about how to lay out things. If you look at my MD patch you'll
> see that I only set the I/O hints. The hardware settings are off limits
> for MD.
>
>
> I don't particularly care whether we store the values in queue/,
> topology/, metrics/, limits/ or in the device root. Nor whether we call
> it minimum_write_size instead of minimum_io_size. I'll be happy to roll
> up a renaming patch...

Good, thanks.. I do strongly care that they don't go in queue/, and
that they have a name that is as close as possible to what they mean.
Let's see if anyone else has an opinion.

Thanks,
NeilBrown


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/