Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling touse bmap to file blocks.

From: Neil Brown
Date: Sun May 14 2006 - 20:30:19 EST

Next message: John Lenz: "Re: MMC drivers for 2.6 collie"
Previous message: Jon Smirl: "Re: Add a "enable" sysfs attribute to the pci devices to allow userspace (Xorg) to enable devices without doing foul direct access"
In reply to: Andrew Morton: "Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling touse bmap to file blocks."
Next in thread: Andrew Morton: "Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling touse bmap to file blocks."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

(replying to bits of several emails)

On Friday May 12, akpm@xxxxxxxx wrote:
> Neil Brown <neilb@xxxxxxx> wrote:

> > However some IO requests cannot complete until the filesystem I/O
> > completes, so we need to be sure that the filesystem I/O won't block
> > waiting for memory, or fail with -ENOMEM.
>
> That sounds like a complex deadlock. Suppose the bitmap writeout requres
> some writeback to happen before it can get enough memory to proceed.
>

Exactly. Bitmap writeout must not block on fs-writeback. It can block
on device writeout (e.g. queue congestion or mempool exhaustion) but
it must complete without waiting in the fs layer or above, and without
the possibility of any error other -EIO. Otherwise we can get
deadlocked writing to the raid array. bh_submit (or bio_submit) is
certain to be safe in this respect. I'm not so confident about
anything at the fs level.

> > > Read it with O_DIRECT :(
> >
> > Which is exactly what the next release of mdadm does.
> > As the patch comment said:
> >
> > : With this approach the pagecache may contain data which is inconsistent with
> > : what is on disk. To alleviate the problems this can cause, md invalidates
> > : the pagecache when releasing the file. If the file is to be examined
> > : while the array is active (a non-critical but occasionally useful function),
> > : O_DIRECT io must be used. And new version of mdadm will have support for this.
>
> Which doesn't help `od -x' and is going to cause older mdadm userspace to
> mysteriously and subtly fail. Or does the user<->kernel interface have
> versioning which will prevent this?
>

As I said: 'non-critical'. Nothing important breaks if reading the
file gets old data. Reading the file while the array is active is
purely a curiosity thing. There is information in /proc/mdstat which
gives a fairly coarse view of the same data. It could lead to some
confusion, but if a compliant mdadm comes out before this gets into a
mainline kernel, I doubt there will be any significant issue.

Read/writing the bitmap needs to work reliably when the array is not
active, but suitable sync/invalidate calls in the kernel should make
that work perfectly.

I know this is technically a regression in user-space interface, and
you don't like such regression with good reason.... Maybe I could call
invalidate_inode_pages every few seconds or whenever the atime
changes, just to be on the safe side :-)

> > I have a patch which did that,
> > but decided that the possibility of kmalloc failure at awkward times
> > would make that not suitable.
>
> submit_bh() can and will allocate memory, although most decent device
> drivers should be OK.
>

submit_bh (like all decent device drivers) uses a mempool for memory
allocation so we can be sure that the delay in getting memory is
bounded by the delay for a few IO requests to complete, and we can be
sure the allocation won't fail. This is perfectly fine.

> >
> > I don't think a_ops really provides an interface that I can use, partly
> > because, as I said in a previous email, it isn't really a public
> > interface to a filesystem.
>
> It's publicer than bmap+submit_bh!
>

I don't know how you can say that.

bmap is so public that it is exported to userspace through an IOCTL
and is used by lilo (admitted only for reading, not writing). More
significantly it is used by swapfile which is a completely independent
subsystem from the filesystem. Contrast this with a_ops. The primary
users of a_ops are routines like generic_file_{read,write} and
friends. These are tools - library routines - that are used by
filesystems to implement their 'file_operations' which are the real
public interface. As far as these uses go, it is not a public
interface. Where a filesystem doesn't use some library routines, it
does not need to implement the matching functionality in the a_op
interface.

The other main user is the 'VM' which might try to flush out or
invalidate pages. However the VM isn't using this interface to
interact with files, but only to interact with pages, and it doesn't
much care what is done with the pages providing they get clean, or get
released, or whatever.

The way I perceive Linux design/development, active usage is far more
significant than documented design. If some feature of an interface
isn't being actively used - by in-kernel code - then you cannot be
sure that feature will be uniformly implemented, or that it won't
change subtly next week.

So when I went looking for the best way to get md/bitmap to write to a
file, I didn't just look at the interface specs (which are pretty
poorly documented anyway), I looked at existing code.
I can find 3 different parts of the kernel that write to a file.
They are
swap-file
loop
nfsd

nfsd uses vfs_read/vfs_write which have too many failure/delay modes
for me to safely use.
loop uses prepare_write/commit_write (if available) or f_op->write
(not vfs_write - I wonder why) which is not much better than what
nfsd uses. And as far as I can tell loop never actually flushes data to
storage (hope no-one thinks a journalling filesystem on loop is a
good idea) so it isn't a good model to follow.
[[If loop was a good model to follow, I would have rejected the
code for writing to a file in the first place, only accepted code
for writing to a device, and insisted on using loop to close the gap]]

That leaves swap-file as the only in-kernel user of a filesystem that
accesses the file in a way similar to what I need. Is it any surprise
that I chose to copy it?

> > >
> > > I haven't looked at this patch at all closely yet. Do I really need to?
> >
> > I assume you are asking that because you hope I will retract the
> > patch.
>
> Was kinda hoping that would be the outcome. It's rather gruesome, using
> set_fs()+vfs_read() on one side and submit_bh() on the other.
>

More gruesome than a_op->readpage on one side and bmap/submit_bio
on the other side like swapfile does? However if it makes you happy I
can change the read side to anything that you suggest - submit_bh
would be the 'obvious' choice I guess.

> Are you sure the handling at EOF for a non-multiple-of-PAGE_SIZE file
> is OK?

I think so yes, but I will review it.
I assume that if bmap gave me a non-zero block number for the last
(partial) block, then it is safe to write that whole block. I guess
that is a subtlety in semantics that no other user would be using.
However if a filesystem were packing the tail, then it would not be
able to return an integral block number for some fragments, so
hopefully it wouldn't for any.

>
> But if the pagecache is dirty then invalidate_inode_pages() will leave it
> dirty and the VM will later write it back, corrupting your bitmap file.
> You should get i_writecount, fsync the file and then run
> invalidate_inode_pages().

Well, as I am currently reading the file through the pagecache.... but
yes, I should put an fsync in there (and the invalidate before read is
fairly pointless. It is the invalidate after close that is important).

>
> You might as well use kernel_read() too, if you insist on begin oddball ;)

I didn't know about that one..

If you would feel more comfortable with kernel_read I would be more
than happy to use it.

In fact, if you can assure me that if I have an inode, and I have
confirmed that
bdi_cap_writeback_dirty(inode->i_mapping->backing_dev_info)
&& inode->i_mapping->a_ops->readpage != NULL

and have disabled truncation, either via i_write_count or otherwise,
then
- read_cache_page( .... ..a_ops->readpage ....)
will return a page that will, as long as I hold a counted reference,
remain in the page cache for that file at the appropriate address
(even though I don't hold it locked)
- set_page_dirty and write_one_page
will not fail, except with -EIO, and will not block waiting for
any writeback except for writeback of that file (e.g. won't do
any unprotected kmallocs)

then I'll be happy to go back to using that interface, though I guess I'd
have to figure out what to do about redirty_page_for_writepage calls...

But if you can assure me of the above, then I'd be curious as to why
swapfile doesn't use the readpage/writepage interface....

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: John Lenz: "Re: MMC drivers for 2.6 collie"
Previous message: Jon Smirl: "Re: Add a "enable" sysfs attribute to the pci devices to allow userspace (Xorg) to enable devices without doing foul direct access"
In reply to: Andrew Morton: "Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling touse bmap to file blocks."
Next in thread: Andrew Morton: "Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling touse bmap to file blocks."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]