Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
From: Dave Kleikamp
Date: Tue Nov 09 2004 - 09:21:21 EST
Andrew & Andrea,
What is the status of this patch? It would be nice to have it in the
-mm4 kernel.
Thanks,
Shaggy
On Fri, 2004-10-22 at 18:24, Andrew Morton wrote:
> Andrea Arcangeli <andrea@xxxxxxxxxx> wrote:
> >
> > On Thu, Oct 21, 2004 at 07:03:20PM -0700, Andrew Morton wrote:
> > > Yeah. I think the only 100% reliable way to do this is:
> > >
> > > lock_page()
> > > unmap_mapping_range(page)
> > > ClearPageUptodate(page);
> > > invalidate(page); /* Try to free the thing */
> > > unlock_page(page);
> > >
> > > (Can do it for a whole range of pages if we always agree to lock pages in
> > > file-index-ascending order).
> > >
> > > but hrm, we don't even have the locking there to prevent do_no_page() from
> > > reinstantiating the pte immediately after the unmap_mapping_range().
> > >
> > > So what to do?
> > >
> > > - The patch you originally sent has a race window which can be made
> > > nonfatal by removing the BUGcheck in mpage_writepage().
> > >
> > > - NFS should probably be taught to use unmap_mapping_range() regardless
> > > of what we do, so the existing-mmaps-hold-stale-data problem gets fixed
> > > up.
> > >
> > > - invalidate_inode_pages2() isn't successfully invalidating the page if
> > > it has buffers.
> >
> > I did it right in 2.4, somebody obviously broken the thing in 2.6. Now
> > w.r.t. to ext2/ext3 I never heard a problem, and we nuke buffers
> > regardless if the page is mapped or not (so this isn't a corner case).
> >
> > define block_invalidate_page(page) discard_bh_page(page, 0, 0)
> >
> > } else {
> > if (page->buffers) {
> > /* Restart after this page */
> > list_del(head);
> > list_add_tail(head, curr);
> >
> > page_cache_get(page);
> > spin_unlock(&pagecache_lock);
> > block_invalidate_page(page);
> > } else
> > unlocked = 0;
> >
> > ClearPageDirty(page);
>
> This is incorrect for ext3 - we have to go through the a_ops to invalidate
> the buffer_heads. That's bad news for my (unmerged) 2.4 direct-io-for-ext3
> patch.
>
> > The guarantee is total in 2.4 w.r.t. buffers, as worse it can generate
> > an hearth attack to ext3 but it won't risk to lose coherency. Only 2.6
> > broke the O_DIRECT vs buffered I/O coherency protocol.
>
> Only for mmapped pages. I believe that normal pagecache-vs-directIO is OK
> in 2.6.
>
> > The semantics of not uptodate pages mapped in the address space are as
> > well understood and working fine in 2.4.
>
> I'd prefer to continue to disallow the mapping of nonuptodate pages into
> process address space. It's indistinguishable from a security hole and I
> think we can indeed preserve this invariant while fixing the mmap coherency
> problem.
>
> > > This is the biggest problem, because the pages can trivially have
> > > buffers - just write()ing to them will attach buffers, if they're ext2-
> > > or ext3-backed.
> > >
> > > It means that a direct-IO write to a section of a file which is mmapped
> > > causes pagecache and disk to get out of sync. Question is: do we care,
> > > or do we say "don't do that"? Nobody seems to have noticed thus far and
> > > it's a pretty dopey thing to be doing.
> >
> > this is a supported feature. People is writing with O_DIRECT and reading
> > with buffered read and we must support that (think a database writing
> > and a tar.gz reading). The coherency is a must after the O_DIRECT has
> > completed. If people runs the operations at the same time the result is
> > undefined. But by serializing the I/O at the userspace level (like
> > stop_db(); tar.gz) we must provide guarantees to them.
>
> If the page is not mapped into pagetables then we'll invalidate it OK even
> if it's backed by ext3. The only hole I can see is if someone comes in and
> dirties the page via write() _after_ generic_file_direct_IO() has done its
> filemap_write_and_wait().
>
> And I agree that in that case we should propagate the
> invalidate_complete_page() failure back to the diret-io write() caller. I wonder
> how - via a short write, or via -EFOO?
>
> Probably we should sync all the ptes prior to a direct-io read as well.
>
> > > If we _do_ want to fix it, it seems the simplest approach would be to
> > > nuke the pte's in invalidate_inode_pages2(). And we'd need some "oh we
> > > raced - try again" loop in there to handle the "do_no_page()
> > > reinstantiated a pte" race.
> >
> > the do_no_page don't need to be coherent. The pte is the last thing we
> > can care about. The API only guarantees read/write syscalls to be
> > coherent, mmap not.
>
> I want to fix the mmap case too. We currently have a half-fix for NFS
> which doesn't work at all for (say) ext3.
>
> > the only bug is the bh and it has to be fixed as easily as I did in 2.4.
>
> No, we need to call into the filesystem to kill the buffer_heads.
>
> > > Something like this? Slow as a dog, but possibly correct ;)
> > >
> > >
> > > void invalidate_inode_pages2(struct address_space *mapping)
> > > {
> > > struct pagevec pvec;
> > > pgoff_t next = 0;
> > > int i;
> > >
> > > pagevec_init(&pvec, 0);
> > > while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> > > for (i = 0; i < pagevec_count(&pvec); i++) {
> > > struct page *page = pvec.pages[i];
> > >
> > > lock_page(page);
> > > if (page->mapping == mapping) { /* truncate race? */
> > > wait_on_page_writeback(page);
> > > next = page->index + 1;
> > > while (page_mapped(page)) {
> > > unmap_mapping_range(mapping,
> > > page->index << PAGE_CACHE_SHIFT,
> > > page->index << PAGE_CACHE_SHIFT+1,
> > > 0);
> > > clear_page_dirty(page);
> > > }
> > > invalidate_complete_page(mapping, page);
> > > }
> > > unlock_page(page);
> > > }
> > > pagevec_release(&pvec);
> > > cond_resched();
> > > }
> > > }
> >
> > You know the page_mapped path isn't really very important as far as we
> > execute both clear_page_dirty and ClearPageUptodate.
>
> It's important for NFS - it will allow mmaps to see server-side updates.
>
> > We must provide the
> > guarantee to the syscall. Leaving not updoate pages in the ptes is
> > fine, the VM can deal with that. the semantics of that condition is that
> > the page changed on disk due O_DIRECT from under us, so the page isn't
> > uptodate anymore. If something we can argue about ext3 not being capable
> > of dealing with a PG_update going away from under it, but the VM is
> > capable of that, there is only one spot that must be aware about it
> > (zap fix we posted to start the thread).
> >
> > BTW, we have these fixes in our tree to make it work:
> >
> > --- sles/mm/truncate.c.~1~ 2004-05-18 19:24:40.000000000 +0200
> > +++ sles/mm/truncate.c 2004-05-19 02:09:28.311781864 +0200
> > @@ -260,9 +260,10 @@ void invalidate_inode_pages2(struct addr
> > if (page->mapping == mapping) { /* truncate race? */
> > wait_on_page_writeback(page);
> > next = page->index + 1;
> > - if (page_mapped(page))
> > + if (page_mapped(page)) {
> > + ClearPageUptodate(page);
> > clear_page_dirty(page);
> > - else
> > + } else
> > invalidate_complete_page(mapping, page);
> > }
> > unlock_page(page);
>
> That's in 2.6.9.
>
> > and then this fix for BLKFLSBUF:
> >
> > diff -urNp linux-2.6.8/mm/truncate.c linux-2.6.8.SUSE/mm/truncate.c
> > --- linux-2.6.8/mm/truncate.c 2004-09-21 15:39:37.850379202 +0200
> > +++ linux-2.6.8.SUSE/mm/truncate.c 2004-09-21 15:39:54.733668309 +0200
> > @@ -86,6 +86,12 @@ invalidate_complete_page(struct address_
> > write_unlock_irq(&mapping->tree_lock);
> > return 0;
> > }
> > +
> > + BUG_ON(PagePrivate(page));
> > + if (page_count(page) != 2) {
> > + write_unlock_irq(&mapping->tree_lock);
> > + return 0;
> > + }
> > __remove_from_page_cache(page);
> > write_unlock_irq(&mapping->tree_lock);
> > ClearPageUptodate(page);
> > @@ -306,7 +312,11 @@ void invalidate_inode_pages2(struct addr
> > clear_page_dirty(page);
> > ClearPageUptodate(page);
> > } else {
> > - invalidate_complete_page(mapping, page);
> > + if (!invalidate_complete_page(mapping,
> > + page)) {
> > + clear_page_dirty(page);
> > + ClearPageUptodate(page);
> > + }
> > }
> > }
> > unlock_page(page);
> >
>
> That's in 2.6.9 as well. I need to think about this a bit more - the first
> hunk is for read() versus BLKFLSBUF (if someone else is playing with the
> page, leave it alone) whereas the second hunk is kinda buggy, because it
> will reintroduce the EIO-in-read() error which the first hunk tried to fix.
>
> (You're using an rwlock for tree_lock. It's still unclear to me whether we
> should go that way - it helps big SMP and hurts small SMP. Did you perform
> an evaluation of this?)
>
> >
> > apparently those weren't submitted to mainline yet, that could have
> > created some confusion on why we submitted the third patch that avoids
> > the BUG in pdflush and it avoids to try writing to disk a page that is
> > not uptodate anymore. Do you see why it's wrong to pass down to pdflush
> > a page that isn't utpodate?
>
> Sure. That's why there's a BUG() there ;)
>
> > That page must be re-read, it sure must not
> > be written out. The semantics of mapped pages not uptodate are very well
> > defined and they're pose no risk to the VM (modulo the bug in pdflush
> > and this clear improvement in zap).
> >
> > > The unmapping from pagetables means that invalidate_complete_page() will
> > > generally successfully remove the page from pagecache. That mostly fixes
> > > the direct-write-of-mmapped-data coherency problem: the pages simply aren't
> > > in pagecache any more so we'll surely redo physical I/O.
> >
> > the direct-write-of-mmapped-data coherency is the last thing we can care
> > about.
>
> But we can fix it.
>
> > Even before thinking about direct-write-of-mmapped-data we must
> > fix direct-write-of-cached-data (assuming it's not mmapped cache). That
> > is a showstopper bug,
>
> I don't think we have a bug here. After the sync(), ext3_releasepage()
> will be able to release the buffers. But we should check for failures.
>
> The problem we're seeing is with mmapped pages. In that case, clearing
> PG_uptodate is insufficient and directly invalidating the buffer_heads will
> probably kill ext3 and we're not allowed to assume that page->private
> contains bh's anyway...
>
> Probably it is sufficient to run try_to_release_page() in the page_mapped
> leg of invalidate_inode_pages2(), and to check the return value.
>
> > the direct-write-of-mmapped-data is "undefined" by
> > the linux API. Garbage can end up in the pagecache as far as the kernel
> > is concerned (all we guarantee is that no random page contents are
> > delivered to userspace).
> >
> > > But it's not 100% reliable because ->invalidatepage isn't 100% reliable and
> > > it really sucks that we're offering behaviour which only works most of the
> > > time :(
> >
> > That don't need to ever work, I don't think we even need to add a pte
> > shootdown, the behviour has never been provided by design. All we need
> > to do is to fix the bh layer, and fixup the spots in zap to stop bugging
> > out on dirty not uptodate pages, plus the above patches in this email,
> > then it'll work as well as 2.4 and it will avoid to break apps. Note
> > that this is a very serious matter, what happens is that oracle writes
> > with O_DIRECT and when you do a tar.gz of the database the last block
> > may be corrupt if it's not 512byte aligned if it was a partial page and
> > it had partial bh. mysql uses O_DIRECT too.
>
> The current buffered-vs-direct handling looks OK, as long as the pages
> aren't mmapped, and as long as nobody dirties a page after
> generic_file_direct_IO() has done filemap_write_and_wait().
>
> For the former I'd like to unmap the pages. For the latter we should check
> the invalidate_complete_page() return value.
>
> > I'm not particularly
> > concerned about ext3, that can be fixed too, or we can add a new bitflag
> > as you suggested to avoid fixing the filesystems to deal with that. But
> > for the VM I'm confortable clearing the uptodate page is much simpler
> > and it poses no risk to the VM. For the fs you may be very well right
> > that the BH_reread_from_disk might be needed instead.
> >
> > So let's forget the mmapped case and let's fix the bh screwup in 2.6.
> > Then we can think at the mmapped case.
>
> The "bh screwup" _is_ the mmapped case. Maybe I said something different
> yesterday - it's been a while since I looked in there.
>
> > And to fix the mmapped case IMHO all we have to do is:
> >
> > clear_page_dirty
> > ClearPageUptodate
> > unmap_page_range
> >
> > Not the other way around. The uptodate bit must be clared before zapping
> > the pte to force a reload.
>
> Something like this...
>
>
>
>
> - When invalidating pages, take care to shoot down any ptes which map them
> as well.
>
> This ensures that the next mmap access to the page will generate a major
> fault, so NFS's server-side modifications are picked up.
>
> This also allows us to call invalidate_complete_page() on all pages, so
> filesytems such as ext3 get a chance to invalidate the buffer_heads.
>
> - Don't mark in-pagetable pages as non-uptodate any more. That broke a
> previous guarantee that mapped-into-user-process pages are always uptodate.
>
> - Check the return value of invalidate_complete_page(). It can fail if
> someone redirties a page after generic_file_direct_IO() write it back.
>
> But we still have a problem. If invalidate_inode_pages2() calls
> unmap_mapping_range(), that can cause zap_pte_range() to dirty the pagecache
> pages. That will redirty the page's buffers and will cause
> invalidate_complete_page() to fail.
>
> So, in generic_file_direct_IO() we do a complete pte shootdown on the file
> up-front, prior to writing back dirty pagecache. This is only done for
> O_DIRECT writes. It _could_ be done for O_DIRECT reads too, providing full
> mmap-vs-direct-IO coherency for both O_DIRECT reads and O_DIRECT writes, but
> permitting the pte shootdown on O_DIRECT reads trivially allows people to nuke
> other people's mapped pagecache.
>
> NFS also uses invalidate_inode_pages2() for handling server-side modification
> notifications. But in the NFS case the clear_page_dirty() in
> invalidate_inode_pages2() is sufficient, because NFS doesn't have to worry
> about the "dirty buffers against a clean page" problem. (I think)
>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxx>
> ---
>
> 25-akpm/include/linux/fs.h | 2 -
> 25-akpm/mm/filemap.c | 19 ++++++++++---
> 25-akpm/mm/truncate.c | 63 +++++++++++++++++++++++++++------------------
> 3 files changed, 55 insertions(+), 29 deletions(-)
>
> diff -puN mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix mm/truncate.c
> --- 25/mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:40:26 2004
> +++ 25-akpm/mm/truncate.c Fri Oct 22 16:05:26 2004
> @@ -65,6 +65,8 @@ truncate_complete_page(struct address_sp
> * be marked dirty at any time too. So we re-check the dirtiness inside
> * ->tree_lock. That provides exclusion against the __set_page_dirty
> * functions.
> + *
> + * Returns non-zero if the page was successfully invalidated.
> */
> static int
> invalidate_complete_page(struct address_space *mapping, struct page *page)
> @@ -281,50 +283,63 @@ unsigned long invalidate_inode_pages(str
> EXPORT_SYMBOL(invalidate_inode_pages);
>
> /**
> - * invalidate_inode_pages2 - remove all unmapped pages from an address_space
> + * invalidate_inode_pages2 - remove all pages from an address_space
> * @mapping - the address_space
> *
> - * invalidate_inode_pages2() is like truncate_inode_pages(), except for the case
> - * where the page is seen to be mapped into process pagetables. In that case,
> - * the page is marked clean but is left attached to its address_space.
> - *
> - * The page is also marked not uptodate so that a subsequent pagefault will
> - * perform I/O to bringthe page's contents back into sync with its backing
> - * store.
> + * Any pages which are found to be mapped into pagetables are unmapped prior to
> + * invalidation.
> *
> - * FIXME: invalidate_inode_pages2() is probably trivially livelockable.
> + * Returns -EIO if any pages could not be invalidated.
> */
> -void invalidate_inode_pages2(struct address_space *mapping)
> +int invalidate_inode_pages2(struct address_space *mapping)
> {
> struct pagevec pvec;
> pgoff_t next = 0;
> int i;
> + int ret = 0;
> + int did_full_unmap = 0;
>
> pagevec_init(&pvec, 0);
> - while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> - for (i = 0; i < pagevec_count(&pvec); i++) {
> + while (!ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> + for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
>
> lock_page(page);
> - if (page->mapping == mapping) { /* truncate race? */
> - wait_on_page_writeback(page);
> - next = page->index + 1;
> - if (page_mapped(page)) {
> - clear_page_dirty(page);
> - ClearPageUptodate(page);
> + if (page->mapping != mapping) { /* truncate race? */
> + unlock_page(page);
> + continue;
> + }
> + wait_on_page_writeback(page);
> + next = page->index + 1;
> + while (page_mapped(page)) {
> + if (!did_full_unmap) {
> + /*
> + * Zap the rest of the file in one hit.
> + * FIXME: invalidate_inode_pages2()
> + * should take start/end offsets.
> + */
> + unmap_mapping_range(mapping,
> + page->index << PAGE_CACHE_SHIFT,
> + -1, 0);
> + did_full_unmap = 1;
> } else {
> - if (!invalidate_complete_page(mapping,
> - page)) {
> - clear_page_dirty(page);
> - ClearPageUptodate(page);
> - }
> + /*
> + * Just zap this page
> + */
> + unmap_mapping_range(mapping,
> + page->index << PAGE_CACHE_SHIFT,
> + (page->index << PAGE_CACHE_SHIFT)+1,
> + 0);
> }
> }
> + clear_page_dirty(page);
> + if (!invalidate_complete_page(mapping, page))
> + ret = -EIO;
> unlock_page(page);
> }
> pagevec_release(&pvec);
> cond_resched();
> }
> + return ret;
> }
> -
> EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
> diff -puN include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix include/linux/fs.h
> --- 25/include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:54:50 2004
> +++ 25-akpm/include/linux/fs.h Fri Oct 22 15:54:58 2004
> @@ -1404,7 +1404,7 @@ static inline void invalidate_remote_ino
> S_ISLNK(inode->i_mode))
> invalidate_inode_pages(inode->i_mapping);
> }
> -extern void invalidate_inode_pages2(struct address_space *mapping);
> +extern int invalidate_inode_pages2(struct address_space *mapping);
> extern void write_inode_now(struct inode *, int);
> extern int filemap_fdatawrite(struct address_space *);
> extern int filemap_flush(struct address_space *);
> diff -puN mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix mm/filemap.c
> --- 25/mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:55:06 2004
> +++ 25-akpm/mm/filemap.c Fri Oct 22 16:17:19 2004
> @@ -2214,7 +2214,8 @@ ssize_t generic_file_writev(struct file
> EXPORT_SYMBOL(generic_file_writev);
>
> /*
> - * Called under i_sem for writes to S_ISREG files
> + * Called under i_sem for writes to S_ISREG files. Returns -EIO if something
> + * went wrong during pagecache shootdown.
> */
> ssize_t
> generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
> @@ -2224,14 +2225,24 @@ generic_file_direct_IO(int rw, struct ki
> struct address_space *mapping = file->f_mapping;
> ssize_t retval;
>
> + /*
> + * If it's a write, unmap all mmappings of the file up-front. This
> + * will cause any pte dirty bits to be propagated into the pageframes
> + * for the subsequent filemap_write_and_wait().
> + */
> + if (rw == WRITE && mapping_mapped(mapping))
> + unmap_mapping_range(mapping, 0, -1, 0);
> +
> retval = filemap_write_and_wait(mapping);
> if (retval == 0) {
> retval = mapping->a_ops->direct_IO(rw, iocb, iov,
> offset, nr_segs);
> - if (rw == WRITE && mapping->nrpages)
> - invalidate_inode_pages2(mapping);
> + if (rw == WRITE && mapping->nrpages) {
> + int err = invalidate_inode_pages2(mapping);
> + if (err)
> + retval = err;
> + }
> }
> return retval;
> }
> -
> EXPORT_SYMBOL_GPL(generic_file_direct_IO);
> _
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/