Re: PATCH 2.6.21-rc1 aoe: handle zero _count pages in bios
From: Andrew Morton
Date: Thu Mar 01 2007 - 22:23:23 EST
On Fri, 2 Mar 2007 02:29:19 +0000 Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> On Thu, Mar 01, 2007 at 05:42:04PM -0800, Andrew Morton wrote:
> > Something funny is going on here.
>
> Not so funny for those who've tried to sort out the issue over
> the past years and just got ignored..
>
> > Generally, one should increment the refcount of a page when it is put into
> > some container. That means that the page should get +1 when it is added to
> > a bio. (direct-io does this, but the mpage.c pagecache code cheats, and
> > relies upon PG_locked and PG-writeback protecting the page).
>
> It's a slab page, and slab pages aren't refcounted (which is a good thing
> as you don't own the whole page)
ah, I see.
> > Similarly, the network code (or its caller) should be incrementing the
> > page's refcount as the page goes into a container (ie: the skb) and
> > decrementing it as the page is removed.
> >
> > But someone somewhere is breaking those rules. Who?
>
> slab code.
Well I spose slab _could_ take a ref on these pages.
> > So. Who is breaking refcounting protocol here? Perhaps it is AOE, failing
> > to increment the refcount on pages as they are added to an skb?
> >
> > (Do we know which callsite in XFS is adding zero-ref pages to a BIO, btw?)
>
> For example all log I/O is done from kmalloce pages.
>
> Anyway, to rehash what I've been trying to get clarified for ages:
>
>
> (1) should we allow to pass slab pages into bios
>
> and
>
> (2) if yes what's the way lower layers are supposed to handle them
> for any possible refcounting operations like networking or rdma.
>
> There's also a pontial caller in ext3 that can send down kmalloc'ed
> buffers: journal_write_metadata_buffer() in need_copy_out && !done_copy_out
> case. But apparently that's an almost dead code path as I've never
> seen anyone tripping this one, it's always XFS that people report.
OK. Let's go through it.
Networking internally maintains caller memory lifetimes, and it assumes
that the caller allocated memory via __alloc_pages() - because it uses
get_page() and put_page().
BIO, however, does not internally manage caller memory lifetime. This is
because the caller's ->bi_end_io is always called, so the caller can do it.
So where we've come unstuck is in a module which has gone and fed BIO
memory into networking. The differing design philosophies are clashing.
I'm surprised this doesn't happen in other places - aren't there any other
drivers which take a BIO and stuff it down the network?
Anyway, where's the bug?
Really, I'd say it's XFS (and ext3). Even though BIO doesn't presently
manage page lifetimes, it _could_. After all, the function is called
bio_add_page(), not bio_add_virtual_address(). It's a bit hacky to kmalloc
some memory, run virt_to_page() and to then present that page to BIO even
though the caller (thanks to the slab optimisation) doesn't actually have
control of that page's lifetime.
So we have a few options to look at:
a) kludge things in AOE. Unpleasing, and might cause memory leaks
(although it won't, because the caller hasn't run bi_end_io yet).
b) Take a ref on slab pages in slab. A bit costly, perhaps.
c) teach ext3 and XFS to take a ref on these pages as they are added to
the BIOs, undo that ref in bi_end_io.
I think c)?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/