Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)
From: Peter Zijlstra
Date: Thu Sep 13 2007 - 15:25:21 EST
On Thu, 2007-09-13 at 11:32 -0700, Christoph Lameter wrote:
> On Thu, 13 Sep 2007, Peter Zijlstra wrote:
>
> >
> > > > Every user of memory relies on the VM, and we only get into trouble if
> > > > the VM in turn relies on one of these users. Traditionally that has only
> > > > been the block layer, and we special cased that using mempools and
> > > > PF_MEMALLOC.
> > > >
> > > > Why do you object to me doing a similar thing for networking?
> > >
> > > I have not seen you using mempools for the networking layer. I would not
> > > object to such a solution. It already exists for other subsystems.
> >
> > Dude, listen, how often do I have to say this: I cannot use mempools for
> > the network subsystem because it's built on kmalloc! What I've done is
> > build a replacement for mempools - a reserve system - that works
> > similarly to mempools but also provides the flexibility of kmalloc.
> >
> > That is all, no more, no less.
>
> It's different since it becomes a privileged player that can suck all
> the available memory out of the page allocator.
No, each reserve user comes with a bean counter that limits its usage.
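To make that concrete, the idea is roughly this (a stand-alone toy
sketch, not the actual patch code; the struct and function names are
made up):

/* Each reserve user carries a bean counter: pages taken from the
 * emergency pool are charged against it, and once the per-user limit
 * is reached further allocations are refused. */
#include <stdbool.h>
#include <stdio.h>

struct mem_reserve {
	long pages;	/* pages currently charged against this reserve */
	long limit;	/* maximum pages this user may take from the pool */
};

/* Try to charge one page; fail once the limit is hit. */
static bool reserve_charge(struct mem_reserve *res)
{
	if (res->pages >= res->limit)
		return false;	/* this user must back off, not dig deeper */
	res->pages++;
	return true;
}

static void reserve_uncharge(struct mem_reserve *res)
{
	if (res->pages > 0)
		res->pages--;	/* memory came back, room for new charges */
}

int main(void)
{
	struct mem_reserve net = { .pages = 0, .limit = 3 };

	for (int i = 0; i < 5; i++)
		printf("charge %d: %s\n", i,
		       reserve_charge(&net) ? "ok" : "refused");

	reserve_uncharge(&net);
	printf("after uncharge: %s\n",
	       reserve_charge(&net) ? "ok" : "refused");
	return 0;
}

So no single user can drain the shared pool; it gets cut off at its own
limit.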
> > I'm confused by this, I've never claimed part of, or such a thing. All
> > I'm saying is that because of the circular dependency between the VM and
> > the IO subsystem used for swap (not file backed paging [*], just swap)
> > you have to do something special to avoid deadlocks.
>
> How are dirty file backed pages different? They may also be written out
> by the VM during reclaim.
When you have dirty file-backed pages, the rest of memory can only
consist of clean file pages and/or anonymous pages, due to the dirty
limit. If you can guarantee that swap doesn't use memory (well, it does,
but it's PF_MEMALLOC memory that cannot be used by others) then you can
always free memory by dropping clean pages or swapping out, and thus
make progress for file-based writeback.
This is why the dirty page tracking made mmap over NFS usable.
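A toy way to see the arithmetic (numbers invented, nothing to do with
the real tunables):

/* With the dirty limit capping dirty file pages, whatever is left is
 * clean file pages or anonymous pages, both freeable without waiting
 * on file writeback (anon goes to swap, which runs from PF_MEMALLOC
 * reserves). */
#include <stdio.h>

int main(void)
{
	long total_pages     = 1000;
	long dirty_limit_pct = 40;	/* vm.dirty_ratio-style cap, made up */
	long max_dirty       = total_pages * dirty_limit_pct / 100;

	/* everything that is not dirty file-backed is immediately reclaimable */
	long min_reclaimable = total_pages - max_dirty;

	printf("at least %ld of %ld pages are clean or anonymous\n",
	       min_reclaimable, total_pages);
	return 0;
}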
> > > Replacing the mempools for the block layer sounds pretty good. But how do
> > > these various subsystems that may live in different portions of the system
> > > for various devices avoid global serialization and livelock through your
> > > system?
> >
> > The reserves are spread over all kernel mapped zones, the slab allocator
> > is still per cpu, the page allocator tries to get pages from the nearest
> > node.
>
> But it seems that you have unbounded allocations with PF_MEMALLOC now for
> the networking case? So networking can exhaust all reserves?
No, networking will bean-count all PF_MEMALLOC memory it receives, and
stop allocating once it hits its limit. It knows that when it has that
much memory outstanding, it's guaranteed memory will be freed soon.
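In code terms it looks roughly like this (a stand-alone toy, not the
actual networking patches; all the names and numbers are invented):

/* Every packet buffer taken from emergency memory is charged against
 * the network reserve; new packets are dropped once the limit is hit,
 * and the charge is returned when the memory is freed again. */
#include <stdio.h>
#include <stdlib.h>

static long net_reserve_used;			/* emergency bytes in flight */
static const long NET_RESERVE_LIMIT = 4096;	/* made-up cap */

static void *emergency_rx_alloc(size_t size)
{
	void *buf;

	if (net_reserve_used + (long)size > NET_RESERVE_LIMIT)
		return NULL;		/* drop the packet, don't dig deeper */

	buf = malloc(size);		/* stand-in for a reserve-backed kmalloc */
	if (buf)
		net_reserve_used += size;
	return buf;
}

static void emergency_rx_free(void *buf, size_t size)
{
	free(buf);
	net_reserve_used -= size;	/* freed memory makes room for new packets */
}

int main(void)
{
	void *a = emergency_rx_alloc(3000);
	void *b = emergency_rx_alloc(3000);	/* refused: would exceed the limit */

	printf("a=%p b=%p used=%ld\n", a, b, net_reserve_used);
	emergency_rx_free(a, 3000);
	printf("after free, used=%ld\n", net_reserve_used);
	return 0;
}

Once packets are processed or dropped their memory goes back, so
hitting the limit is only ever a temporary state.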
> > > And how is fairness addressed? I may want to run a fileserver on
> > > some nodes and an HPC application that relies on a Fibre Channel connection
> > > on other nodes. How do we guarantee that the HPC application is not
> > > impacted if the network services of the fileserver flood the system with
> > > messages and exhaust memory?
> >
> > The network system reserves A pages and the block layer reserves B pages;
> > once they start getting pages from the reserves they start bean counting,
> > and once they reach their respective limits they stop.
>
> That sounds good.
Ok, so next time I'll post the whole series again - I know some people
found it too much - but that way you can see the bean counter.