Re: [RFC 0/8] Cpuset aware writeback
From: Andrew Morton
Date: Tue Jan 16 2007 - 20:09:03 EST
> On Tue, 16 Jan 2007 16:16:30 -0800 (PST) Christoph Lameter <clameter@xxxxxxx> wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
>
> > It's a workaround for a still-unfixed NFS problem.
>
> No, it's doing proper throttling. Without this patchset there will be *no*
> writeback and throttling at all. E.g. let's say we have 20 nodes of 1G each
> and a cpuset that only spans one node.
>
> Then a process running in that cpuset can dirty all of memory and still
> continue running without writeback continuing. The background dirty ratio
> is at 10% and the dirty ratio at 40%. Neither of those boundaries can ever
> be reached because the process will only ever be able to dirty memory on
> one node which is 5%. There will be no throttling, no background
> writeback, no blocking for dirty pages.
>
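[Illustration, not part of the original exchange: a minimal sketch of the arithmetic in the quoted example. It assumes equally sized nodes and that the 10% and 40% thresholds are measured against total system memory, as the quoted text describes. The function names are hypothetical and are not kernel APIs.]

```python
# Illustrative arithmetic only; these names are hypothetical, not kernel APIs.
# Assumption: dirty thresholds are percentages of total system memory,
# and all nodes are the same size.

def max_dirty_percent(nodes_total, nodes_in_cpuset):
    """Largest share of system memory a task confined to the cpuset can dirty."""
    return 100.0 * nodes_in_cpuset / nodes_total


def thresholds_reachable(nodes_total, nodes_in_cpuset,
                         background_ratio=10, dirty_ratio=40):
    """Return (background_writeback_starts, writer_is_throttled)."""
    pct = max_dirty_percent(nodes_total, nodes_in_cpuset)
    return pct >= background_ratio, pct >= dirty_ratio


# The quoted scenario: 20 nodes, cpuset spanning one node.
print(max_dirty_percent(20, 1))       # 5.0
print(thresholds_reachable(20, 1))    # (False, False)
```

[Under these assumptions a single-node cpuset tops out at 5% of system memory, so neither global threshold is ever crossed.]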
> At some point we run into reclaim (possibly we have ~99% of the cpuset
> dirty) and then we trigger writeout. Okay so if the filesystem / block
> device is robust enough and does not require memory allocations then we
> likely will survive that and do slow writeback page by page from the LRU.
>
> Writeback is completely hosed for that situation. This patch restores
> expected behavior in a cpuset (which is a form of system partition that
> should mirror the system as a whole). At 10% dirty we should start
> background writeback and at 40% we should block. If that is done then even
> fragile combinations of filesystem/block devices will work as they do
> without cpusets.
Nope. You've completely omitted the little fact that we'll do writeback in
the offending zone off the LRU. Slower, maybe. But it should work and the
system should recover. If it's not doing that (it isn't) then we should
fix it rather than avoiding it (by punting writeback over to pdflush).
Once that's fixed, if we determine that there are remaining and significant
performance issues then we can take a look at that.
>
> > > Yes we can fix these allocations by allowing processes to allocate from
> > > other nodes. But then the container function of cpusets is no longer
> > > there.
> > But that's what your patch already does!
>
> The patchset does not allow processes to allocate from other nodes than
> the current cpuset.
Yes it does. It asks pdflush to perform writeback of the offending zone(s)
rather than (or as well as) doing it directly. The only reason pdflush can
successfully do that is because pdflush can allocate its requests from other
zones.
>
> AFAIK any filesystem/block device can go oom with the current broken
> writeback if it just does a few allocations. It's a matter of hitting the
> sweet spots.
That shouldn't be possible, in theory. Block IO is supposed to succeed if
*all memory in the machine is dirty*: the old
dirty-everything-with-MAP_SHARED-then-exit problem. Lots of testing went
into that and it works. It also failed on NFS although I thought that got
"fixed" a year or so ago. Apparently not.
> > But we also can get into trouble if a *zone* is all-dirty. Any solution to
> > the cpuset problem should solve that problem too, no?
>
> Nope. Why would a dirty zone pose a problem? The problem exists if you
> cannot allocate more memory.
Well one example would be a GFP_KERNEL allocation on a highmem machine in
which all of ZONE_NORMAL is dirty.
> If a cpuset contains a single node which is a
> single zone then this patchset will also address that issue.
>
> If we have multiple zones then other zones may still provide memory to
> continue (same as in UP).
Not if all the eligible zones are all-dirty.
> > > Yes, but when we enter reclaim most of the pages of a zone may already be
> > > dirty/writeback so we fail.
> >
> > No. If the dirty limits become per-zone then no zone will ever have >40%
> > dirty.
>
> I am still confused as to why you would want per zone dirty limits?
The need for that has yet to be demonstrated. There _might_ be a problem,
but we need test cases and analyses to demonstrate that need.
Right now, what we have is an NFS bug. How about we fix it, then
reevaluate the situation?
A good starting point would be to show us one of these oom-killer traces.
> Let's say we have a cpuset with 4 nodes (thus 4 zones) and we are running
> on the first node. Then we copy a large file to disk. Node local
> allocation means that we allocate from the first node. After we reach 40%
> of the node then we throttle? This is going to be a significant
> performance degradation since we can no longer use the memory of other
> nodes to buffer writeout.
That was what I was referring to.