Re: [PATCH] allocate page caches pages in round robin fashion
From: Ray Bryant
Date: Fri Aug 13 2004 - 11:33:26 EST
Dave Hansen wrote:
> On Thu, 2004-08-12 at 16:38, Jesse Barnes wrote:
> > On a NUMA machine, page cache pages should be spread out across the system
> > since they're generally global in nature and can eat up whole nodes' worth of
> > memory otherwise. This can end up hurting performance since jobs will have
> > to make off-node references for much or all of their non-file data.
> Wouldn't this be painful for any workload that accesses a unique set of
> files on each node? If an application knows that it is touching truly
> shared data which every node could possibly access, then it can use
> the NUMA API to cause round-robin allocations to occur.
I suppose it is possible for some workloads to be able to tell the difference
between a locally and globally allocated page cache page. It all depends on
the rate of access of data pages versus page cache pages.
For workloads that read in some data and then process that data for a very long
time (e.g. typical HPC workloads), it is more important to make sure those
data pages are allocated locally. The page cache pages are touched much
less frequently, so making them globally round-robin'd is a marginal
performance hit. What we are trying to avoid here is a node filling up with
page cache pages, which forces non-local allocations for those data pages,
and that is not a good thing [tm].
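To make the fill-up effect concrete, here is a toy Python model (not kernel code). It assumes a job on node 0 that first reads some file data and then allocates its data pages. The function name, the fixed pages-per-node figure, and the "fall back to the next-numbered node" rule are all invented for illustration.

```python
def off_node_data_pages(num_nodes, pages_per_node, cache_pages, data_pages,
                        round_robin_cache):
    """Toy model: count data pages of a node-0 job that land off node 0."""
    free = [pages_per_node] * num_nodes
    next_node = 0  # round-robin cursor, used only for page cache pages

    def take_local_first():
        # Prefer node 0 (where the job runs), then the next node with room.
        for node in range(num_nodes):
            if free[node] > 0:
                free[node] -= 1
                return node
        raise MemoryError("no free pages on any node")

    def take_round_robin():
        nonlocal next_node
        # Try each node once, starting at the cursor.
        for _ in range(num_nodes):
            node = next_node
            next_node = (next_node + 1) % num_nodes
            if free[node] > 0:
                free[node] -= 1
                return node
        raise MemoryError("no free pages on any node")

    # Phase 1: the job reads file data, which populates the page cache.
    for _ in range(cache_pages):
        if round_robin_cache:
            take_round_robin()
        else:
            take_local_first()

    # Phase 2: the job allocates its data pages, always local-first.
    off_node = 0
    for _ in range(data_pages):
        if take_local_first() != 0:
            off_node += 1
    return off_node
```

With 4 nodes of 100 pages each, 100 page cache pages and 50 data pages, the local-first case fills node 0 with page cache and pushes all 50 data pages off node, while the round-robin case leaves room for all of them on node 0.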
On the other hand, if your workload spends most of its time doing buffered
file writes to a set of pages that will comfortably fit on one node, then it is
important to have the page cache pages allocated locally. So I can see the
need for some program control of placement.
However, using the NUMA API to cause round-robin allocations to occur would
use the process level policy, right? So the same decision will be made on how
to allocate data pages and page cache pages? Might it not be possible that an
application would like its page cache pages allocated globally round-robin,
but it still wants its data pages allocated via MPOL_DEFAULT?
Perhaps what is needed is the ability to associate a mem_policy with the page
cache allocation (or, perhaps, more generally, a default "kernel storage
allocation policy" for storage that the kernel allocates on behalf of a
process). System admins could set the default according to overall workload
considerations, and, perhaps, we would allow processes with sufficient
privilege to set their own policy.
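As a rough sketch of what I mean, and only that: the Python below models a task that carries one policy for its data pages and an optional, separate one for page cache pages, falling back to an admin-set system default. None of these names exist in the kernel; `TaskPolicy`, `PolicyAllocator` and `pick_node` are hypothetical, and "local" and "interleave" only loosely stand in for MPOL_DEFAULT and MPOL_INTERLEAVE.

```python
from dataclasses import dataclass
from typing import Optional

LOCAL = "local"            # allocate on the node the task is running on
INTERLEAVE = "interleave"  # allocate round-robin across all nodes


@dataclass
class TaskPolicy:
    """Hypothetical per-task policies, one per kind of allocation."""
    data_policy: str = LOCAL
    # None means "use whatever the system-wide default says".
    pagecache_policy: Optional[str] = None


class PolicyAllocator:
    """Hypothetical node picker that consults the policy for the page type."""

    def __init__(self, num_nodes, system_pagecache_default=INTERLEAVE):
        self.num_nodes = num_nodes
        self.system_pagecache_default = system_pagecache_default
        self.next_node = 0  # shared round-robin cursor

    def pick_node(self, task, current_node, is_pagecache):
        if is_pagecache:
            policy = task.pagecache_policy or self.system_pagecache_default
        else:
            policy = task.data_policy
        if policy == INTERLEAVE:
            node = self.next_node
            self.next_node = (self.next_node + 1) % self.num_nodes
            return node
        return current_node
```

The point of the split is that a task can leave `data_policy` at local while its page cache pages are interleaved, which the single process-level policy cannot express.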
> Maybe a per-node watermark on page cache usage would be more useful.
> Once a node starts to get full, and it's past the watermark, we can go
> and shoot down some of the node's page cache. If the data access is
> truly global, then it has a good chance of being brought in on a
> different node.
I think this could be inefficient if the file access is truly global and the
file is large. (Think of a file that is significantly larger than the local
memory of any node.) Pages would be pulled into each node in turn as they are
accessed, then discarded as they go over the watermark, to be pulled in on
another node, etc. It would be better in this case just to round robin the
allocation on first access and be done with it.
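A toy Python model of that scenario, under heavy assumptions: tasks on successive nodes each scan the whole file once, a cached page can be hit from any node, and a node over its watermark drops its oldest page cache page (simple FIFO, not real reclaim). The function and its parameters are invented for illustration.

```python
from collections import OrderedDict


def disk_reads(num_nodes, watermark, file_pages, scans, round_robin):
    """Count disk reads when tasks on successive nodes each scan the file."""
    cached = {}  # page number -> node currently holding it
    per_node = [OrderedDict() for _ in range(num_nodes)]  # FIFO per node
    reads = 0
    rr_cursor = 0

    for scan in range(scans):
        reader = scan % num_nodes  # node the scanning task runs on
        for page in range(file_pages):
            if page in cached:
                continue  # page cache hit, possibly on a remote node
            reads += 1
            if round_robin:
                node = rr_cursor
                rr_cursor = (rr_cursor + 1) % num_nodes
            else:
                node = reader  # local allocation
            cached[page] = node
            per_node[node][page] = True
            if len(per_node[node]) > watermark:
                # Over the watermark: shoot down the oldest page on this node.
                victim, _ = per_node[node].popitem(last=False)
                del cached[victim]
    return reads
```

For 4 nodes, a watermark of 100 pages, a 300-page file and 4 scans, this model does 600 disk reads with local allocation plus watermark eviction, against 300 with round-robin allocation, where nothing is ever evicted.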
> -- Dave
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/