On Mon, Oct 05, 2015 at 07:02:21PM -0400, Waiman Long wrote:
....
> > Having less than 1GB of free space in an XFS filesystem is
> > considered to be "almost ENOSPC" - when you have TB to PB of space,
> > less than 1GB really is "moments before ENOSPC".
>
> We have systems with more than 500 CPUs (HT on). I think SGI has
> systems with thousands of CPUs. For those large systems, the slowpath
> will be triggered if there is less than 4G or 10G of free space on
> those thousand-CPU systems.

Yes, I'm well aware of this. But systems with hundreds to thousands
of CPUs simply do not operate their storage at this capacity.
They'll have hundreds of TB or PBs of storage attached, so if we
trigger the slow path at 10GB of free space, we are talking about
having already used >99.9% of that capacity.

In which case, they are already in a world of pain because
filesystem allocation performance starts to degrade at >90%
capacity, we start cutting back preallocations at >95% capacity,
and we really start to throttle space allocations to their
minimum possible sizes at >99% capacity. IOWs, hitting this slow
path at >99.9% capacity is really irrelevant....
> What I am trying to do with my patch is to reduce the
> performance overhead in those cases. I have no worry for systems
> that have only a few CPUs. In essence, the per-cpu counter code
> doesn't scale well for systems with a large number of CPUs.

Maybe so, but we don't tend to optimise slow paths - we trade off a
really fast fast path for a slow, more complex slow path all over
the place. Not just in XFS, but all over the kernel.
> > XFS trades off low overhead for fast path allocation with slowdowns
> > as we near ENOSPC in allocation routines. It gets harder to find
> > contiguous free space, files get more fragmented, IO takes longer
> > because we seek more, etc. Hence we accept that performance slows
> > down as the need for precision increases as we near ENOSPC.
> >
> > I'd suggest you retry your benchmark with larger filesystems, and
> > see what happens...
>
> I don't think I am going to see the slowdown that I observed on
> larger filesystems with more free space.

So there is no problem that needs fixing.... ;)
> However, I still think that doing 2 precise count computations is
> wasteful.

I really don't care about the CPU overhead, because it's far more
important that:

	1) the zero threshold detection is precise and correct;
	2) the fast path is really fast; and
	3) I understand the code well enough to be able to debug
	   and maintain it.
> I am planning to rework my patch to disable the precise count for the
> first comparison in xfs_mod_fdblocks, as that comparison is used to
> gauge how far it is from ENOSPC. So we don't really need to get
> the precise count as long as the number of CPUs is taken into
> consideration in the comparison.

I think you are looking in the wrong place. There is nothing
wrong with XFS doing two compares here. If we are hitting the
__percpu_counter_compare() slow path too much, then we should be
understanding exactly why that slow path is being hit so hard so
often. I don't see any analysis of the actual per-cpu counter
behaviour and why the slow path is being taken so often....
Indeed, have you considered using something like this in the precise
path of __percpu_counter_compare() rather than percpu_counter_sum():
/*
 * Aggregate the per-cpu counter magazines back into the global
 * counter. This avoids the need for repeated compare operations to
 * run the slow path when the majority of the counter value is held
 * in the per-cpu magazines. Folding them back into the global
 * counter means we will continue to hit the fast
 * percpu_counter_read() path until the counter value falls
 * completely within the comparison limit passed to
 * __percpu_counter_compare().
 */
static s64 percpu_counter_aggregate(struct percpu_counter *fbc)
{
	s64 ret;
	int cpu;
	unsigned long flags;

	raw_spin_lock_irqsave(&fbc->lock, flags);
	ret = fbc->count;
	for_each_online_cpu(cpu) {
		s32 count = *per_cpu_ptr(fbc->counters, cpu);
		ret += count;
		*per_cpu_ptr(fbc->counters, cpu) -= count;
	}
	fbc->count = ret;
	raw_spin_unlock_irqrestore(&fbc->lock, flags);
	return ret;
}
Some perspective: you wouldn't have seen this behaviour with the
previous per-cpu counter code in XFS near ENOSPC. By the time it got
this close to ENOSPC it was completely serialising all access to the
free space counters with a mutex and then doing per-cpu sums under
that mutex (see commit 20b6428 ("[XFS] Reduction global superblock
lock contention near ENOSPC.")). Hence it wouldn't have appeared in
your profiles, even though it was much worse in terms of contention
and lock hold times than the current code is.
This looks to be the same fundamental problem - the per-cpu counter
values are not being managed in a way that minimises precise
comparison overhead. Making the above change will tell us whether
this is the case or not...