On Tue, Oct 06, 2015 at 01:33:21PM -0400, Waiman Long wrote:
> Well, I am still worried about the corner cases where the slowpath is
> triggered. I would like to make it perform better in those cases.

XFS trades off low overhead for fast path allocation with slowdowns
as we near ENOSPC in allocation routines. It gets harder to find
contiguous free space, files get more fragmented, IO takes longer
because we seek more, etc. Hence we accept that performance slows
down as the need for precision increases as we near ENOSPC.

So there is no problem that needs fixing.... ;)

> I don't think I am going to see the slowdown that I observed on
> larger filesystems with more free space.

Yes, it may be, but that does not mean we should optimise for it.
If you are doing filesystem scalability testing on small filesystems
near capacity, then your testing methodology needs fixing. Not
the code.

It's a pretty damn small slowdown in your somewhat extreme,
artificial test. Show me a real world production system that runs
small filesystems permanently at >99% filesystem capacity, and then
maybe we've got something that needs changing.

I'd suggest you retry your benchmark with larger filesystems, and
see what happens...

> I am thinking of making the following changes:
> [...]
> gauge how far it is from ENOSPC. So we don't really need to get
> the precise count as long as the number of CPUs is taken into
> consideration in the comparison.

I think you are looking in the wrong place. There is nothing
wrong with XFS doing two compares here. If we are hitting the
__percpu_counter_compare() slow path too much, then we should be
understanding exactly why that slow path is being hit so hard so
often. I don't see any analysis of the actual per-cpu counter
behaviour and why the slow path is being taken so often....
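
For reference, the two-compare logic in question looks roughly like
this; it is a paraphrase of lib/percpu_counter.c rather than the exact
source, so treat it as a sketch:

/*
 * Sketch of the existing compare path: the first compare uses the
 * cheap, possibly stale global count, and only when the result lies
 * within the worst-case per-cpu error (batch * num_online_cpus()) do
 * we fall back to the expensive precise sum.
 */
int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
{
	s64 count;

	count = percpu_counter_read(fbc);
	/* Check to see if rough count will be sufficient for comparison */
	if (abs(count - rhs) > (batch * num_online_cpus())) {
		if (count > rhs)
			return 1;
		return -1;
	}

	/* Need to use precise count */
	count = percpu_counter_sum(fbc);
	if (count > rhs)
		return 1;
	else if (count < rhs)
		return -1;
	else
		return 0;
}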

No. Please test the change to the per-cpu counters that I suggested:

/*
* Aggregate the per-cpu counter magazines back into the global
* counter. This avoids the need for repeated compare operations to
* run the slow path when the majority of the counter value is held
* in the per-cpu magazines. Folding them back into the global
* counter means we will continue to hit the fast
* percpu_counter_read() path until the counter value falls
* completely within the comparison limit passed to
* __percpu_counter_compare().
*/
static s64 percpu_counter_aggregate(struct percpu_counter *fbc)
{
	s64 ret;
	int cpu;
	unsigned long flags;

	raw_spin_lock_irqsave(&fbc->lock, flags);
	ret = fbc->count;
	for_each_online_cpu(cpu) {
		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);

		/* Fold this CPU's magazine into the global count. */
		ret += *pcount;
		*pcount = 0;
	}
	fbc->count = ret;
	raw_spin_unlock_irqrestore(&fbc->lock, flags);
	return ret;
}

> I don't think that will work as some other CPUs may change the
> percpu counter values between percpu_counter_aggregate() and
> __percpu_counter_compare(). To be safe, the precise count has to
> be computed whenever the comparison value difference is less than
> nr_cpus * batch size.

Well, yes. Why do you think the above function does the same thing
as percpu_counter_sum()? So that the percpu_counter_sum() call
*inside* __percpu_counter_compare() can be replaced by this call.
i.e.

 		return -1;
 	}
 	/* Need to use precise count */
-	count = percpu_counter_sum(fbc);
+	count = percpu_counter_aggregate(fbc);
 	if (count > rhs)
 		return 1;
 	else if (count < rhs)
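
To make the calling pattern concrete, here is a hypothetical caller
sketch (example_reserve_blocks() and its parameters are made up for
illustration, loosely modelled on a filesystem free-block check, not
taken from XFS):

/*
 * Reserve nblocks from a free-space counter, failing with ENOSPC if
 * the counter would go below zero.  The cheap percpu_counter_read()
 * based compare is used until the count gets within
 * batch * num_online_cpus() of the limit; only then does
 * __percpu_counter_compare() fall back to the precise slow path.
 */
static int example_reserve_blocks(struct percpu_counter *free_blocks,
				  s64 nblocks, s32 batch)
{
	/* Optimistically take the blocks out of the per-cpu magazines. */
	percpu_counter_sub(free_blocks, nblocks);

	/* Did we stay at or above the zero-free-space limit? */
	if (__percpu_counter_compare(free_blocks, 0, batch) >= 0)
		return 0;

	/* Overshot: give the blocks back and report no space. */
	percpu_counter_add(free_blocks, nblocks);
	return -ENOSPC;
}
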
Please think about what I'm saying rather than dismissing it without
first understanding my suggestions.