Re: [PATCH] percpu_counter: return precise count from __percpu_counter_compare()

From: Waiman Long
Date: Wed Oct 07 2015 - 16:00:51 EST

Next message: Arnd Bergmann: "Re: [PATCH] ARM: cns3xxx: pci: avoid potential stack overflow"
Previous message: Nishanth Menon: "Re: [PATCH] ARM: configs: keystone: Add ramdisk support"
In reply to: Dave Chinner: "Re: [PATCH] percpu_counter: return precise count from __percpu_counter_compare()"
Next in thread: Dave Chinner: "Re: [PATCH] percpu_counter: return precise count from __percpu_counter_compare()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 10/06/2015 05:30 PM, Dave Chinner wrote:

On Tue, Oct 06, 2015 at 01:33:21PM -0400, Waiman Long wrote:
Yes, it may be, but that does not mean we should optimise for it.
If you are doing filesystem scalability testing on small filesystems
near capacity, then your testing methodology is needs fixing. Not
the code.

XFS trades off low overhead for fast path allocation with slowdowns
as we near ENOSPC in allocation routines. It gets harder to find
contiguous free space, files get more fragmented, IO takes longer
because we seek more, etc. Hence we accept that performance slows
down as as the need for precision increases as we near ENOSPC.

I'd suggest you retry your benchmark with larger filesystems, and
see what happens...

I don't think I am going to see the slowdown that I observed on
larger filesystems with more free space.

So there is no problem that needs fixing.... ;)

Well, I am still worrying that corner cases when the slowpath is
triggered. I would like to make it perform better in those cases.

It's a pretty damn small slowdown in your somewhat extreme,
artificial test. Show me a real world production system that runs
small fileystems permanently at>99% filesystem capacity, and them
maybe vwe've got something that needs changing.

gauge how far it is from ENOSPC. So we don't really need to get
the precise count as long as number of CPUs are taken into
consideration in the comparison.

I think you are looking in the wrong place. There is nothing
wrong with XFS doing two compares here. If we are hitting the
__percpu_counter_compare() slow path too much, then we should be
understanding exactly why that slow path is being hit so hard so
often. I don't see any analysis of the actual per-cpu counter
behaviour and why the slow path is being taken so often....

I am thinking of making the following changes:

No. Please test the change to the per-cpu counters that I suggested:

/*
* Aggregate the per-cpu counter magazines back into the global
* counter. This avoids the need for repeated compare operations to
* run the slow path when the majority of the counter value is held
* in the per-cpu magazines. Folding them back into the global
* counter means we will continue to hit the fast
* percpu_counter_read() path until the counter value falls
* completely within the comparison limit passed to
* __percpu_counter_compare().
*/
static s64 percpu_counter_aggregate(struct percpu_counter *fbc)
{
s64 ret;
int cpu;
unsigned long flags;

raw_spin_lock_irqsave(&fbc->lock, flags);
ret = fbc->count;
for_each_online_cpu(cpu) {
s32 count = __this_cpu_read(*fbc->counters);
ret += count;
__this_cpu_sub(*fbc->counters, count)
}
fbc->count = ret;
raw_spin_unlock_irqrestore(&fbc->lock, flags);
return ret;
}

I don't think that will work as some other CPUs may change the
percpu counters values between percpu_counter_aggregate() and
__percpu_counter_compare(). To be safe, the precise counter has to
be compted whenever the comparison value difference is less than
nr_cpus * batch size.

Well, yes. Why do you think the above function does the same
function as percpu_counter_sum()? So that the percpu_counter_sum()
call *inside* __percpu_counter_compare() can be replaced by this
call. i.e.

return -1;
}
/* Need to use precise count */
- count = percpu_counter_sum(fbc);
+ count = percpu_counter_aggregate(fbc);
if (count> rhs)
return 1;
else if (count< rhs)

Please think about what I'm saying rather than dismissing it without
first understanding my suggestions.

I understood what you were saying. However, the per-cpu counter isn't protected by the spinlock. Reading it is OK, but writing may cause race if that counter is modified by a CPU other than its owning CPU.

The slow performance of percpu_counter_sum() is due to its need to access n different (likely cold) cachelines where n is the number of CPUs in the system. So the larger the system, the more problematic it will be. My main concern about xfs_mod_fdblocks() is that it can potentially call percpu_counter_sum() twice which is what I want to prevent. It is OK if you don't think that change is necessary. However, I will come back if I find more evidence that this can be an issue.

Cheers,
Longman

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Arnd Bergmann: "Re: [PATCH] ARM: cns3xxx: pci: avoid potential stack overflow"
Previous message: Nishanth Menon: "Re: [PATCH] ARM: configs: keystone: Add ramdisk support"
In reply to: Dave Chinner: "Re: [PATCH] percpu_counter: return precise count from __percpu_counter_compare()"
Next in thread: Dave Chinner: "Re: [PATCH] percpu_counter: return precise count from __percpu_counter_compare()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]