Re: [PATCH] mm readahead: Fix sys_readahead breakage by reverting 2MB limit (bug 79111)
From: Rafael Aquini
Date: Fri Oct 03 2014 - 16:57:53 EST
On Thu, Jul 03, 2014 at 02:58:54PM -0700, Linus Torvalds wrote:
> On Thu, Jul 3, 2014 at 12:43 PM, John Stoffel <john@xxxxxxxxxxx> wrote:
> >
> > This is one of those perennial questions of how to tune this. I agree
> > we should increase the number, but shouldn't it be based on both the
> > amount of memory in the machine, number of devices (or is it all just
> > one big pool?) and the speed of the actual device doing readahead?
>
> Sure. But I don't trust the throughput data for the backing device at
> all, especially early at boot. We're supposed to work it out for
> writeback over time (based on device contention etc), but I never saw
> that working, and for reading I don't think we have even any code to
> do so.
>
> And trying to be clever and basing the read-ahead size on the node
> memory size was what caused problems to begin with (with memory-less
> nodes) that then made us just hardcode the maximum.
>
> So there are certainly better options - in theory. In practice, I
> think we don't really care enough, and the better options are
> questionably implementable.
>
> I _suspect_ the right number is in that 2-8MB range, and I would
> prefer to keep it at the low end at least until somebody really has
> numbers (and preferably from different real-life situations).
>
> I also suspect that read-ahead is less of an issue with non-rotational
> storage in general, since the only real reason for it tends to be
> latency reduction (particularly the "readahead()" kind of big-hammer
> thing that is really just useful for priming caches). So there's some
> argument to say that it's getting less and less important.
>
I believe you're right, but we still sort of broke the expectation for
deliberately issued readaheads, and that is what I believe the reporter
was complaining about (if poorly worded) in the bugzilla ticket. We
recently got the following report:
https://bugzilla.redhat.com/show_bug.cgi?id=1103240
which is pretty much the same thing reported at the kernel's BZ. I ran
some empirical tests with iozone (forcing MADV_WILLNEED behaviour) and
double-checked the numbers our performance team got from their
regression tests, and I honestly couldn't see any change, for better or
worse, that could be directly related to the change in question.
Other than setting a hard 2MB ceiling on any issued readahead, which
might be seen as trouble by certain users, there seems to be no other
measurable loss here. OTOH, the tangible gain after the change is having
readahead keep working for NUMA layouts where some CPUs sit within a
memoryless node.
I believe we could take David Rientjes's earlier suggestion and, instead
of imposing a hard limit in max_sane_readahead(), change it to replace
the numa_node_id() calls with numa_mem_id(), and follow up on the
CONFIG_HAVE_MEMORYLESS_NODES requirements on PPC so it works properly
(which seems to be the reason that approach was set aside).
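For reference, a minimal userspace sketch of the failure mode and the
suggested fix (not the kernel code itself: node_pages[], the page
counts, and the mock_* helpers are made-up stand-ins for the per-node
vmstat counters and the real numa_node_id()/numa_mem_id()):

```c
#include <assert.h>

/* Hypothetical per-node "free + inactive file" page counts.
 * Node 1 is memoryless, so its usable page count is 0. */
static unsigned long node_pages[2] = { 262144, 0 };

/* Stand-in for numa_node_id(): the node the current CPU sits on,
 * here a CPU attached to the memoryless node 1. */
static int mock_numa_node_id(void) { return 1; }

/* Stand-in for numa_mem_id(): the nearest node that actually has
 * memory (in the kernel this needs CONFIG_HAVE_MEMORYLESS_NODES). */
static int mock_numa_mem_id(void) { return 0; }

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Model of the pre-2MB-limit logic: cap the request at half the
 * chosen node's reclaimable + free pages.  Passing a memoryless
 * node's id makes the cap collapse to 0, silently disabling
 * readahead; passing numa_mem_id() keeps a sane cap. */
static unsigned long max_sane_readahead(unsigned long nr, int node)
{
	return min_ul(nr, node_pages[node] / 2);
}
```

With these made-up numbers, a 4096-page readahead issued from a CPU on
the memoryless node is capped to 0 under numa_node_id(), but stays
intact under numa_mem_id(), which is the gist of the suggestion above.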
Best regards,
-- Rafael