Re: aio: questions with ioctx_alloc() and large num_possible_cpus()

From: Kent Overstreet
Date: Wed Oct 05 2016 - 02:34:44 EST


On Tue, Oct 04, 2016 at 07:55:12PM -0300, Mauricio Faria de Oliveira wrote:
> Hi Benjamin, Kent, and others,
>
> Would you please comment / answer about this possible problem?
> Any feedback is appreciated.
>
> Since commit e1bdd5f27a5b ("aio: percpu reqs_available") the maximum
> number of aio nr_events may be a function of num_possible_cpus() and
> actually be /inversely proportional/ to it (i.e., more CPUs lead to
> less system-wide aio nr_events). This is a problem on larger systems.
>
> That's because if "nr_events < num_possible_cpus() * 4" (for example
> nr_events == 1) that counts as "num_possible_cpus() * 4" into aio_nr
> and against aio_max_nr:
>
> static struct kioctx *ioctx_alloc(unsigned nr_events)
> ...
> nr_events = max(nr_events, num_possible_cpus() * 4);
> nr_events *= 2;
> ...
> /* limit the number of system wide aios */
> ....
> if (aio_nr + nr_events > (aio_max_nr * 2UL) ||
> ...
> err = -EAGAIN;
> ...
> aio_nr += ctx->max_reqs;
> ...
>
> That problem is easily noticeable on a common POWER8 system: 160 CPUs
> (2 sockets * 10 cores/socket * 8 threads/core = 160 CPUs) limits the max
> AIO contexts with "io_setup(1, )" to 102 out of 64k (default aio_max_nr):
>
> # cat /sys/devices/system/cpu/possible
> 0-159
>
> # cat /proc/sys/fs/aio-max-nr
> 65536
>
> # echo $(( 65536 / (160 * 4) ))
> 102
>
> test-case snippet & output:
>
> for (i = 0; i < 65536; i++)
>         if ((rc = io_setup(1, &ioctx[i])))
>                 break;
>
> printf("rc = %d, i = %d\n", rc, i);
>
> > rc = -11, i = 102
>
> (another problem is that the sysctl aio-nr grows larger than aio-max-nr,
> since it's checked against "aio_max_nr * 2")
>
> So,
>
> I've been trying to understand/fix this, but soon got stuck on options
> as I didn't quite get a few points.. if you could provide some insight,
> please, that would be really helpful:
>
> - why "num_possible_cpus() * 4", and why "max(nr_events, <it>)" ?

For the scheme to work (percpu allocation of slots), we have to ensure that
there aren't too many unused slots stranded on other CPUs. I limited the
stranding to 1/4 of the slots because I figured any more than that would be
too unpredictable: the effective maximum number of in-flight iocbs would vary
too much.
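
A rough sketch of that bound, as a simplified userspace model and not the
actual fs/aio.c code. It assumes each CPU can sit on at most one batch of
unused slots; batch_size(), max_stranded() and clamp_slots() are hypothetical
helpers named only for this illustration:

```c
#include <assert.h>

/*
 * Simplified model (an assumption, not the kernel implementation): each
 * CPU may hold at most one batch of unused slots in its private counter.
 */

/* Batch size chosen so that all CPUs together hold at most 1/4 of the slots. */
static unsigned batch_size(unsigned nr_slots, unsigned nr_cpus)
{
	assert(nr_cpus > 0);
	return nr_slots / (nr_cpus * 4);
}

/* Worst case: every CPU is holding a full, unused batch. */
static unsigned max_stranded(unsigned nr_slots, unsigned nr_cpus)
{
	return nr_cpus * batch_size(nr_slots, nr_cpus);
}

/* The floor quoted from ioctx_alloc(): at least 4 slots per possible CPU. */
static unsigned clamp_slots(unsigned nr_events, unsigned nr_cpus)
{
	unsigned floor = nr_cpus * 4;

	return nr_events > floor ? nr_events : floor;
}
```

In this model, without the floor a request for 1 event on 160 CPUs would give
a batch size of 0, so no CPU could ever take a slot. With the floor, the
context gets 640 slots, a batch size of 1, and at most 160 slots (1/4)
stranded.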

For systems with large numbers of CPUs, what I'd prefer to do is make it per
core or per NUMA node or some such. But we don't have any infrastructure for
that equivalent to the alloc_percpu() stuff, so that's why I didn't do it at
the time.
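
For reference, the accounting quoted above can be replayed in userspace with
a minimal sketch like the following. It assumes each context adds its doubled
nr_events to aio_nr (the quoted snippet only shows "aio_nr += ctx->max_reqs"),
and charged_slots() and max_contexts() are hypothetical names, not kernel
functions:

```c
#include <assert.h>

/* Slots charged per context: the clamp and doubling quoted from ioctx_alloc(). */
static unsigned long charged_slots(unsigned long nr_events,
				   unsigned long nr_cpus)
{
	unsigned long floor = nr_cpus * 4;

	assert(nr_cpus > 0);
	if (nr_events < floor)
		nr_events = floor;
	return nr_events * 2;
}

/*
 * Count how many io_setup(nr_events, ...) calls would succeed before the
 * system-wide check fails with -EAGAIN.
 * Assumption: every successful call adds charged_slots() to aio_nr.
 */
static unsigned long max_contexts(unsigned long nr_events,
				  unsigned long nr_cpus,
				  unsigned long aio_max_nr)
{
	unsigned long per_ctx = charged_slots(nr_events, nr_cpus);
	unsigned long aio_nr = 0;
	unsigned long n = 0;

	/* Same comparison as the quoted check, inverted to "still fits". */
	while (aio_nr + per_ctx <= aio_max_nr * 2UL) {
		aio_nr += per_ctx;
		n++;
	}
	return n;
}
```

Under these assumptions, max_contexts(1, 160, 65536) returns 102, matching
the test case, while the same call with 4 CPUs returns 4096. At the point of
failure the modeled aio_nr is 102 * 1280 = 130560, which also shows how
aio-nr ends up larger than aio-max-nr.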