The slab i_ino recycling approach works somewhat, but is unfortunately neutered
quite a lot by the fact that slab recycling is per-memcg. That is, replacing
with recycle_or_get_next_ino(old_ino)[0] for shmfs and a few other trivial
callsites only leads to about 10% slab reuse, which doesn't really stem the
bleeding of 32-bit inums on an affected workload:
# tail -5000 /sys/kernel/debug/tracing/trace | grep -o 'recycle_or_get_next_ino:.*' | sort | uniq -c
4454 recycle_or_get_next_ino: not recycled
546 recycle_or_get_next_ino: recycled
Too bad..
Maybe recycled ino should be implemented all the same because it is simple
and may improve workloads that are not so MEMCG intensive.
Roman (who I've just added to cc) tells me that currently we only have
per-memcg slab reuse instead of global when using CONFIG_MEMCG. This
contributes fairly significantly here since there are multiple tasks across
multiple cgroups which are contributing to the get_next_ino() thrash.
I think this is a good start, but we need something of a different magnitude in
order to actually solve this problem with the current slab infrastructure. How
about something like the following?
1. Add get_next_ino_full, which uses whatever the full width of ino_t is
2. Use get_next_ino_full in tmpfs (et al.)
I would prefer that filesystems making heavy use of get_next_ino, be converted
to use a private ino pool per sb:
ino_pool_create()
ino_pool_get_next()
flags to ino_pool_create() can determine the desired ino range.
Does the Facebook use case involve a single large tmpfs or many
small ones? I would guess the latter and therefore we are trying to solve
a problem that nobody really needs to solve (i.e. global efficient ino pool).
3. Add a mount option to tmpfs (et al.), say `32bit-inums`, which people can
pass if they want the 32-bit inode numbers back. This would still allow
people who want to make this tradeoff to use xino.
inode32|inode64 (see man xfs(5)).
4. (If you like) Also add a CONFIG option to disable this at compile time.
I Don't know about disable, but the default mode for tmpfs (inode32|inode64)
might me best determined by CONFIG option, so distro builders could decide
if they want to take the risk of breaking applications on tmpfs.
But if you implement per sb ino pool, maybe inode64 will no longer be
required for your use case?