Re: dm bufio: Reduce dm_bufio_lock contention
From: Mikulas Patocka
Date: Mon Jun 18 2018 - 18:11:37 EST
On Fri, 15 Jun 2018, Michal Hocko wrote:
> On Fri 15-06-18 08:47:52, Mikulas Patocka wrote:
> >
> >
> > On Fri, 15 Jun 2018, Michal Hocko wrote:
> >
> > > On Fri 15-06-18 07:35:07, Mikulas Patocka wrote:
> > > >
> > > > Because mempool uses it. Mempool uses allocations with "GFP_NOIO |
> > > > __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". An so dm-bufio uses
> > > > these flags too. dm-bufio is just a big mempool.
> > >
> > > This doesn't answer my question though. Somebody else is doing it is not
> > > an explanation. Prior to your 41c73a49df31 there was no GFP_NOIO
> > > allocation AFAICS. So why do you really need it now? Why cannot you
> >
> > dm-bufio always used "GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC |
> > __GFP_NOWARN" since the kernel 3.2 when it was introduced.
> >
> > In the kernel 4.10, dm-bufio was changed so that it does GFP_NOWAIT
> > allocation, then drops the lock and does GFP_NOIO with the dropped lock
> > (because someone was likely experiencing the same issue that is reported
> > in this thread) - there are two commits that change it - 9ea61cac0 and
> > 41c73a49df31.
>
> OK, I see. Is there any fundamental reason why this path has to do one
> round of GFP_IO or it can keep NOWAIT, drop the lock, sleep and retry
> again?
If the process is woken up, there was some buffer added to the freelist,
or refcount of some buffer was dropped to 0. In this case, we don't want
to drop the lock and use GFP_NOIO, because the freed buffer may disappear
when we drop the lock.
> [...]
> > > is the same class of problem, honestly, I dunno. And I've already said
> > > that stalling __GFP_NORETRY might be a good way around that but that
> > > needs much more consideration and existing users examination. I am not
> > > aware anybody has done that. Doing changes like that based on a single
> > > user is certainly risky.
> >
> > Why don't you set any rules how these flags should be used?
>
> It is really hard to change rules during the game. You basically have to
> examine all existing users and that is well beyond my time scope. I've
> tried that where it was possible. E.g. __GFP_REPEAT and turned it into a
> well defined semantic. __GFP_NORETRY is a bigger beast.
>
> Anyway, I believe that it would be much safer to look at the problem
> from a highlevel perspective. You seem to be focused on __GFP_NORETRY
> little bit too much IMHO. We are not throttling callers which explicitly
> do not want to or cannot - see current_may_throttle. Is it possible that
> both your md and mempool allocators can either (ab)use PF_LESS_THROTTLE
> or use other means? E.g. do you have backing_dev_info at that time?
> --
> Michal Hocko
> SUSE Labs
I grepped the kernel for __GFP_NORETRY and triaged them. I found 16 cases
without a fallback - those are bugs that make various functions randomly
return -ENOMEM.
Most of the callers provide callback.
There is another strange flag - __GFP_RETRY_MAYFAIL - it provides two
different functions - if the allocation is larger than
PAGE_ALLOC_COSTLY_ORDER, it retries the allocation as if it were smaller.
If the allocations is smaller than PAGE_ALLOC_COSTLY_ORDER,
__GFP_RETRY_MAYFAIL will avoid the oom killer (larger order allocations
don't trigger the oom killer at all).
So, perhaps __GFP_RETRY_MAYFAIL could be used instead of __GFP_NORETRY in
the cases where the caller wants to avoid trigerring the oom killer (the
problem is that __GFP_NORETRY causes random failure even in no-oom
situations but __GFP_RETRY_MAYFAIL doesn't).
So my suggestion is - fix these obvious bugs when someone allocates memory
with __GFP_NORETRY without any fallback - and then, __GFP_NORETRY could be
just changed to return NULL instead of sleeping.
arch/arm/mm/dma-mapping.c - fallback to a smaller size without __GFP_NORETRY
arch/mips/mm/dma-default.c - says that it uses __GFP_NORETRY to avoid the
oom killer, provides no fallback - it seems to be a BUG
arch/sparc/mm/tsb.c - fallback to a smaller size without __GFP_NORETRY
arch/x86/include/asm/floppy.h - __GFP_NORETRY doesn't seem to serve any
purpose, it may cause random failures during initialization, can be
removed - BUG
arch/powerpc/mm/mmu_context_iommu.c - uses it just during moving pages,
there's no problem with failure
arch/powerpc/platforms/pseries/cmm.c - a vm balloon driver, no problem
with failure
block/bio.c - falls back to mempool
block/blk-mq.c - errorneous use of __GFP_NORETRY during initialization, it
falls back to a smaller size, but doesn't drop the __GFP_NORETRY flag
(BUG)
drivers/gpu/drm/i915/i915_gem.c - it starts with __GFP_NORETRY and on
failure, it ORs it with __GFP_RETRY_MAYFAIL (which of these conflicting
flags wins?)
drivers/gpu/drm/i915/i915_gem_gtt.c - __GFP_NORETRY is used during
initialization (BUG), it shouldn't be used
drivers/gpu/drm/i915/i915_gem_execbuffer.c - fallback to a smaller size without __GFP_NORETRY
drivers/gpu/drm/i915/i915_gem_internal.c - fallback to a smaller size without __GFP_NORETRY
size
drivers/gpu/drm/i915/i915_gem_userptr.c - seems to provide fallback
drivers/gpu/drm/i915/i915_gpu_error.c - fallback to a smaller size without __GFP_NORETRY
drivers/gpu/drm/etnaviv/etnaviv_dump.c - coredump on error path, no
problem if it fails
drivers/gpu/drm/ttm/ttm_page_alloc.c - uses __GFP_NORETRY for transparent
hugepages, no problem with failure
drivers/gpu/drm/ttm/ttm_page_alloc_dma.c - uses __GFP_NORETRY for
transparent hugepages, no problem with failure
drivers/gpu/drm/msm/msm_gem_submit.c - uses __GFP_NORETRY to process ioctl
and lacks a fallback - it is a BUG - __GFP_NORETRY should be dropped
drivers/hv/hv_balloon.c - a vm balloon driver, no problem with failure
drivers/crypto/chelsio/chtls/chtls_io.c - fallback to a smaller size without __GFP_NORETRY
drivers/xen/balloon.c - a vm balloon driver, no problem with failure
drivers/mtd/mtdcore.c - fallback to a smaller size without __GFP_NORETRY
drivers/md/dm-verity-target.c - skips prefetch on failure, no problem
drivers/md/dm-writecache.c - falls back to a smaller i/os on failure
drivers/md/dm-bufio.c - reserves some buffers on creation and falls back
to them
drivers/md/dm-integrity.c - falls back to sector-by-sector verification
drivers/md/dm-kcopyd.c - falls back to reserved pages
drivers/md/dm-crypt.c - falls back to mempool
drivers/iommu/dma-iommu.c - fallback to a smaller size without __GFP_NORETRY
drivers/mmc/core/mmc_test.c - fallback to a smaller size, but doesn't drop
__GFP_NORETRY - BUG
drivers/staging/android/ion/ion_system_heap.c - fallback to a smaller size without __GFP_NORETRY
fs/cachefiles/ - uses __GFP_NORETRY extensively - since this is just a
cache, so failure supposedly shouldn't be problem - but it's hard to
verify the whole driver that it handles failures properly
fs/xfs/xfs_buf.c - uses __GFP_NORETRY only on readahead
fs/fscache/cookie.c - no fallback, but if it fails, the caller will just
invalidate the cache entry - no problem
fs/fscache/page.c - like above - failure will just inhibit caching
fs/nfs/write.c - fails only if allowed by the arugment never_fail
include/linux/kexec.h - no fallback - it seems to be a BUG
include/linux/pagemap.h - uses __GFP_NORETRY only on readahead
kernel/bpf/syscall.c - it says it need __GFP_NORETRY to avoid oom killer -
provides no fallback, so it seems to be BUG
kernel/events/ring_buffer.c - falls back to a smaller size, but doesn't drop
__GFP_NORETRY - BUG
kernel/groups.c - falls back to vmalloc (should it use kvmalloc?)
kernel/trace/ring_buffer.c - no fallback - BUG
kernel/trace/trace.c - no fallback - BUG
kernel/power/swap.c - falls back, but there is useless WARN_ON_ONCE(1) on
the fallback path
lib/debugobjects.c - turns off debugging on alloaction failure
lib/rhashtable.c - uses __GFP_NORETRY only with GFP_ATOMIC - __GFP_NORETRY
is useless
mm/hugetlb.c - no problem if hugepage allocation fails
mm/mempool.c - falls back to mempool
mm/kmemleak.c - disables kmemleak on allocation failure
mm/page_alloc.c/__page_frag_cache_refill - fallback to a smaller size without __GFP_NORETRY
mm/zswap.c/zswap_pool_create - used during pool creation - probably a BUG
mm/zswap.c/zswap_frontswap_store - used when compressing a page - probably
ok to fail
mm/memcontrol.c - I don't know if it is a bug
mm/shmem.c - used during hugepage allocation - ok to fail
mm/slub.c - fallback to a smaller size without __GFP_NORETRY
mm/util.c - fallback to vmalloc
net/packet/af_packet.c - fallback without __GFP_NORETRY
net/xdp/xsk_queue.c - no fallback - BUG
net/core/skbuff.c - fallback to a smaller size without __GFP_NORETRY
net/core/sock.c - fallback to a smaller size without __GFP_NORETRY
net/netlink/af_netlink.c - it's ok to fail when decreasing the size of skb
net/smc/smc_core.c - falls back to a smaller size, but doesn't drop
__GFP_NORETRY - BUG
net/netfilter/x_tables.c - __GFP_NORETRY is used to avoid the oom killer
- provides no fallback - it seems to be a BUG
security/integrity/ima/ima_crypto.c - fallback to a smaller size without __GFP_NORETRY
sound/core/memalloc.c - __GFP_NORETRY is used to avoid the oom killer
- provides no fallback - it seems to be a BUG
Mikulas