Re: [PATCH 0/2] Move away from non-failing small allocations

From: Tetsuo Handa
Date: Thu Apr 02 2015 - 07:53:13 EST


Tetsuo Handa wrote:
> Michal Hocko wrote:
> > We are seeing issues with the fs code now because the test cases which
> > led to the current discussion exercise FS code. The code which does
> > lock(); kmalloc(GFP_KERNEL) is not reduced there though. I am pretty sure
> > we can find other subsystems if we try hard enough.
>
> I'm expecting patches which avoid the deadlock caused by lock(); kmalloc(GFP_KERNEL).
>
> > > static inline void enter_fs_code(struct super_block *sb)
> > > {
> > > if (sb->my_small_allocations_can_fail)
> > > current->small_allocations_can_fail++;
> > > }
> > >
> > > that way (or something similar) we can select the behaviour on a per-fs
> > > basis and the rest of the kernel remains unaffected. Other subsystems
> > > can opt in as well.
> >
> > This is basically leading to GFP_MAYFAIL which is completely backwards
> > (the hard requirement should be an exception not a default rule).
> > I really do not want to end up with stuffing random may_fail annotations
> > all over the kernel.
> >
>
> I wish that GFP_NOFS / GFP_NOIO regions were annotated with
>
> static inline void enter_fs_code(void)
> {
> #ifdef CONFIG_DEBUG_GFP_FLAGS
> current->in_fs_code++;
> #endif
> }
>
> static inline void leave_fs_code(void)
> {
> #ifdef CONFIG_DEBUG_GFP_FLAGS
> current->in_fs_code--;
> #endif
> }
>
> static inline void enter_io_code(void)
> {
> #ifdef CONFIG_DEBUG_GFP_FLAGS
> current->in_io_code++;
> #endif
> }
>
> static inline void leave_io_code(void)
> {
> #ifdef CONFIG_DEBUG_GFP_FLAGS
> current->in_io_code--;
> #endif
> }
>
> so that inappropriate GFP_KERNEL usage inside a GFP_NOFS region is catchable
> by doing
>
> struct page *
> __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> struct zonelist *zonelist, nodemask_t *nodemask)
> {
> struct zoneref *preferred_zoneref;
> struct page *page = NULL;
> unsigned int cpuset_mems_cookie;
> int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
> gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
> struct alloc_context ac = {
> .high_zoneidx = gfp_zone(gfp_mask),
> .nodemask = nodemask,
> .migratetype = gfpflags_to_migratetype(gfp_mask),
> };
>
> gfp_mask &= gfp_allowed_mask;
> +#ifdef CONFIG_DEBUG_GFP_FLAGS
> + WARN_ON(current->in_fs_code && (gfp_mask & __GFP_FS));
> + WARN_ON(current->in_io_code && (gfp_mask & __GFP_IO));
> +#endif
>
> lockdep_trace_alloc(gfp_mask);
>
>
> . It is difficult for non-fs developers to determine whether they need to use
> GFP_NOFS rather than GFP_KERNEL in their code. An example is seen at
> http://marc.info/?l=linux-security-module&m=138556479607024&w=2 .

I didn't receive responses about the idea of annotating GFP_NOFS/GFP_NOIO
sections. Therefore, I wrote the patch shown below.

Dave and Michal, can we assume that a thread which initiated a
GFP_NOFS/GFP_NOIO section always terminates that section itself (as with
rcu_read_lock()/rcu_read_unlock())? If we can't assume that,
mask_gfp_flags() in the patch below cannot work.
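For example, I am assuming that every user keeps the section strictly
balanced on the same task, as in this sketch (the function and its
contents are hypothetical, for illustration only):

	static void example_fs_section(void)
	{
		enter_fs_code();  /* __GFP_FS allocations warn from here */
		/* ... work that must not recurse into fs code ... */
		leave_fs_code();  /* __GFP_FS allocations are safe again */
	}

If a section can instead be entered on one thread and left on another
(e.g. handed off to a workqueue), the per-task counters used by
mask_gfp_flags() will not balance and the check breaks.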
----------------------------------------
From 0b18f21c9c9aef2353d355afc83b2a2193bbced7 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Date: Thu, 2 Apr 2015 16:47:47 +0900
Subject: [PATCH] Add macros for annotating GFP_NOFS/GFP_NOIO allocations.

While there are rules for when to use GFP_NOFS or GFP_NOIO, there are
locations where GFP_KERNEL is used by mistake. Such incorrect usage may
cause a deadlock under memory pressure. But it is difficult for non-fs
developers to determine whether they need to use GFP_NOFS rather than
GFP_KERNEL in their code. Also, the deadlock caused by such incorrect
usage is difficult to hit in testing. Therefore, this patch introduces
four macros for annotating the sections where GFP_NOFS or GFP_NOIO
needs to be used.

enter_fs_code() is for annotating that allocations with __GFP_FS are
not permitted until corresponding leave_fs_code() is called.

leave_fs_code() is for annotating that allocations with __GFP_FS are
permitted from now on.

enter_io_code() is for annotating that allocations with __GFP_IO are
not permitted until corresponding leave_io_code() is called.

leave_io_code() is for annotating that allocations with __GFP_IO are
permitted from now on.

These macros are no-op unless CONFIG_DEBUG_GFP_FLAGS is defined.

Note that this patch does not insert these macros into GFP_NOFS/GFP_NOIO
users. Inserting these macros, with appropriate comments, is up to the
GFP_NOFS/GFP_NOIO users who know the dependencies well.
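As an illustration only (not part of this patch), a user holding a lock
which the reclaim path might also need could annotate the region as
follows; the lock, the size and the variables are hypothetical:

	mutex_lock(&sbi->lock);
	enter_fs_code();  /* reclaim must not recurse into this fs */
	ptr = kmalloc(size, GFP_NOFS);  /* GFP_KERNEL here would warn */
	leave_fs_code();
	mutex_unlock(&sbi->lock);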

This patch also introduces one function which makes use of these
annotations when CONFIG_DEBUG_GFP_FLAGS is defined.

mask_gfp_flags() checks for and warns about allocations which pass
__GFP_FS between enter_fs_code() and leave_fs_code(), or __GFP_IO
between enter_io_code() and leave_io_code().

If a GFP_KERNEL allocation is requested between enter_fs_code() and
leave_fs_code(), a warning message

GFP_FS allocation ($gfp_mask) is unsafe for this context

is reported with a backtrace.

If a GFP_KERNEL or GFP_NOFS allocation is requested between
enter_io_code() and leave_io_code(), a warning message

GFP_IO allocation ($gfp_mask) is unsafe for this context

is reported with a backtrace.

Note that mask_gfp_flags() currently catches only allocations which
pass through a path where gfp_allowed_mask is applied. That is, it
cannot check an allocation request which is satisfied without entering
the page or slab allocator, e.g. by returning a part of already
allocated pages. Handling such cases is deferred to future patches.
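To make that limitation concrete: an allocation request served from a
private pool never reaches the page or slab allocator, and therefore
mask_gfp_flags() is never called for it (sketch with a hypothetical
helper):

	enter_fs_code();
	/* Served from already allocated pages; a wrong gfp_mask here
	   goes unnoticed because mask_gfp_flags() never runs. */
	buf = my_pool_alloc(pool, size, GFP_KERNEL);
	leave_fs_code();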

Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
---
include/linux/gfp.h | 17 +++++++++++++++++
include/linux/sched.h | 4 ++++
mm/Kconfig.debug | 7 +++++++
mm/page_alloc.c | 17 ++++++++++++++++-
mm/slab.c | 4 ++--
mm/slob.c | 4 ++--
mm/slub.c | 6 +++---
7 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 51bd1e7..97654ff 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -421,4 +421,21 @@ extern void init_cma_reserved_pageblock(struct page *page);

#endif

+#ifdef CONFIG_DEBUG_GFP_FLAGS
+#define enter_fs_code() do { current->in_fs_code++; } while (0)
+#define leave_fs_code() do { current->in_fs_code--; } while (0)
+#define enter_io_code() do { current->in_io_code++; } while (0)
+#define leave_io_code() do { current->in_io_code--; } while (0)
+extern gfp_t mask_gfp_flags(gfp_t gfp_mask);
+#else
+#define enter_fs_code() do { } while (0)
+#define leave_fs_code() do { } while (0)
+#define enter_io_code() do { } while (0)
+#define leave_io_code() do { } while (0)
+static inline gfp_t mask_gfp_flags(gfp_t gfp_mask)
+{
+ return gfp_mask & gfp_allowed_mask;
+}
+#endif
+
#endif /* __LINUX_GFP_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a419b65..ebaf817 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1710,6 +1710,10 @@ struct task_struct {
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
unsigned long task_state_change;
#endif
+#ifdef CONFIG_DEBUG_GFP_FLAGS
+ unsigned char in_fs_code;
+ unsigned char in_io_code;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 957d3da..0f9ad6f 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -28,3 +28,10 @@ config DEBUG_PAGEALLOC

config PAGE_POISONING
bool
+
+config DEBUG_GFP_FLAGS
+ bool "Check GFP flags passed to memory allocators"
+ depends on DEBUG_KERNEL
+ ---help---
+ Track sections where __GFP_FS or __GFP_IO allocations are unsafe,
+ and warn when such an allocation is requested there.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 40e2942..239b068 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2794,6 +2794,21 @@ got_pg:
return page;
}

+#ifdef CONFIG_DEBUG_GFP_FLAGS
+gfp_t mask_gfp_flags(gfp_t gfp_mask)
+{
+ if (gfp_mask & __GFP_WAIT) {
+ WARN((gfp_mask & __GFP_FS) && current->in_fs_code,
+ "GFP_FS allocation (0x%x) is unsafe for this context\n",
+ gfp_mask);
+ WARN((gfp_mask & __GFP_IO) && current->in_io_code,
+ "GFP_IO allocation (0x%x) is unsafe for this context\n",
+ gfp_mask);
+ }
+ return gfp_mask & gfp_allowed_mask;
+}
+#endif
+
/*
* This is the 'heart' of the zoned buddy allocator.
*/
@@ -2812,7 +2827,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
.migratetype = gfpflags_to_migratetype(gfp_mask),
};

- gfp_mask &= gfp_allowed_mask;
+ gfp_mask = mask_gfp_flags(gfp_mask);

lockdep_trace_alloc(gfp_mask);

diff --git a/mm/slab.c b/mm/slab.c
index c4b89ea..8a73788 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3136,7 +3136,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
void *ptr;
int slab_node = numa_mem_id();

- flags &= gfp_allowed_mask;
+ flags = mask_gfp_flags(flags);

lockdep_trace_alloc(flags);

@@ -3224,7 +3224,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
unsigned long save_flags;
void *objp;

- flags &= gfp_allowed_mask;
+ flags = mask_gfp_flags(flags);

lockdep_trace_alloc(flags);

diff --git a/mm/slob.c b/mm/slob.c
index 94a7fed..c731eb4 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -430,7 +430,7 @@ __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller)
int align = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);
void *ret;

- gfp &= gfp_allowed_mask;
+ gfp = mask_gfp_flags(gfp);

lockdep_trace_alloc(gfp);

@@ -536,7 +536,7 @@ void *slob_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
{
void *b;

- flags &= gfp_allowed_mask;
+ flags = mask_gfp_flags(flags);

lockdep_trace_alloc(flags);

diff --git a/mm/slub.c b/mm/slub.c
index 82c4737..60fe0a9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1263,7 +1263,7 @@ static inline void kfree_hook(const void *x)
static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
gfp_t flags)
{
- flags &= gfp_allowed_mask;
+ flags = mask_gfp_flags(flags);
lockdep_trace_alloc(flags);
might_sleep_if(flags & __GFP_WAIT);

@@ -1276,7 +1276,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
static inline void slab_post_alloc_hook(struct kmem_cache *s,
gfp_t flags, void *object)
{
- flags &= gfp_allowed_mask;
+ flags = mask_gfp_flags(flags);
kmemcheck_slab_alloc(s, flags, object, slab_ksize(s));
kmemleak_alloc_recursive(object, s->object_size, 1, s->flags, flags);
memcg_kmem_put_cache(s);
@@ -1339,7 +1339,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
struct kmem_cache_order_objects oo = s->oo;
gfp_t alloc_gfp;

- flags &= gfp_allowed_mask;
+ flags = mask_gfp_flags(flags);

if (flags & __GFP_WAIT)
local_irq_enable();
--
1.8.3.1