Re: [PATCH 0/2] Move away from non-failing small allocations

From: Tetsuo Handa
Date: Tue Mar 17 2015 - 10:06:44 EST

Next message: Sudip Mukherjee: "Re: [PATCH 1/2] staging: unisys: remove unused variables"
Previous message: Rafael J. Wysocki: "Re: [update][PATCH v10 06/21] ACPI / sleep: Introduce CONFIG_ACPI_GENERIC_SLEEP"
In reply to: Michal Hocko: "Re: [PATCH 0/2] Move away from non-failing small allocations"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Michal Hocko wrote:
> On Mon 16-03-15 15:38:43, Andrew Morton wrote:
> > Realistically, I don't think this overall effort will be successful -
> > we'll add the knob, it won't get enough testing and any attempt to
> > alter the default will be us deliberately destabilizing the kernel
> > without knowing how badly :(
>
> Without the knob we do not allow users to test this at all though and
> the transition will _never_ happen. Which is IMHO bad.
>

Even with the knob, quite little users will test this. The consequence is
likely that end users rush into customer support center about obscure bugs.
I'm working at a support center, and such bugs are really annoying.

> > I wonder if we can alter the behaviour only for filesystem code, so we
> > constrain the new behaviour just to that code where we're having
> > problems. Most/all fs code goes via vfs methods so there's a reasonably
> > small set of places where we can call
>
> We are seeing issues with the fs code now because the test cases which
> led to the current discussion exercise FS code. The code which does
> lock(); kmalloc(GFP_KERNEL) is not reduced there though. I am pretty sure
> we can find other subsystems if we try hard enough.

I'm expecting for patches which avoids deadlock by lock(); kmalloc(GFP_KERNEL).

> > static inline void enter_fs_code(struct super_block *sb)
> > {
> > if (sb->my_small_allocations_can_fail)
> > current->small_allocations_can_fail++;
> > }
> >
> > that way (or something similar) we can select the behaviour on a per-fs
> > basis and the rest of the kernel remains unaffected. Other subsystems
> > can opt in as well.
>
> This is basically leading to GFP_MAYFAIL which is completely backwards
> (the hard requirement should be an exception not a default rule).
> I really do not want to end up with stuffing random may_fail annotations
> all over the kernel.
>

I wish that GFP_NOFS / GFP_NOIO regions are annotated with

static inline void enter_fs_code(void)
{
#ifdef CONFIG_DEBUG_GFP_FLAGS
current->in_fs_code++;
#endif
}

static inline void leave_fs_code(void)
{
#ifdef CONFIG_DEBUG_GFP_FLAGS
current->in_fs_code--;
#endif
}

static inline void enter_io_code(void)
{
#ifdef CONFIG_DEBUG_GFP_FLAGS
current->in_io_code++;
#endif
}

static inline void leave_io_code(void)
{
#ifdef CONFIG_DEBUG_GFP_FLAGS
current->in_io_code--;
#endif
}

so that inappropriate GFP_KERNEL usage inside GFP_NOFS region are catchable
by doing

struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, nodemask_t *nodemask)
{
struct zoneref *preferred_zoneref;
struct page *page = NULL;
unsigned int cpuset_mems_cookie;
int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
struct alloc_context ac = {
.high_zoneidx = gfp_zone(gfp_mask),
.nodemask = nodemask,
.migratetype = gfpflags_to_migratetype(gfp_mask),
};

gfp_mask &= gfp_allowed_mask;
+#ifdef CONFIG_DEBUG_GFP_FLAGS
+ WARN_ON(current->in_fs_code & (gfp_mask & __GFP_FS));
+ WARN_ON(current->in_io_code & (gfp_mask & __GFP_IO));
+#endif

lockdep_trace_alloc(gfp_mask);

. It is difficult for non-fs developers to determine whether they need to use
GFP_NOFS than GFP_KERNEL in their code. An example is seen at
http://marc.info/?l=linux-security-module&m=138556479607024&w=2 .

Moreover, I don't know how GFP flags are managed when stacked like
"a swap file on ext4 on top of LVM (with snapshots) on a RAID array
connected over iSCSI" (quoted from comments on Jon's writeup), but I
wish that the distinction between GFP_KERNEL / GFP_NOFS / GFP_NOIO
are removed from memory allocating function callers by doing

static inline void enter_fs_code(void)
{
current->in_fs_code++;
}

static inline void leave_fs_code(void)
{
current->in_fs_code--;
}

static inline void enter_io_code(void)
{
current->in_io_code++;
}

static inline void leave_io_code(void)
{
current->in_io_code--;
}

struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, nodemask_t *nodemask)
{
struct zoneref *preferred_zoneref;
struct page *page = NULL;
unsigned int cpuset_mems_cookie;
int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
struct alloc_context ac = {
.high_zoneidx = gfp_zone(gfp_mask),
.nodemask = nodemask,
.migratetype = gfpflags_to_migratetype(gfp_mask),
};

gfp_mask &= gfp_allowed_mask;
+ if (current->in_fs_code)
+ gfp_mask &= ~__GFP_FS;
+ if (current->in_io_code)
+ gfp_mask &= ~__GFP_IO;

lockdep_trace_alloc(gfp_mask);

so that GFP flags passed to memory allocations involved by stacking
will be appropriately masked.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Sudip Mukherjee: "Re: [PATCH 1/2] staging: unisys: remove unused variables"
Previous message: Rafael J. Wysocki: "Re: [update][PATCH v10 06/21] ACPI / sleep: Introduce CONFIG_ACPI_GENERIC_SLEEP"
In reply to: Michal Hocko: "Re: [PATCH 0/2] Move away from non-failing small allocations"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]