Re: frequent lockups in 3.18rc4

From: Thomas Gleixner
Date: Thu Nov 20 2014 - 16:58:34 EST

Next message: Shuah Khan: "Re: [PATCH v2] selftest: size: Add size test for Linux kernel"
Previous message: Eric Dumazet: "Re: [RFC] situation with csum_and_copy_... API"
In reply to: Tejun Heo: "Re: frequent lockups in 3.18rc4"
Next in thread: Andy Lutomirski: "Re: frequent lockups in 3.18rc4"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, 20 Nov 2014, Tejun Heo wrote:
> On Thu, Nov 20, 2014 at 12:50:36AM +0100, Frederic Weisbecker wrote:
> > > Are we talking about different per cpu allocators here or am I missing
> > > something completely non obvious?
> >
> > That's the same allocator yeah. So if the whole memory is dereferenced,
> > faults shouldn't happen indeed.
> >
> > Maybe that was a bug a few years ago but not anymore.
>
> It has been always like that tho. Percpu memory given out is always
> populated and cleared.
>
> > Is it possible that, somehow, some part isn't zeroed by pcpu_alloc()?
> > After all it's allocated with vzalloc() so that part could be skipped. The memset(0)
>
> The vzalloc call is for the internal allocation bitmap not the actual
> percpu memory area. The actual address areas for percpu memory are
> obtained using pcpu_get_vm_areas() call and later get populated using
> map_kernel_range_noflush() (flush is performed after mapping is
> complete).
>
> Trying to remember what happens with vmalloc_fault(). Ah okay, so
> when a new PUD gets created for vmalloc area, we don't go through all
> PGDs and update them. The PGD entries get faulted in lazily. Percpu
> memory allocator clearing or not clearing the allocated area doesn't
> have anything to do with it. The memory area is always fully
> populated in the kernel page table. It's just that the population
> happened while a different PGD was active and this PGD hasn't been
> populated with the new PUD yet.

It's completely undocumented behaviour, whether it has been that way
for ever or not. And I agree with Fredric, that it is insane. Actuallu
it's beyond insane, really.

> So, yeap, vmalloc_fault() can always happen when accessing vmalloc
> areas and the only way to avoid that would be removing lazy PGD
> population - going through all PGDs and populating new PUDs
> immediately.

There is no requirement to go through ALL PGDs and populate that stuff
immediately.

Lets look at the two types of allocations

1) Kernel percpu allocations

2) Per process/task percpu allocations

Of course we do not have a way to distinguish those, but we really
should have one.

#1 Kernel percpu allocations usually happen in the context of driver
bringup, subsystem initialization, interrupt setup etc.

So this is functionality which is not a hotpath and usually
requires some form of synchronization versus the rest of the system
anyway.

The per cpu population stuff is serialized with a mutex anyway, so
what's wrong to have a globaly visible percpu sequence counter,
which is incremented whenever a new allocation is populated or torn
down?

We can make that sequence counter a per cpu variable as well to
avoid the issues of a global variable (preferrably that's a
compile/boot time allocated percpu variable to avoid the obvious
circulus vitiosus)

Now after that increment the allocation side needs to wait for a
scheduling cycle on all cpus (we have mechanisms for that)

So in the scheduler if the same task gets reselected you check that
sequence count and update the PGD if different. If a task switch
happens then you also need to check the sequence count and act
accordingly.

If we make the sequence counter a percpu variable as outlined above
the overhead of checking this is just noise versus the other
nonsense we do in schedule().

#2 That's process related statistics and instrumentation stuff.

Now that just needs a immediate population on the process->mm->pgd
aside of the init_mm.pgd, but that's really not a big deal.

Of course that does not solve the issues we have with the current
infrastructure retroactively, but it allows us to avoid fuckups like
the one Frederic was talking about that perf invented its own kmalloc
based 'percpu' replacement just to workaround the shortcoming in a
particular place.

What really frightens me is the potential and well hidden fuckup
potential which lurks around the corner and the hard to debug once in
a while fallout which might be caused by this.

Thanks,

tglx

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Shuah Khan: "Re: [PATCH v2] selftest: size: Add size test for Linux kernel"
Previous message: Eric Dumazet: "Re: [RFC] situation with csum_and_copy_... API"
In reply to: Tejun Heo: "Re: frequent lockups in 3.18rc4"
Next in thread: Andy Lutomirski: "Re: frequent lockups in 3.18rc4"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]