Re: [tip:perfcounters/core] perf_counter: Simplify and fix taskmigration counting

From: Peter Zijlstra
Date: Fri Jun 19 2009 - 08:39:51 EST


On Fri, 2009-06-19 at 22:24 +1000, Paul Mackerras wrote:
> Peter Zijlstra writes:
>
> > static struct perf_counter_context *perf_pin_task_context(struct task_struct *task)
> > {
> > struct perf_counter_context *ctx;
> > unsigned long flags;
> >
> > ctx = perf_lock_task_context(task, &flags);
> > if (ctx) {
> > ++ctx->pin_count;
> > get_ctx(ctx);
> > spin_unlock_irqrestore(&ctx->lock, flags);
> > }
> > return ctx;
> > }
> >
> > Is buggy because perf_lock_task_context() can return a dead context.
> >
> > the RCU read lock in perf_lock_task_context() only guarantees the memory
> > won't get freed, it doesn't guarantee the object is valid (in our case
> > refcount > 0).
> >
> > Therefore we can return a locked object that can get freed the moment we
> > release the rcu read lock.
> >
> > perf_pin_task_context() then increases the refcount and does an unlock
> > on freed memory.
> >
> > That increased refcount will cause a double free, in case it started out
> > with 0.
>
> Wow, good catch! Thanks for finding that.

The clue that got me started was this scribble:

[ 123.800172] =============================================================================
[ 123.808725] BUG kmalloc-512: Poison overwritten
[ 123.813422] -----------------------------------------------------------------------------
[ 123.813424]
[ 123.823390] INFO: 0xffff8801bd546480-0xffff8801bd546588. First byte 0x6c instead of 0x6b
[ 123.831777] INFO: Allocated in perf_counter_init_task+0x8d/0x262 age=351 cpu=1 pid=12276
[ 123.840047] INFO: Freed in free_ctx+0x10/0x12 age=256 cpu=1 pid=12669
[ 123.846702] INFO: Slab 0xffffea0009914f60 objects=28 used=14 fp=0xffff8801bd546480 flags=0x2000000000040c3
[ 123.856710] INFO: Object 0xffff8801bd546480 @offset=9344 fp=0xffff8801bd546fe8
[ 123.856712]
[ 123.865782] Bytes b4 0xffff8801bd546470: 31 3b fd ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a 1;<FD><FF>....ZZZZZZZZ
[ 123.876163] Object 0xffff8801bd546480: 6c 6c 6b 6b 6b 6b 6b 6b ff ff ff ff 6b 6b 6b 6b llkkkkkk<FF><FF><FF><FF>kkkk
[ 123.886447] Object 0xffff8801bd546490: ff ff ff ff ff ff ff ff 6b 6b 6b 6b 6b 6b 6b 6b <FF><FF><FF><FF><FF><FF><FF><FF>kkkkkkkk
[ 123.896810] Object 0xffff8801bd5464a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 123.907162] Object 0xffff8801bd5464b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 123.917421] Object 0xffff8801bd5464c0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 123.927837] Object 0xffff8801bd5464d0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 123.938147] Object 0xffff8801bd5464e0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
[ 123.948398] Object 0xffff8801bd5464f0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk

And from pahole (thanks acme!) we find that the only relevant perf
object in that bucket is:

struct perf_counter_context {
spinlock_t lock; /* 0 64 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct mutex mutex; /* 64 152 */
/* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */
struct list_head counter_list; /* 216 16 */
struct list_head event_list; /* 232 16 */
int nr_counters; /* 248 4 */
int nr_active; /* 252 4 */
/* --- cacheline 4 boundary (256 bytes) --- */
int is_active; /* 256 4 */
atomic_t refcount; /* 260 4 */
struct task_struct * task; /* 264 8 */
u64 time; /* 272 8 */
u64 timestamp; /* 280 8 */
struct perf_counter_context * parent_ctx; /* 288 8 */
u64 parent_gen; /* 296 8 */
u64 generation; /* 304 8 */
int pin_count; /* 312 4 */

/* XXX 4 bytes hole, try to pack */

/* --- cacheline 5 boundary (320 bytes) --- */
struct rcu_head rcu_head; /* 320 16 */

/* size: 336, cachelines: 6 */
/* sum members: 332, holes: 1, sum holes: 4 */
/* last cacheline: 16 bytes */
};

typedef struct {
raw_spinlock_t raw_lock; /* 0 4 */
unsigned int magic; /* 4 4 */
unsigned int owner_cpu; /* 8 4 */

/* XXX 4 bytes hole, try to pack */

void * owner; /* 16 8 */
struct lockdep_map dep_map; /* 24 40 */
/* --- cacheline 1 boundary (64 bytes) --- */

/* size: 64, cachelines: 1 */
/* sum members: 60, holes: 1, sum holes: 4 */
} spinlock_t; /* definitions: 1 */


Which learns us that we had lock/unlock (6b->6c) + owner and owner_cpu
scribble as through:

static inline void debug_spin_unlock(spinlock_t *lock)
{
SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic");
SPIN_BUG_ON(!spin_is_locked(lock), lock, "already unlocked");
SPIN_BUG_ON(lock->owner != current, lock, "wrong owner");
SPIN_BUG_ON(lock->owner_cpu != raw_smp_processor_id(),
lock, "wrong CPU");
lock->owner = SPINLOCK_OWNER_INIT;
lock->owner_cpu = -1;
}

The SPIN_BUG_ON() failed to trigger in this particular case because
earlier corruption took out lockdep.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/