Re: Null pointer crash at find_idlest_group on db845c w/ linus/master
From: John Stultz
Date: Tue Dec 03 2019 - 22:46:28 EST
On Tue, Dec 3, 2019 at 4:13 PM Valentin Schneider
<valentin.schneider@xxxxxxx> wrote:
>
> On 03/12/2019 23:49, Valentin Schneider wrote:
> > On 03/12/2019 23:20, Valentin Schneider wrote:
> >> Looking at the code, I think I got it. In find_idlest_group() we do
> >> initialize 'idlest_sgs' (just like busiest_stat in LB) but 'idlest' is just
> >> NULL. The latter is dereferenced in update_pick_idlest() just for the misfit
> >> case, which goes boom. And I reviewed the damn thing... Bleh.
> >>
> >> Fixup looks easy enough, lemme write one up.
> >>
> >
> > Wait no, that can't be right. We can only get in there if both 'group' and
> > 'idlest' have the same group_type, which can't be true on the first pass.
> > So if we go through the misfit stuff, idlest *has* to be set to something.
> > Bah.
> >
>
> So I think the thing really is dying on a sched_group->sgc deref (pahole says
> sgc is at offset #16), which means we have a NULL sched_group somewhere, but
> I don't see how. That can either be 'local' (can't be, first group we visit
> and doesn't go through update_pick_idlest()) or 'idlest' (see previous email).
>
> Now, it's bedtime for me, if you get the chance in the meantime can you give
> this a shot? I was about to send it out but realized it didn't really make
> sense, but you never know...
>
> Also, if it is indeed misfit related, I'm surprised we (Arm folks) haven't
> hit it sooner. We've had our scheduler tests running on the LB rework for at
> least a month, so we should've hit it.
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 08a233e97a01..e19ab7bff0f3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8348,7 +8348,14 @@ static bool update_pick_idlest(struct sched_group *idlest,
> return false;
>
> case group_misfit_task:
> - /* Select group with the highest max capacity */
> + /*
> + * Select group with the highest max capacity. First group we
> + * visit gets picked as idlest to allow later capacity
> + * comparisons.
> + */
> + if (!idlest)
> + return true;
> +
> if (idlest->sgc->max_capacity >= group->sgc->max_capacity)
> return false;
> break;
Thanks for the quick patch! Unfortunately I still managed to trip it
with this patch :(
thanks
-john
[ 191.807900] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000010
[ 191.814822] audit: audit_lost=284 audit_rate_limit=5 audit_backlog_limit=64
[ 191.816779] Mem abort info:
[ 191.816782] ESR = 0x96000005
[ 191.816786] EC = 0x25: DABT (current EL), IL = 32 bits
[ 191.816789] SET = 0, FnV = 0
[ 191.816791] EA = 0, S1PTW = 0
[ 191.816793] Data abort info:
[ 191.816797] ISV = 0, ISS = 0x00000005
[ 191.816799] CM = 0, WnR = 0
[ 191.816804] user pgtable: 4k pages, 39-bit VAs, pgdp=00000001719fc000
[ 191.816809] [0000000000000010] pgd=0000000000000000, pud=0000000000000000
[ 191.823849] audit: rate limit exceeded
[ 191.826660] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[ 191.873941] Modules linked in:
[ 191.877043] CPU: 7 PID: 50 Comm: kauditd Tainted: G W
5.4.0-mainline-11226-g81d7081fb03e #1276
[ 191.887039] Hardware name: Thundercomm Dragonboard 845c (DT)
[ 191.892751] pstate: 60c00085 (nZCv daIf +PAN +UAO)
[ 191.897612] pc : find_idlest_group.isra.95+0x368/0x690
[ 191.902809] lr : find_idlest_group.isra.95+0x104/0x690
[ 191.908002] sp : ffffffc01036b890
[ 191.911356] x29: ffffffc01036b890 x28: ffffffe471250480
[ 191.916720] x27: ffffff80f1828000 x26: 0000000000000001
[ 191.922082] x25: 0000000000000000 x24: ffffffc01036b998
[ 191.927447] x23: ffffff80f8e41a00 x22: ffffffe471519e30
[ 191.932811] x21: ffffff80f8f16aa0 x20: ffffff80f8f16e80
[ 191.938173] x19: ffffffe47151a610 x18: ffffffe470fd1ef0
[ 191.943543] x17: 0000000000000000 x16: 0000000000000000
[ 191.948912] x15: 0000000000000001 x14: 632c323135633a30
[ 191.954275] x13: 733a656c69665f61 x12: 7461645f7070613a
[ 191.959636] x11: 0000000000000730 x10: 000000000000025d
[ 191.964998] x9 : 0000000000003b15 x8 : 0000000000000075
[ 191.970361] x7 : ffffff80f8f16200 x6 : ffffffe471250000
[ 191.975728] x5 : 0000000000000000 x4 : 0000000000000008
[ 191.981090] x3 : ffffff80f8f16100 x2 : ffffff80f8f16d00
[ 191.986452] x1 : 0000000000000000 x0 : 0000000000000002
[ 191.991815] Call trace:
[ 191.994309] find_idlest_group.isra.95+0x368/0x690
[ 191.999155] select_task_rq_fair+0x4e4/0xd88
[ 192.003476] try_to_wake_up+0x21c/0x7f8
[ 192.007353] default_wake_function+0x34/0x48
[ 192.011675] pollwake+0x98/0xc8
[ 192.014863] __wake_up_common+0x90/0x158
[ 192.018838] __wake_up_common_lock+0x88/0xd0
[ 192.023163] __wake_up_sync_key+0x40/0x50
[ 192.027228] sock_def_readable+0x44/0x78
[ 192.031197] __netlink_sendskb+0x44/0x58
[ 192.035170] netlink_unicast+0x220/0x258
[ 192.039143] kauditd_send_queue+0xa4/0x158
[ 192.043292] kauditd_thread+0xf4/0x248
[ 192.047085] kthread+0x130/0x138
[ 192.050355] ret_from_fork+0x10/0x1c
[ 192.053984] Code: 54001260 7100081f 54001601 a9478ba1 (f9400821)
[ 192.060141] ---[ end trace c650b8d609e54a39 ]---
[ 192.064812] Kernel panic - not syncing: Fatal exception