Re: [PATCH 1/1] sched/fair: Fix invalid pointer dereference in child_cfs_rq_on_list()

From: Aboorva Devarajan
Date: Wed Mar 05 2025 - 23:48:58 EST


On Wed, 2025-03-05 at 10:23 +0100, Dietmar Eggemann wrote:
> On 05/03/2025 09:21, Vincent Guittot wrote:
> > On Tue, 4 Mar 2025 at 18:00, Aboorva Devarajan <aboorvad@xxxxxxxxxxxxx> wrote:
> > >
> > > In child_cfs_rq_on_list(), leaf_cfs_rq_list.prev is expected to point to
> > > a valid cfs_rq->leaf_cfs_rq_list in the hierarchy. However, when accessed
> > > from the first node in a list, leaf_cfs_rq_list.prev can incorrectly point
> > > back to the list head (rq->leaf_cfs_rq_list) instead of another
> > > cfs_rq->leaf_cfs_rq_list.
> > >
> > > The function does not handle this case, leading to incorrect pointer
> > > calculations and unintended memory accesses, which can result in a kernel
> > > crash.
> > >
> > > A recent attempt to reorder fields in struct rq exposed this issue by
> > > modifying memory offsets and affecting how pointer computations are
> > > resolved. While the problem existed before, it was previously masked by
> > > specific field arrangement. The reordering caused erroneous pointer
> > > accesses, leading to a NULL dereference and a crash, as seen in the
>
> I'm running tip/sched/core on arm64 and I still only see the wrong
> pointer for 'prev_cfs_rq->tg->parent' in the 'prev ==
> &rq->leaf_cfs_rq_list' case?
>
> ...
> cpu=5 prev_cfs_rq->tg=ffff00097efb63a0 parent=0000000000000010
> cfs_rq->tg=ffff000802084000
> ...
>

Hi Dietmar,

Yes, you are right, I meant that we will still have invalid pointers and use it
silently in the vanilla kernel, but it won't always lead to a crash.

The crash in this specific case happens if `prev_cfs_rq->tg` points to a memory
location that cannot be de-referenced. Otherwise, the function de-references and
uses memory locations that are not valid but did not cause a visible failure so far.

Here are more details on what I meant by reordering the runqueue:

With the system and kernel configuration, I encountered the crash while trying
to reorder the runqueue structure, here is the minimal change that caused the
crash on top of v6.14-rc5 kernel:

---
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c8512a9fb022..597c1e6a9b5d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1143,8 +1143,8 @@ struct rq {

#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this CPU: */
+ struct list_head *tmp_alone_branch;
struct list_head leaf_cfs_rq_list;
- struct list_head *tmp_alone_branch;
#endif /* CONFIG_FAIR_GROUP_SCHED */
---

Here is the crash signature:

[ 1.114431][ T552] Kernel attempted to read user page (10000012f) - exploit attempt? (uid: 0)
[ 1.114440][ T552] BUG: Unable to handle kernel data access on read at 0x10000012f
[ 1.114446][ T552] Faulting instruction address: 0xc0000000001c1044
[ 1.116344][ T241] pstore: backend (nvram) writing error (-1)
[ 1.116351][ T241]
[ 1.116354][ T552] Oops: Kernel access of bad area, sig: 11 [#2]
[ 1.116354][ T241] note: kworker/44:0[241] exited with irqs disabled
[ 1.116356][ T552] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[ 1.116368][ T552] Modules linked in: autofs4
[ 1.116374][ T552] CPU: 73 UID: 0 PID: 552 Comm: kworker/73:1 Tainted: G D 6.14.0-rc5-dirty #193
[ 1.116381][ T552] Tainted: [D]=DIE
[ 1.116384][ T552] Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries
[ 1.116391][ T552] NIP: c0000000001c1044 LR: c0000000001c0f98 CTR: c000000000027ad4
[ 1.116396][ T552] REGS: c000000076627870 TRAP: 0300 Tainted: G D (6.14.0-rc5-dirty)
[ 1.116401][ T552] MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48008202 XER: 20040154
...
[ 1.116467][ T552] NIP [c0000000001c1044] sched_balance_update_blocked_averages+0x35c/0x88c
[ 1.116474][ T552] LR [c0000000001c0f98] sched_balance_update_blocked_averages+0x2b0/0x88c

~~~~

Before reordering, struct rq and cfs_rq had the following memory layout
(snippet from pahole):

struct rq {
... (offset bytes)
struct list_head leaf_cfs_rq_list; /* 4048 16 */
struct list_head * tmp_alone_branch; /* 4064 8 */
unsigned int nr_uninterruptible; /* 4072 4 */
...
}
struct cfs_rq {
...
struct list_head leaf_cfs_rq_list; /* 456 16 */
struct task_group * tg; /* 472 8 */
...
}

In child_cfs_rq_on_list(), `prev_cfs_rq` is computed using the `container_of`
macro:

prev_cfs_rq = container_of(prev, struct cfs_rq, leaf_cfs_rq_list);

Since `prev == &rq->leaf_cfs_rq_list`, this results in:

prev_cfs_rq = rq->leaf_cfs_rq_list - 456 (offset of leaf_cfs_rq_list in cfs_rq)

Then, `prev_cfs_rq->tg` is accessed at an offset of 472 bytes from base cfs_rq:

prev_cfs_rq->tg = (rq->leaf_cfs_rq_list - 456) + 472
= rq->leaf_cfs_rq_list + 16
= rq->tmp_alone_branch

Since `tmp_alone_branch` is always at this point a valid pointer, dereferencing
`prev_cfs_rq->tg->parent` doesn't cause a crash, even though it is not
a valid task_group pointer.

~~~~

After reordering, the layout of `struct rq` changed as follows:

struct rq {
... (offset bytes)
struct list_head * tmp_alone_branch; /* 4048 8 */ -> this is shuffled up
struct list_head leaf_cfs_rq_list; /* 4056 16 */
unsigned int nr_uninterruptible; /* 4072 4 */
...
}

The layout of `struct cfs_rq` is unchanged.

Now, when the same pointer arithmetic is performed:

prev_cfs_rq = rq->leaf_cfs_rq_list - 456

prev_cfs_rq->tg = (rq->leaf_cfs_rq_list - 456) + 472
= rq->leaf_cfs_rq_list + 16
= rq->nr_uninterruptible # now this mem location corresponds to nr_uninterruptible.

Since nr_uninterruptible is an integer rather than a de-referenceable pointer, I presume because of
this, de-referencing parent from prev_cfs_rq->tg results in a crash. Otherwise,
incorrect pointers are silently used without a visible failure.


But looks like a patch similar to this is merged yesterday [1], so this can
be ignored :)


[1] https://lore.kernel.org/all/174119292742.14745.16827644501260146974.tip-bot2@tip-bot2/

Thanks,
Aboorva


> > > ...
> > >
>