Re: [PATCH 3/4] mm: introduce bpf_mthp_ops struct ops

From: Vernon Yang

Date: Wed May 06 2026 - 09:10:21 EST

On Mon, May 4, 2026 at 1:35 AM <bot+bpf-ci@xxxxxxxxxx> wrote:
>
> > diff --git a/include/linux/bpf_huge_memory.h b/include/linux/bpf_huge_memory.h
> > new file mode 100644
> > index 000000000000..1c8a6f7ad8f1
> > --- /dev/null
> > +++ b/include/linux/bpf_huge_memory.h
> > @@ -0,0 +1,35 @@
> > +/* SPDX-License-Identifier: GPL-2.0+ */
> > +
> > +#ifndef __BPF_HUGE_MEMORY_H
> > +#define __BPF_HUGE_MEMORY_H
> > +
> > +/**
> > + * struct bpf_mthp_ops - BPF callbacks for mTHP operations
> > + * @mthp_choose: Choose the custom mTHP orders
> > + *
> > + * This structure defines the interface for BPF programs to customize
> > + * mTHP behavior through struct_ops programs.
> > + */
> > +struct bpf_mthp_ops {
> > + unsigned long (*mthp_choose)(struct cgroup *cgrp, unsigned long orders);
> > +};
> > +
> > +#if defined(CONFIG_BPF_TRANSPARENT_HUGEPAGE) && defined(CONFIG_BPF_SYSCALL)
> > +/**
> > + * bpf_mthp_choose: Choose the custom mTHP orders using bpf
> ^
> The kernel-doc format for functions uses parentheses followed by a dash, not
> a colon. When scripts/kernel-doc processes this file, will it skip this
> function documentation?

I will fix that in the next version. Thanks!

> > + * @mm: task mm_struct
> > + * @orders: original orders
> > + *
> > + * Return suited mTHP orders.
> > + */
>
> [ ... ]
>
> > diff --git a/mm/bpf_huge_memory.c b/mm/bpf_huge_memory.c
> > new file mode 100644
> > index 000000000000..e34e0a35edac
> > --- /dev/null
> > +++ b/mm/bpf_huge_memory.c
> > @@ -0,0 +1,169 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +/*
> > + * Huge memory related BPF code
> > + *
> > + * Author: Vernon Yang <yanglincheng@xxxxxxxxxx>
> > + */
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/srcu.h>
> > +
> > +/* Protects cgrp->mthp_ops pointer for read and write. */
> > +DEFINE_SRCU(mthp_bpf_srcu);
> > +
> > +unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders)
> > +{
> > + struct cgroup *cgrp;
> > + struct mem_cgroup *memcg;
> > + struct bpf_mthp_ops *ops;
> > + int idx;
> > +
> > + memcg = get_mem_cgroup_from_mm(mm);
> > + if (!memcg)
> > + return orders;
> > +
> > + cgrp = memcg->css.cgroup;
> > + ops = READ_ONCE(cgrp->mthp_ops);
> ^^^^^
> > + if (unlikely(ops)) {
> > + idx = srcu_read_lock(&mthp_bpf_srcu);
> ^^^^^^^^^^^^^^^^^
> Could there be a use-after-free here? The pointer is read BEFORE entering
> the SRCU read-side critical section. If bpf_mthp_ops_unreg runs on another
> CPU between these two lines, synchronize_srcu returns immediately because
> no reader holds the lock yet.
>
> Race sequence:
> (1) Thread A loads ops (non-NULL) above
> (2) Thread A is preempted
> (3) Thread B calls bpf_mthp_ops_unreg, clears the pointer, calls
> synchronize_srcu which returns immediately
> (4) Thread B frees the backing memory
> (5) Thread A resumes, enters SRCU section, then dereferences ops->mthp_choose
>
> For SRCU to protect the pointer across its use, the reader must hold
> srcu_read_lock before reading the pointer. Would moving the srcu_read_lock
> before the READ_ONCE prevent this race?

Good catch, I will fix that in the next version. Thanks!

> > + if (ops->mthp_choose)
> > + orders = ops->mthp_choose(cgrp, orders);
> > + srcu_read_unlock(&mthp_bpf_srcu, idx);
> > + }
> > +
> > + mem_cgroup_put(memcg);
> > +
> > + return orders;
> > +}
>
> [ ... ]
>
> > +static int bpf_mthp_ops_reg(void *kdata, struct bpf_link *link)
> > +{
> > + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
> > + struct bpf_mthp_ops *ops = kdata;
> > + struct cgroup *cgrp = st_link->cgroup;
> ^^^^^^^^^^^^
> Can this dereference NULL? When bpf_struct_ops_map_update_elem calls the
> reg callback at kernel/bpf/bpf_struct_ops.c:873, it passes link = NULL for
> maps without BPF_F_LINK. When link is NULL, st_link->cgroup dereferences
> NULL before the check at the next line runs.

Yeah, I will fix that in the next version. Thanks!

> > + struct cgroup_subsys_state *pos;
> > +
> > + /* The link is not yet fully initialized, but cgroup should be set */
> > + if (!link)
> > + return -EOPNOTSUPP;
>
> Does this comment match reality? Looking at bpf_struct_ops_link_create in
> kernel/bpf/bpf_struct_ops.c:1397, the link was fully initialized by
> bpf_link_init and bpf_link_prime before calling reg. Also, struct
> bpf_struct_ops_link does not have a cgroup field in its definition at
> kernel/bpf/bpf_struct_ops.c:58.
>
> Also, does struct bpf_struct_ops_link actually have a cgroup field? Its
> definition shows only { struct bpf_link link; struct bpf_map __rcu *map;
> wait_queue_head_t wait_hup; }. The patch does not add this field to the
> struct. Would this cause a compilation error?

This series is based on linux v7.1-rc1 (26fd6bff2c05) + "mm: BPF OOM"[3]
first four patches.

All of the above definitions are in "mm: BPF OOM" PATCH#2.

https://lore.kernel.org/linux-mm/20260127024421.494929-3-roman.gushchin@xxxxxxxxx/

> > +
> > + cgroup_lock();
> > + css_for_each_descendant_pre(pos, &cgrp->self) {
> > + struct cgroup *child = pos->cgroup;
> > +
> > + if (READ_ONCE(child->mthp_ops)) {
> > + /* TODO
> ^^^^^
> This isn't a bug, but per BPF comment style (subsystem/bpf.md), multi-line
> comments should have the opening /* on its own line in BPF-related files.
>
> > + * Do not destroy the cgroup hierarchy property.
> > + * If an eBPF program already exists in the sub-cgroup,
> > + * trigger an error and clear the already set
> > + * bpf_mthp_ops data.
> > + */
> > + continue;
> > + }
> > + WRITE_ONCE(child->mthp_ops, ops);
> > + }
> > + cgroup_unlock();
> > +
> > + return 0;
> > +}
> > +
> > +static void bpf_mthp_ops_unreg(void *kdata, struct bpf_link *link)
> > +{
> > + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
> > + struct bpf_mthp_ops *ops = kdata;
> > + struct cgroup *cgrp = st_link->cgroup;
> ^^^^^^^^^^^^
> Can this dereference NULL? Looking at bpf_struct_ops_map_delete_elem in
> kernel/bpf/bpf_struct_ops.c:936, it calls unreg with link = NULL. When
> link is NULL, st_link->cgroup dereferences NULL.

Yeah, I will fix that in the next version. Thanks!

> > + struct cgroup_subsys_state *pos;
> > +
> > + cgroup_lock();
> > + css_for_each_descendant_pre(pos, &cgrp->self) {
> > + struct cgroup *child = pos->cgroup;
> > +
> > + if (READ_ONCE(child->mthp_ops) == ops)
> > + WRITE_ONCE(child->mthp_ops, NULL);
> > + }
> > + cgroup_unlock();
> > +
> > + synchronize_srcu(&mthp_bpf_srcu);
> > +}
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25285279315