Re: [PATCH 3/4] mm: introduce bpf_mthp_ops struct ops

From: bot+bpf-ci

Date: Sun May 03 2026 - 13:35:50 EST

> diff --git a/include/linux/bpf_huge_memory.h b/include/linux/bpf_huge_memory.h
> new file mode 100644
> index 000000000000..1c8a6f7ad8f1
> --- /dev/null
> +++ b/include/linux/bpf_huge_memory.h
> @@ -0,0 +1,35 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +
> +#ifndef __BPF_HUGE_MEMORY_H
> +#define __BPF_HUGE_MEMORY_H
> +
> +/**
> + * struct bpf_mthp_ops - BPF callbacks for mTHP operations
> + * @mthp_choose: Choose the custom mTHP orders
> + *
> + * This structure defines the interface for BPF programs to customize
> + * mTHP behavior through struct_ops programs.
> + */
> +struct bpf_mthp_ops {
> + unsigned long (*mthp_choose)(struct cgroup *cgrp, unsigned long orders);
> +};
> +
> +#if defined(CONFIG_BPF_TRANSPARENT_HUGEPAGE) && defined(CONFIG_BPF_SYSCALL)
> +/**
> + * bpf_mthp_choose: Choose the custom mTHP orders using bpf
^
The kernel-doc format for functions uses parentheses followed by a dash, not
a colon. When scripts/kernel-doc processes this file, will it skip this
function documentation?

> + * @mm: task mm_struct
> + * @orders: original orders
> + *
> + * Return suited mTHP orders.
> + */

[ ... ]

> diff --git a/mm/bpf_huge_memory.c b/mm/bpf_huge_memory.c
> new file mode 100644
> index 000000000000..e34e0a35edac
> --- /dev/null
> +++ b/mm/bpf_huge_memory.c
> @@ -0,0 +1,169 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Huge memory related BPF code
> + *
> + * Author: Vernon Yang <yanglincheng@xxxxxxxxxx>
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/srcu.h>
> +
> +/* Protects cgrp->mthp_ops pointer for read and write. */
> +DEFINE_SRCU(mthp_bpf_srcu);
> +
> +unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders)
> +{
> + struct cgroup *cgrp;
> + struct mem_cgroup *memcg;
> + struct bpf_mthp_ops *ops;
> + int idx;
> +
> + memcg = get_mem_cgroup_from_mm(mm);
> + if (!memcg)
> + return orders;
> +
> + cgrp = memcg->css.cgroup;
> + ops = READ_ONCE(cgrp->mthp_ops);
^^^^^
> + if (unlikely(ops)) {
> + idx = srcu_read_lock(&mthp_bpf_srcu);
^^^^^^^^^^^^^^^^^
Could there be a use-after-free here? The pointer is read BEFORE entering
the SRCU read-side critical section. If bpf_mthp_ops_unreg runs on another
CPU between these two lines, synchronize_srcu returns immediately because
no reader holds the lock yet.

Race sequence:
(1) Thread A loads ops (non-NULL) above
(2) Thread A is preempted
(3) Thread B calls bpf_mthp_ops_unreg, clears the pointer, calls
synchronize_srcu which returns immediately
(4) Thread B frees the backing memory
(5) Thread A resumes, enters SRCU section, then dereferences ops->mthp_choose

For SRCU to protect the pointer across its use, the reader must hold
srcu_read_lock before reading the pointer. Would moving the srcu_read_lock
before the READ_ONCE prevent this race?

> + if (ops->mthp_choose)
> + orders = ops->mthp_choose(cgrp, orders);
> + srcu_read_unlock(&mthp_bpf_srcu, idx);
> + }
> +
> + mem_cgroup_put(memcg);
> +
> + return orders;
> +}

[ ... ]

> +static int bpf_mthp_ops_reg(void *kdata, struct bpf_link *link)
> +{
> + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
> + struct bpf_mthp_ops *ops = kdata;
> + struct cgroup *cgrp = st_link->cgroup;
^^^^^^^^^^^^
Can this dereference NULL? When bpf_struct_ops_map_update_elem calls the
reg callback at kernel/bpf/bpf_struct_ops.c:873, it passes link = NULL for
maps without BPF_F_LINK. When link is NULL, st_link->cgroup dereferences
NULL before the check at the next line runs.

> + struct cgroup_subsys_state *pos;
> +
> + /* The link is not yet fully initialized, but cgroup should be set */
> + if (!link)
> + return -EOPNOTSUPP;

Does this comment match reality? Looking at bpf_struct_ops_link_create in
kernel/bpf/bpf_struct_ops.c:1397, the link was fully initialized by
bpf_link_init and bpf_link_prime before calling reg. Also, struct
bpf_struct_ops_link does not have a cgroup field in its definition at
kernel/bpf/bpf_struct_ops.c:58.

Also, does struct bpf_struct_ops_link actually have a cgroup field? Its
definition shows only { struct bpf_link link; struct bpf_map __rcu *map;
wait_queue_head_t wait_hup; }. The patch does not add this field to the
struct. Would this cause a compilation error?

> +
> + cgroup_lock();
> + css_for_each_descendant_pre(pos, &cgrp->self) {
> + struct cgroup *child = pos->cgroup;
> +
> + if (READ_ONCE(child->mthp_ops)) {
> + /* TODO
^^^^^
This isn't a bug, but per BPF comment style (subsystem/bpf.md), multi-line
comments should have the opening /* on its own line in BPF-related files.

> + * Do not destroy the cgroup hierarchy property.
> + * If an eBPF program already exists in the sub-cgroup,
> + * trigger an error and clear the already set
> + * bpf_mthp_ops data.
> + */
> + continue;
> + }
> + WRITE_ONCE(child->mthp_ops, ops);
> + }
> + cgroup_unlock();
> +
> + return 0;
> +}
> +
> +static void bpf_mthp_ops_unreg(void *kdata, struct bpf_link *link)
> +{
> + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
> + struct bpf_mthp_ops *ops = kdata;
> + struct cgroup *cgrp = st_link->cgroup;
^^^^^^^^^^^^
Can this dereference NULL? Looking at bpf_struct_ops_map_delete_elem in
kernel/bpf/bpf_struct_ops.c:936, it calls unreg with link = NULL. When
link is NULL, st_link->cgroup dereferences NULL.

> + struct cgroup_subsys_state *pos;
> +
> + cgroup_lock();
> + css_for_each_descendant_pre(pos, &cgrp->self) {
> + struct cgroup *child = pos->cgroup;
> +
> + if (READ_ONCE(child->mthp_ops) == ops)
> + WRITE_ONCE(child->mthp_ops, NULL);
> + }
> + cgroup_unlock();
> +
> + synchronize_srcu(&mthp_bpf_srcu);
> +}

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25285279315