Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods

From: K Prateek Nayak
Date: Tue May 30 2023 - 04:07:00 EST


Hello Tejun,

I ran into a NULL pointer dereference when testing a build of the
"affinity-scopes-v1" branch from your workqueue tree
(https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/?h=affinity-scopes-v1).
Inlining the splat, some debug details, and the workaround I used below.

On 5/19/2023 5:46 AM, Tejun Heo wrote:
> While renamed to pod, the code still assumes that the pods are defined by
> NUMA boundaries. Let's generalize it:
>
> * workqueue_attrs->affn_scope is added. Each enum represents the type of
> boundaries that define the pods. There are currently two scopes -
> WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before
> - one pod per NUMA node. The latter defines one global pod across the
> whole system.
>
> * struct wq_pod_type is added which describes how pods are configured for
> each affinity scope. For each pod, it lists the member CPUs and the
> preferred NUMA node for memory allocations. The reverse mapping from CPU
> to pod is also available.
>
> * wq_pod_enabled is dropped. Pod is now always enabled. The previously
> disabled behavior is now implemented through WQ_AFFN_SYSTEM.
>
> * get_unbound_pool() wants to determine the NUMA node to allocate memory
> from for the new pool. The variables are renamed from node to pod but the
> logic still assumes they're one and the same. Clearly distinguish them -
> walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's
> NUMA node.
>
> * wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA
> node. Take @cpu instead and determine the cpumask to use from the pod_type
> matching @attrs.
>
> * apply_wqattrs_prepare() is updated to return ERR_PTR() on error instead of
> NULL so that it can indicate -EINVAL on invalid affinity scopes.
>
> This patch allows CPUs to be grouped into pods however desired per type.
> While this patch causes some internal behavior changes, nothing material
> should change for workqueue users.
>
> Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
> ---
> include/linux/workqueue.h | 31 +++++++-
> kernel/workqueue.c | 154 ++++++++++++++++++++++++--------------
> 2 files changed, 125 insertions(+), 60 deletions(-)
>
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index b8961c8ea5b3..a2f826b6ec9a 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -124,6 +124,15 @@ struct rcu_work {
> struct workqueue_struct *wq;
> };
>
> +enum wq_affn_scope {
> + WQ_AFFN_NUMA, /* one pod per NUMA node */
> + WQ_AFFN_SYSTEM, /* one pod across the whole system */
> +
> + WQ_AFFN_NR_TYPES,
> +
> + WQ_AFFN_DFL = WQ_AFFN_NUMA,
> +};
> +
> /**
> * struct workqueue_attrs - A struct for workqueue attributes.
> *
> @@ -140,12 +149,26 @@ struct workqueue_attrs {
> */
> cpumask_var_t cpumask;
>
> + /*
> + * Below fields aren't properties of a worker_pool. They only modify how
> + * :c:func:`apply_workqueue_attrs` select pools and thus don't
> + * participate in pool hash calculations or equality comparisons.
> + */
> +
> /**
> - * @ordered: work items must be executed one by one in queueing order
> + * @affn_scope: unbound CPU affinity scope
> *
> - * Unlike other fields, ``ordered`` isn't a property of a worker_pool. It
> - * only modifies how :c:func:`apply_workqueue_attrs` select pools and thus
> - * doesn't participate in pool hash calculations or equality comparisons.
> + * CPU pods are used to improve execution locality of unbound work
> + * items. There are multiple pod types, one for each wq_affn_scope, and
> + * every CPU in the system belongs to one pod in every pod type. CPUs
> + * that belong to the same pod share the worker pool. For example,
> + * selecting %WQ_AFFN_NUMA makes the workqueue use a separate worker
> + * pool for each NUMA node.
> + */
> + enum wq_affn_scope affn_scope;
> +
> + /**
> + * @ordered: work items must be executed one by one in queueing order
> */
> bool ordered;
> };
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index add6f5fc799b..dae1787833cb 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -325,7 +325,18 @@ struct workqueue_struct {
>
> static struct kmem_cache *pwq_cache;
>
> -static cpumask_var_t *wq_pod_cpus; /* possible CPUs of each node */
> +/*
> + * Each pod type describes how CPUs should be grouped for unbound workqueues.
> + * See the comment above workqueue_attrs->affn_scope.
> + */
> +struct wq_pod_type {
> + int nr_pods; /* number of pods */
> + cpumask_var_t *pod_cpus; /* pod -> cpus */
> + int *pod_node; /* pod -> node */
> + int *cpu_pod; /* cpu -> pod */
> +};
> +
> +static struct wq_pod_type wq_pod_types[WQ_AFFN_NR_TYPES];
>
> /*
> * Per-cpu work items which run for longer than the following threshold are
> @@ -341,8 +352,6 @@ module_param_named(power_efficient, wq_power_efficient, bool, 0444);
>
> static bool wq_online; /* can kworkers be created yet? */
>
> -static bool wq_pod_enabled; /* unbound CPU pod affinity enabled */
> -
> /* buf for wq_update_unbound_pod_attrs(), protected by CPU hotplug exclusion */
> static struct workqueue_attrs *wq_update_pod_attrs_buf;
>
> @@ -1753,10 +1762,6 @@ static int select_numa_node_cpu(int node)
> {
> int cpu;
>
> - /* No point in doing this if NUMA isn't enabled for workqueues */
> - if (!wq_pod_enabled)
> - return WORK_CPU_UNBOUND;
> -
> /* Delay binding to CPU if node is not valid or online */
> if (node < 0 || node >= MAX_NUMNODES || !node_online(node))
> return WORK_CPU_UNBOUND;
> @@ -3639,6 +3644,7 @@ struct workqueue_attrs *alloc_workqueue_attrs(void)
> goto fail;
>
> cpumask_copy(attrs->cpumask, cpu_possible_mask);
> + attrs->affn_scope = WQ_AFFN_DFL;
> return attrs;
> fail:
> free_workqueue_attrs(attrs);
> @@ -3650,11 +3656,13 @@ static void copy_workqueue_attrs(struct workqueue_attrs *to,
> {
> to->nice = from->nice;
> cpumask_copy(to->cpumask, from->cpumask);
> +
> /*
> - * Unlike hash and equality test, this function doesn't ignore
> - * ->ordered as it is used for both pool and wq attrs. Instead,
> - * get_unbound_pool() explicitly clears ->ordered after copying.
> + * Unlike hash and equality test, copying shouldn't ignore wq-only
> + * fields as copying is used for both pool and wq attrs. Instead,
> + * get_unbound_pool() explicitly clears the fields.
> */
> + to->affn_scope = from->affn_scope;
> to->ordered = from->ordered;
> }
>
> @@ -3680,6 +3688,24 @@ static bool wqattrs_equal(const struct workqueue_attrs *a,
> return true;
> }
>
> +/* find wq_pod_type to use for @attrs */
> +static const struct wq_pod_type *
> +wqattrs_pod_type(const struct workqueue_attrs *attrs)
> +{
> + struct wq_pod_type *pt = &wq_pod_types[attrs->affn_scope];
> +
> + if (likely(pt->nr_pods))
> + return pt;
> +
> + /*
> + * Before workqueue_init_topology(), only SYSTEM is available which is
> + * initialized in workqueue_init_early().
> + */
> + pt = &wq_pod_types[WQ_AFFN_SYSTEM];
> + BUG_ON(!pt->nr_pods);
> + return pt;
> +}
> +
> /**
> * init_worker_pool - initialize a newly zalloc'd worker_pool
> * @pool: worker_pool to initialize
> @@ -3880,10 +3906,10 @@ static void put_unbound_pool(struct worker_pool *pool)
> */
> static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
> {
> + struct wq_pod_type *pt = &wq_pod_types[WQ_AFFN_NUMA];
> u32 hash = wqattrs_hash(attrs);
> struct worker_pool *pool;
> - int pod;
> - int target_pod = NUMA_NO_NODE;
> + int pod, node = NUMA_NO_NODE;
>
> lockdep_assert_held(&wq_pool_mutex);
>
> @@ -3895,28 +3921,24 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
> }
> }
>
> - /* if cpumask is contained inside a pod, we belong to that pod */
> - if (wq_pod_enabled) {
> - for_each_node(pod) {
> - if (cpumask_subset(attrs->cpumask, wq_pod_cpus[pod])) {
> - target_pod = pod;
> - break;
> - }
> + /* If cpumask is contained inside a NUMA pod, that's our NUMA node */
> + for (pod = 0; pod < pt->nr_pods; pod++) {
> + if (cpumask_subset(attrs->cpumask, pt->pod_cpus[pod])) {
> + node = pt->pod_node[pod];
> + break;
> }
> }
>
> /* nope, create a new one */
> - pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, target_pod);
> + pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, node);
> if (!pool || init_worker_pool(pool) < 0)
> goto fail;
>
> copy_workqueue_attrs(pool->attrs, attrs);
> - pool->node = target_pod;
> + pool->node = node;
>
> - /*
> - * ordered isn't a worker_pool attribute, always clear it. See
> - * 'struct workqueue_attrs' comments for detail.
> - */
> + /* clear wq-only attr fields. See 'struct workqueue_attrs' comments */
> + pool->attrs->affn_scope = WQ_AFFN_NR_TYPES;

When booting the kernel, I hit the following NULL pointer dereference:

[ 4.280321] BUG: kernel NULL pointer dereference, address: 0000000000000004
[ 4.284172] #PF: supervisor read access in kernel mode
[ 4.284172] #PF: error_code(0x0000) - not-present page
[ 4.284172] PGD 0 P4D 0
[ 4.284172] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 4.284172] CPU: 1 PID: 21 Comm: cpuhp/1 Not tainted 6.4.0-rc1-tj-wq+ #448
[ 4.284172] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
[ 4.284172] RIP: 0010:wq_calc_pod_cpumask+0x5a/0x180
[ 4.284172] Code: 24 e0 08 94 96 4d 8d bc 24 e0 08 94 96 85 d2 0f 84 ec 00 00 00 49 8b 47 18 49 63 f5 48 8b 53 08 48 8b 7b 10 8b 0d 56 4b f0 01 <4c> 63 2c b0 49 8b 47 08 4a 8b 34 e8 e8 25 4f 63 00 48 8b 7b 10 8b
[ 4.284172] RSP: 0018:ffff9b548d20fd88 EFLAGS: 00010286
[ 4.284172] RAX: 0000000000000000 RBX: ffff8baec0048380 RCX: 0000000000000100
[ 4.284172] RDX: ffff8baec0159440 RSI: 0000000000000001 RDI: ffff8baec0159dc0
[ 4.284172] RBP: ffff9b548d20fdb0 R08: ffffffffffffffff R09: ffffffffffffffff
[ 4.284172] R10: ffffffffffffffff R11: ffffffffffffffff R12: 00000000000000a0
[ 4.284172] R13: 0000000000000001 R14: ffffffffffffffff R15: ffffffff96940980
[ 4.284172] FS: 0000000000000000(0000) GS:ffff8bed3d240000(0000) knlGS:0000000000000000
[ 4.284172] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4.284172] CR2: 0000000000000004 CR3: 000000530ae3c001 CR4: 0000000000770ee0
[ 4.284172] PKRU: 55555554
[ 4.284172] Call Trace:
[ 4.284172] <TASK>
[ 4.284172] wq_update_pod+0x89/0x1e0
[ 4.284172] workqueue_online_cpu+0x1fc/0x250
[ 4.284172] ? watchdog_nmi_enable+0x12/0x20
[ 4.284172] ? __pfx_workqueue_online_cpu+0x10/0x10
[ 4.284172] cpuhp_invoke_callback+0x165/0x4b0
[ 4.284172] ? try_to_wake_up+0x279/0x690
[ 4.284172] cpuhp_thread_fun+0xc4/0x1b0
[ 4.284172] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 4.284172] smpboot_thread_fn+0xe7/0x1e0
[ 4.284172] kthread+0xfb/0x130
[ 4.284172] ? __pfx_kthread+0x10/0x10
[ 4.284172] ret_from_fork+0x2c/0x50
[ 4.284172] </TASK>
[ 4.284172] Modules linked in:
[ 4.284172] CR2: 0000000000000004
[ 4.284172] ---[ end trace 0000000000000000 ]---

I was basically hitting the following, seemingly impossible scenario in
wqattrs_pod_type():

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3825,8 +3825,10 @@ wqattrs_pod_type(const struct workqueue_attrs *attrs)
{
struct wq_pod_type *pt = &wq_pod_types[attrs->affn_scope];

- if (likely(pt->nr_pods))
+ if (likely(pt->nr_pods)) {
+ BUG_ON(!pt->cpu_pod); /* No pods despite thinking we have? */
return pt;
+ }

/*
* Before workqueue_init_topology(), only SYSTEM is available which is
--

Logging the value of "attrs->affn_scope" when hitting the scenario gave
me "5", which corresponds to "WQ_AFFN_NR_TYPES". The kernel was thus
indexing beyond the bounds of wq_pod_types[].

"affn_scope" is only set to "WQ_AFFN_NR_TYPES" in the get_unbound_pool()
hunk quoted above, and I got the kernel to boot with the following change:

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4069,7 +4071,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
pool->node = node;

/* clear wq-only attr fields. See 'struct workqueue_attrs' comments */
- pool->attrs->affn_scope = WQ_AFFN_NR_TYPES;
+ pool->attrs->affn_scope = wq_affn_dfl;
pool->attrs->localize = false;
pool->attrs->ordered = false;
--

Let me know if the above approach is correct, since I'm not sure whether
setting "affn_scope" to "WQ_AFFN_NR_TYPES" has a special significance
that is being handled elsewhere. Thank you :)

> pool->attrs->ordered = false;
>
> if (worker_pool_assign_id(pool) < 0)
> [..snip..]
--
Thanks and Regards,
Prateek