Re: [PATCH 02/13] uapi: General notification ring definitions [ver #4]

From: Darrick J. Wong
Date: Fri Jun 07 2019 - 11:19:19 EST


On Fri, Jun 07, 2019 at 03:17:40PM +0100, David Howells wrote:
> Add UAPI definitions for the general notification ring, including the
> following pieces:
>
> (1) struct watch_notification.
>
> This is the metadata header for each entry in the ring. It includes a
> type and subtype that indicate the source of the message
> (eg. WATCH_TYPE_MOUNT_NOTIFY) and the kind of the message
> (eg. NOTIFY_MOUNT_NEW_MOUNT).
>
> The header also contains an information field that conveys the
> following information:
>
> - WATCH_INFO_LENGTH. The size of the entry (entries are variable
> length).
>
> - WATCH_INFO_OVERRUN. If preceding messages were lost due to ring
> overrun or lack of memory.
>
> - WATCH_INFO_ENOMEM. If preceding messages were lost due to lack
> of memory.
>
> - WATCH_INFO_RECURSIVE. If the event detected was applied to
> multiple objects (eg. a recursive change to mount attributes).
>
> - WATCH_INFO_IN_SUBTREE. If the event didn't happen at the watched
> object, but rather to some related object (eg. a subtree mount
> watch saw a mount happen somewhere within the subtree).
>
> - WATCH_INFO_TYPE_FLAGS. Eight flags whose meanings depend on the
> message type.
>
> - WATCH_INFO_ID. The watch ID specified when the watchpoint was
> set.
>
> All the information in the header can be used in filtering messages at
> the point of writing into the buffer.
>
> (2) struct watch_queue_buffer.
>
> This describes the layout of the ring. Note that the first slots in
> the ring contain a special metadata entry that contains the ring
> pointers. The producer in the kernel knows to skip this and it has a
> proper header (WATCH_TYPE_META, WATCH_META_SKIP_NOTIFICATION) that
> indicates the size so that the ring consumer can handle it the same as
> any other record and just skip it.
>
> Note that this means that ring entries can never be split over the end
> of the ring, so if an entry would need to be split, a skip record is
> inserted to wrap the ring first; this is also WATCH_TYPE_META,
> WATCH_META_SKIP_NOTIFICATION.
>
> Signed-off-by: David Howells <dhowells@xxxxxxxxxx>

I'm starting by reading the uapi changes and the sample program...

> ---
>
> include/uapi/linux/watch_queue.h | 63 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 63 insertions(+)
> create mode 100644 include/uapi/linux/watch_queue.h
>
> diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
> new file mode 100644
> index 000000000000..c3a88fa5f62a
> --- /dev/null
> +++ b/include/uapi/linux/watch_queue.h
> @@ -0,0 +1,63 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_WATCH_QUEUE_H
> +#define _UAPI_LINUX_WATCH_QUEUE_H
> +
> +#include <linux/types.h>
> +
> +enum watch_notification_type {
> + WATCH_TYPE_META = 0, /* Special record */
> + WATCH_TYPE_MOUNT_NOTIFY = 1, /* Mount notification record */
> + WATCH_TYPE_SB_NOTIFY = 2, /* Superblock notification */
> + WATCH_TYPE_KEY_NOTIFY = 3, /* Key/keyring change notification */
> + WATCH_TYPE_BLOCK_NOTIFY = 4, /* Block layer notifications */
> +#define WATCH_TYPE___NR 5

Given the way enums work I think you can just make WATCH_TYPE___NR the
last element in the enum?

> +};
> +
> +enum watch_meta_notification_subtype {
> + WATCH_META_SKIP_NOTIFICATION = 0, /* Just skip this record */
> + WATCH_META_REMOVAL_NOTIFICATION = 1, /* Watched object was removed */
> +};
> +
> +/*
> + * Notification record
> + */
> +struct watch_notification {

Kind of a long name...

> + __u32 type:24; /* enum watch_notification_type */
> + __u32 subtype:8; /* Type-specific subtype (filterable) */

16777216 diferent types and 256 different subtypes? My gut instinct
wants a better balance, though I don't know where I'd draw the line.
Probably 12 bits for type and 10 for subtype? OTOH I don't have a good
sense of how many distinct notification types an XFS would want to send
back to userspace, and maybe 256 subtypes is fine. We could always
reserve another watch_notification_type if we need > 256.

Ok, no objections. :)

> + __u32 info;
> +#define WATCH_INFO_OVERRUN 0x00000001 /* Event(s) lost due to overrun */
> +#define WATCH_INFO_ENOMEM 0x00000002 /* Event(s) lost due to ENOMEM */
> +#define WATCH_INFO_RECURSIVE 0x00000004 /* Change was recursive */
> +#define WATCH_INFO_LENGTH 0x000001f8 /* Length of record / sizeof(watch_notification) */

This is a mask, isn't it? Could we perhaps have some helpers here?
Something along the lines of...

#define WATCH_INFO_LENGTH_MASK 0x000001f8
#define WATCH_INFO_LENGTH_SHIFT 3

static inline size_t watch_notification_length(struct watch_notification *wn)
{
return (wn->info & WATCH_INFO_LENGTH_MASK) >> WATCH_INFO_LENGTH_SHIFT *
sizeof(struct watch_notification);
}

static inline struct watch_notification *watch_notification_next(
struct watch_notification *wn)
{
return wn + ((wn->info & WATCH_INFO_LENGTH_MASK) >>
WATCH_INFO_LENGTH_SHIFT);
}

...so that we don't have to opencode all of the ring buffer walking
magic and stuff?

(I might also shorten the namespace from WATCH_INFO_ to WNI_ ...)

Hmm so the length field is 6 bits and therefore the maximum size of a
notification record is ... 63 * (sizeof(u32) * 2) = 504 bytes? Which
means that kernel users can send back a maximum payload of 496 bytes?
That's probably big enough for random fs notifications (bad metadata
detected, media errors, etc.)

Judging from the sample program I guess all that userspace does is
allocate a memory buffer and toss it into the kernel, which then
initializes the ring management variables, and from there we just scan
around the ring buffer every time poll(watch_fd) says there's something
to do? How does userspace tell the kernel the size of the ring buffer?

Does (watch_notification->info & WATCH_INFO_LENGTH) == 0 have any
meaning besides apparently "stop looking at me"?

> +#define WATCH_INFO_IN_SUBTREE 0x00000200 /* Change was not at watched root */
> +#define WATCH_INFO_TYPE_FLAGS 0x00ff0000 /* Type-specific flags */

WATCH_INFO_FLAG_MASK ?

> +#define WATCH_INFO_FLAG_0 0x00010000
> +#define WATCH_INFO_FLAG_1 0x00020000
> +#define WATCH_INFO_FLAG_2 0x00040000
> +#define WATCH_INFO_FLAG_3 0x00080000
> +#define WATCH_INFO_FLAG_4 0x00100000
> +#define WATCH_INFO_FLAG_5 0x00200000
> +#define WATCH_INFO_FLAG_6 0x00400000
> +#define WATCH_INFO_FLAG_7 0x00800000
> +#define WATCH_INFO_ID 0xff000000 /* ID of watchpoint */

WATCH_INFO_ID_MASK ?

> +#define WATCH_INFO_ID__SHIFT 24

Why double underscore here?

> +};
> +
> +#define WATCH_LENGTH_SHIFT 3
> +
> +struct watch_queue_buffer {
> + union {
> + /* The first few entries are special, containing the
> + * ring management variables.

The first /two/ entries, correct?

Also, weird multiline comment style.

> + */
> + struct {
> + struct watch_notification watch; /* WATCH_TYPE_META */

Do these structures have to be cache-aligned for the atomic reads and
writes to work?

--D

> + __u32 head; /* Ring head index */
> + __u32 tail; /* Ring tail index */
> + __u32 mask; /* Ring index mask */
> + } meta;
> + struct watch_notification slots[0];
> + };
> +};
> +
> +#endif /* _UAPI_LINUX_WATCH_QUEUE_H */
>