Re: [PATCH] kobject: record and send PREV_SEQNUM with uevents

From: Peter Rajnoha
Date: Wed Oct 18 2017 - 05:42:21 EST


On 10/18/2017 09:21 AM, Peter Rajnoha wrote:
> On 10/17/2017 03:26 PM, Greg KH wrote:
>> On Tue, Oct 17, 2017 at 12:04:41PM +0200, Peter Rajnoha wrote:
>>> On 10/16/2017 04:49 PM, Greg KH wrote:
>>>> On Mon, Oct 16, 2017 at 04:00:26PM +0200, Peter Rajnoha wrote:
>>>>> On 10/16/2017 03:55 PM, Greg KH wrote:
>>>>>> On Mon, Oct 16, 2017 at 02:35:44PM +0200, Peter Rajnoha wrote:
>>>>>>> Record last uevent seqnum that was used with kobject's uevent and send
>>>>>>> it with next uevent as PREV_SEQNUM=<num> variable.
>>>>>>>
>>>>>>> This enhances the way we are able to track and handle uevents in userspace
>>>>>>> if we are trying to catch up with all the uevents that were generated
>>>>>>> while we were not listening or the uevents which were not delivered.
>>>>>>> This may happen if the userspace listener is not running yet when the
>>>>>>> uevent is triggered or there's a period of time when it's not listening
>>>>>>> (e.g. because the userspace listener is being restarted while a new
>>>>>>> uevent fires).
>>>>>>>
>>>>>>> A userspace listener can compare the last uevent seqnum it knows about
>>>>>>> with the last uevent seqnum the kernel reports through uevents delivered
>>>>>>> to userspace, both genuine and synthetic ones (the ones generated by
>>>>>>> writing uevent action name to /sys/.../uevent file). Then, if needed,
>>>>>>> userspace may execute appropriate actions based on that. We don't need
>>>>>>> to reevaluate and recheck all items, instead, we can concentrate only
>>>>>>> on items for which we really and actually need to reflect changed state
>>>>>>> in kernel while the userspace listener was not listening for uevents.
>>>>>>>
>>>>>>> Signed-off-by: Peter Rajnoha <prajnoha@xxxxxxxxxx>
>>>>>>> ---
>>>>>>> include/linux/kobject.h | 1 +
>>>>>>> lib/kobject.c | 1 +
>>>>>>> lib/kobject_uevent.c | 15 +++++++++++++--
>>>>>>> 3 files changed, 15 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/kobject.h b/include/linux/kobject.h
>>>>>>> index e0a6205caa71..ee6db28b6186 100644
>>>>>>> --- a/include/linux/kobject.h
>>>>>>> +++ b/include/linux/kobject.h
>>>>>>> @@ -73,6 +73,7 @@ struct kobject {
>>>>>>> #ifdef CONFIG_DEBUG_KOBJECT_RELEASE
>>>>>>> struct delayed_work release;
>>>>>>> #endif
>>>>>>> + u64 last_uevent_seqnum;
>>>>>>> unsigned int state_initialized:1;
>>>>>>> unsigned int state_in_sysfs:1;
>>>>>>> unsigned int state_add_uevent_sent:1;
>>>>>>> diff --git a/lib/kobject.c b/lib/kobject.c
>>>>>>> index 763d70a18941..2cc809f10ae7 100644
>>>>>>> --- a/lib/kobject.c
>>>>>>> +++ b/lib/kobject.c
>>>>>>> @@ -190,6 +190,7 @@ static void kobject_init_internal(struct kobject *kobj)
>>>>>>> return;
>>>>>>> kref_init(&kobj->kref);
>>>>>>> INIT_LIST_HEAD(&kobj->entry);
>>>>>>> + kobj->last_uevent_seqnum = 0;
>>>>>>> kobj->state_in_sysfs = 0;
>>>>>>> kobj->state_add_uevent_sent = 0;
>>>>>>> kobj->state_remove_uevent_sent = 0;
>>>>>>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>>>>>>> index f237a09a5862..f1cbed06a395 100644
>>>>>>> --- a/lib/kobject_uevent.c
>>>>>>> +++ b/lib/kobject_uevent.c
>>>>>>> @@ -454,8 +454,19 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
>>>>>>> }
>>>>>>>
>>>>>>> mutex_lock(&uevent_sock_mutex);
>>>>>>> - /* we will send an event, so request a new sequence number */
>>>>>>> - retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
>>>>>>> + /*
>>>>>>> + * We will send an event, so request a new sequence number.
>>>>>>> + * Also, record the number for next uevent's PREV_SEQNUM.
>>>>>>> + */
>>>>>>> + kobj->last_uevent_seqnum = ++uevent_seqnum;
>>>>>>> + retval = add_uevent_var(env, "SEQNUM=%llu",
>>>>>>> + (unsigned long long)uevent_seqnum);
>>>>>>> + if (retval) {
>>>>>>> + mutext_unlock(&uevent_sock_mutex);
>>>>>>> + goto exit;
>>>>>>> + }
>>>>>>> + retval = add_uevent_var(env, "PREV_SEQNUM=%llu",
>>>>>>> + (unsigned long long)kobj->last_uevent_seqnum);
>>>>>>
>>>>>> I'm confused, why are we storing "last_uevent_seqnum" here at all, if it
>>>>>> is only just the current one-1? What else do you do with that value?
>>>>> I'm sorry, I've sent incorrect patch by mistake. This is the correct one:
>>>>
>>>> Ok, this is a bit better (but it's not in a format I could apply it in,
>>>> if I wanted to...)
>>>>
>>>>> From 91fa0d0f5c52b959a5ca4cc79d1a4f5970db75e2 Mon Sep 17 00:00:00 2001
>>>>> From: Peter Rajnoha <prajnoha@xxxxxxxxxx>
>>>>> Date: Thu, 12 Oct 2017 11:33:48 +0200
>>>>> Subject: [PATCH] kobject: record and send PREV_SEQNUM with uevents
>>>>>
>>>>> Record last uevent seqnum that was used with kobject's uevent and send
>>>>> it with next uevent as PREV_SEQNUM=<num> variable.
>>>>>
>>>>> This enhances the way we are able to track and handle uevents in userspace
>>>>> if we are trying to catch up with all the uevents that were generated
>>>>> while we were not listening or the uevents which were not delivered.
>>>>> This may happen if the userspace listener is not running yet when the
>>>>> uevent is triggered or there's a period of time when it's not listening
>>>>> (e.g. because the userspace listener is being restarted while a new
>>>>> uevent fires).
>>>>>
>>>>> A userspace listener can compare the last uevent seqnum it knows about
>>>>> with the last uevent seqnum the kernel reports through uevents delivered
>>>>> to userspace, both genuine and synthetic ones (the ones generated by
>>>>> writing uevent action name to /sys/.../uevent file). Then, if needed,
>>>>> userspace may execute appropriate actions based on that. We don't need
>>>>> to reevaluate and recheck all items, instead, we can concentrate only
>>>>> on items for which we really and actually need to reflect changed state
>>>>> in kernel while the userspace listener was not listening for uevents.
>>>>>
>>>>> Signed-off-by: Peter Rajnoha <prajnoha@xxxxxxxxxx>
>>>>
>>>> I don't really understand what userspace can do with this. Why does it
>>>> matter what the sequence number was for a specific kobject? What can
>>>> happen because of this?
>>>>
>>>> Do you have any example libudev code changes to show this in action so I
>>>> can maybe understand the need?
>>>>
>>>
>>> Not a libudev code change for this, at least I don't have this in plan
>>> at this moment.
>>>
>>> It's more related to the way the udev rules are processed and, in
>>> general, the way any other uevent listener can react to uevents. So the
>>> change is in how "end consumer" of the uevent does its processing.
>>>
>>> At the moment, the sequence number is used globally, we don't have
>>> per-kobject sequence numbers. So we can't use current SEQNUM that is
>>> attached to each uevent to tell whether we have missed previous
>>> uevent(s) for concrete device or not, we just know that some uevents for
>>> some devices were not processed. Therefore, when we're triggering
>>> uevents from userspace (udevadm trigger), we need to reevaluate all the
>>> rules for all devices, no matter if it's really needed or not.
>>>
>>> For example, when we're switching from initramfs to root fs, we stop the
>>> first udevd instance which is in initramfs, we keep the database (at the
>>> moment, only for selected devices where we know that it's safe to do so
>>> - same rules used in initramfs as in root fs). Then we switch over to
>>> root fs and start the udevd again and we trigger uevents to fully
>>> refresh the state, taking also the existing database into account.
>>>
>>> Now, it would help if we knew whether there were any uevents issued
>>> after we stopped udevd in initramfs and before we started a new one from
>>> root fs. If there are no uevents in between, we don't need to reevaluate
>>> all the rules, we can just keep the existing database content. This can
>>> become prominent if we have more devices in the system. It helps to
>>> minimize the number of execution during boot, the overall impact of
>>> triggering uevents and related processing, hence also making it more
>>> straightforward to watch the state of the device and possibly debug
>>> related problems.
>>>
>>> The same principle applies to any uevent listener in userspace, not just
>>> udevd:
>>>
>>> - recording last seen SEQNUM
>>>
>>> - when new instance of the listener is running
>>>
>>> - trigger synthetic uevents to refresh state
>>>
>>> - comparing the PREV_SEQNUM from uevent generated with stored SEQNUM:
>>
>> Why can't you just compare SEQNUM with the "last seen" SEQNUM, they
>> should just be 1 apart, right? Isn't that what is done today? How will
>> this differ from looking at PREV_SEQNUM?
>
> The SEQNUM is global, not per device (per kobject). It's 1 apart if you
> have only 1 device, but that doesn't apply if you have more devices with
> possible uevents interleaved.
>
> For example (just provoking uevents here for clear example):
>
> echo add > /sys/block/sda/uevent
> echo add > /sys/block/sda/uevent
> echo add > /sys/block/sdb/uevent
> echo add > /sys/block/sda/uevent
>
> Generates these uevents (I've removed uninteresting parts):
>
> KERNEL[45.739557] add
> ACTION=add
> DEVNAME=/dev/sda
> PREV_SEQNUM=2096
> SEQNUM=2112
>
> KERNEL[55.131767] add
> ACTION=add
> DEVNAME=/dev/sda
> PREV_SEQNUM=2112
> SEQNUM=2113
> SUBSYSTEM=block
>
> KERNEL[59.236131] add
> ACTION=add
> DEVNAME=/dev/sdb
> PREV_SEQNUM=2097
> SEQNUM=2114
>
> KERNEL[61.708551] add
> ACTION=add
> DEVNAME=/dev/sda
> PREV_SEQNUM=2113
> SEQNUM=2115
>
> Here, the PREV_SEQNUM is 1 apart with SEQNUM for the 2nd uevent (sda),
> because there was no other uevent (for a different device). But if you
> check the 4th uevent (sda) where previous one was generated for another
> device (sdb), the PREV_SEQNUM is not 1 apart. So in general, it may be
> "n" apart from previous SEQNUM for a *particular device*. Globally, for
> all devices, it is 1 apart, yes, but we don't care about that as that is
> not quite useful.
>
> So without PREV_SEQNUM, we can detect we have missed uevents, but we
> don't know for which devices exactly, so we need to refresh everything,
> not just the items we actually need. With PREV_SEQNUM we can identify
> exact devices.
Also, without PREV_SEQNUM, to tell at least whether I have missed some
uevents, I need to check /sys/kernel/uevent_seqnum and then compare with
what I have in userspace as "last seen seqnum". But that is a race -
after reading the uevent_seqnum file, more uevents may trigger...

I wouldn't need to read the uevent_seqnum file though if I just read the
SEQNUM directly from the uevent. But then, if I detected the SEQNUM is
not good (it's not the "last know seqnum" + 1), I'd need to trigger the
full refresh for all devices while processing that one particular
uevent, and that's becoming messy.

While with the PREV_SEQNUM, we don't need to care, we just trigger the
synthetic uevents whenever we need and we let the check to be done *per
each device* while processing those triggered uevents:

- if PREV_SEQNUM matches with what we have in records as "last seen"
for that particular device, we skip the uevent (we keep the old records
as they don't need any refresh)

- if PREV_SEQNUM doesn't match, we refresh ONLY that one device

So it's also cleaner way to deal with refreshes, besides the fact it's
done per-device, not the "all devices or nothing" like it is with having
only the global seqnum view.

--
Peter