Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
From: Eric W. Biederman
Date: Mon Jun 04 2018 - 10:45:04 EST
"Hatayama, Daisuke" <d.hatayama@xxxxxxxxxxxxxx> writes:
> Hello,
>
> Thanks for this patch, Eric.
>
>> -----Original Message-----
>> From: Eric W. Biederman [mailto:ebiederm@xxxxxxxxxxxx]
>> Sent: Monday, June 4, 2018 3:51 AM
>> To: Hatayama, Daisuke <d.hatayama@xxxxxxxxxxxxxx>
>> Cc: 'gregkh@xxxxxxxxxxxxxxxxxxx' <gregkh@xxxxxxxxxxxxxxxxxxx>;
>> 'tj@xxxxxxxxxx' <tj@xxxxxxxxxx>; Okajima, Toshiyuki
>> <toshi.okajima@xxxxxxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
>> 'ebiederm@xxxxxxxxxxxxxxxxxx' <ebiederm@xxxxxxxxxxxxxxxxxx>
>> Subject: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
>>
>>
>> Over the years two bugs have crept into the code that handled the last
>> returned kernfs directory being deleted, and handled general seeking
>> in kernfs directories. The result is that reading the /sys/module
>> directory while moduled are loading and unloading will sometimes
>> skip over a module that was present for the entire duration of
>> the directory read.
>>
>> The first bug is that kernfs_dir_next_pos was advancing the position
>> in the directory even when kernfs_dir_pos had found the starting
>> kernfs directory entry was not usable. In this case kernfs_dir_pos is
>> guaranteed to return the directory entry past the starting directory
>> entry, making it so that advancing the directory position is wrong.
>>
>> The second bug is that kernfs_dir_pos when the saved kernfs_node
>> did not validate, was not always returning a directory after
>> the the target position, but was sometimes returning the directory
>> entry just before the target position.
>>
>> To correct these issues and to make the code easily readable and
>> comprehensible several cleanups are made.
>>
>> - A function kernfs_dir_next is factored out to make it straight-forward
>> to find the next node in a kernfs directory.
>>
>> - A function kernfs_dir_skip_pos is factored out to skip over directory
>> entries that should not be shown to userspace. This function is called
>> from kernfs_fop_readdir directly removing the complication of calling
>> it from the other directory advancement functions.
>>
>> - The kernfs_put of the saved directory entry is moved from kernfs_dir_pos
>> to kernfs_fop_readdir. Keeping it with the kernfs_get when it is saved
>> and ensuring the directory position advancment functions can reference
>> the saved node without complications.
>>
>> - The function kernfs_dir_next_pos is modified to only advance the directory
>> position in the common case when kernfs_dir_pos validates and returns
>> the starting kernfs node. For all other cases kernfs_dir_pos is guaranteed
>> to return a directory position in advance of the starting directory position.
>>
>> - kernfs_dir_pos is given a significant make over. It's essense remains the
>> same but some siginficant changes were made.
>>
>> + The offset is validated to be a valid hash value at the very beginning.
>> The validation is updated to handle the fact that neither 0 nor 1 are
>> valid hash values.
>>
>> + An extra test is added at the end of the rbtree lookup to ensure
>> the returned position is at or beyond the target position.
>>
>> + kernfs_name_compare is used during the rbtree lookup instead of just
>> comparing
>> the hash. This allows the the passed namespace parameter to be used, and
>> if it is available from the saved entry the old saved name as well.
>>
>> + The validation of the saved kernfs node now checks for two cases.
>> If the saved node is returnable.
>> If the name of the saved node is usable during lookup.
>>
>> In net this should result in a easier to understand, and more correct
>> version of directory traversal for kernfs directories.
>>
>> Reported-by: "Hatayama, Daisuke" <d.hatayama@xxxxxxxxxxxxxx>
>> Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number lookup")
>> Fixes: 1e5289c97bba ("sysfs: Cache the last sysfs_dirent to improve readdir
>> scalability v2")
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
>> ---
>>
>> Can you test this and please verify it fixes your issue?
>>
>
> I tried this patch on top of v4.17 but the system fails to boot
> without detecting root disks by dracut like this:
>
> [ 196.121294] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 196.672175] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 197.219519] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 197.768997] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 198.312498] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 198.856841] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 199.403190] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 199.945999] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 200.490281] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 201.071177] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 201.074958] dracut-initqueue[324]: Warning: Could not boot.
> Starting Setup Virtual Console...
> [ OK ] Started Setup Virtual Console.
> [ 201.245921] audit: type=1130 audit(1528132266.260:12): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
> [ 201.256378] audit: type=1131 audit(1528132266.260:13): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
> Starting Dracut Emergency Shell...
> Warning: /dev/fedora/root does not exist
> Warning: /dev/fedora/swap does not exist
> Warning: /dev/mapper/fedora-root does not exist
>
> Generating "/run/initramfs/rdsosreport.txt"
>
>
> Entering emergency mode. Exit the shell to continue.
> Type "journalctl" to view system logs.
> You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
> after mounting them and attach it to a bug report.
>
>
> dracut:/# ls /sys/module
> 8139cp dm_bufio kernel psmouse ttm
> 8139too dm_mirror keyboard pstore uhci_hcd
> 8250 dm_mod kgdboc qemu_fw_cfg usbcore
> acpi dm_snapshot kgdbts qxl usbhid
> acpiphp drm libahci random uv_nmi
> ahci drm_kms_helper libata rcupdate virtio
> ata_generic dynamic_debug md_mod rcutree virtio_console
> ata_piix edac_core mii rng_core virtio_pci
> battery ehci_hcd module scsi_mod virtio_ring
> block fb mousedev serio_raw vt
> button firmware_class pata_acpi sg watchdog
> configfs fscrypto pci_hotplug slab_common workqueue
> cpufreq hid pcie_aspm spurious xhci_hcd
> cpuidle hid_magicmouse pciehp sr_mod xz_dec
> crc32c_intel hid_ntrig pcmcia srcutree zswap
> cryptomgr i8042 pcmcia_core suspend
> debug_core intel_idle printk sysrq
> devres ipv6 processor tcp_cubic
> dracut:/# lsmod
> Module Size Used by
> qxl 77824 1
> drm_kms_helper 196608 1
> ttm 126976 1
> drm 454656 4 qxl,ttm
> virtio_console 32768 0
> ata_generic 16384 0
> crc32c_intel 24576 0
> 8139too 40960 0
> virtio_pci 28672 0
> 8139cp 32768 0
> virtio_ring 28672 2 virtio_pci
> serio_raw 16384 0
> pata_acpi 16384 0
> virtio 16384 2 virtio_pci
> mii 16384 2 8139too
> qemu_fw_cfg 16384 0
> dracut:/#
>
> OTOH, there's no issue on the pure v4.17 kernel.
>
> As above, ls /sys/module looks apparently good. But I guess any part of
> behavior of getdentries() on sysfs must have changed, affecting the disk
> detection...
I haven't been able to reproduce this yet. My test system boots fine.
Which fedora are you testing on?
>> fs/kernfs/dir.c | 109
>> ++++++++++++++++++++++++++++++++++----------------------
>> 1 file changed, 67 insertions(+), 42 deletions(-)
>>
>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>> index 89d1dc19340b..8148b5fec48d 100644
>> --- a/fs/kernfs/dir.c
>> +++ b/fs/kernfs/dir.c
>> @@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode *inode,
>> struct file *filp)
>> return 0;
>> }
>>
>> +static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos)
>> +{
>> + struct rb_node *node = rb_next(&pos->rb);
>> + return node ? rb_to_kn(node) : NULL;
>> +}
>> +
>> static struct kernfs_node *kernfs_dir_pos(const void *ns,
>> - struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
>> + struct kernfs_node *parent, loff_t off, struct kernfs_node *saved)
>> {
>> - if (pos) {
>> - int valid = kernfs_active(pos) &&
>> - pos->parent == parent && hash == pos->hash;
>> - kernfs_put(pos);
>> - if (!valid)
>> - pos = NULL;
>> - }
>> - if (!pos && (hash > 1) && (hash < INT_MAX)) {
>> - struct rb_node *node = parent->dir.children.rb_node;
>> - while (node) {
>> - pos = rb_to_kn(node);
>> -
>> - if (hash < pos->hash)
>> - node = node->rb_left;
>> - else if (hash > pos->hash)
>> - node = node->rb_right;
>> - else
>> - break;
>> + struct kernfs_node *pos;
>> + struct rb_node *node;
>> + unsigned int hash;
>> + const char *name = "";
>> +
>> + /* Is off a valid name hash? */
>> + if ((off < 2) || (off >= INT_MAX))
>> + return NULL;
>> + hash = off;
>> +
>> + /* Is the saved position usable? */
>> + if (saved) {
>> + /* Proper parent and hash? */
>> + if ((parent != saved->parent) || (saved->hash != hash)) {
>> + saved = NULL;
>
> name is uninitialized in this path.
It is. name is initialized to "" see above.
>> + } else {
>> + if (kernfs_active(saved))
>> + return saved;
>> + name = saved->name;
>> }
>> }
>> - /* Skip over entries which are dying/dead or in the wrong namespace
>> */
>> - while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
>> - struct rb_node *node = rb_next(&pos->rb);
>> - if (!node)
>> - pos = NULL;
>> +
>> + /* Find the closest pos to the hash we are looking for */
>> + pos = NULL;
>> + node = parent->dir.children.rb_node;
>> + while (node) {
>> + int result;
>> +
>> + pos = rb_to_kn(node);
>> + result = kernfs_name_compare(hash, name, ns, pos);
>> + if (result < 0)
>> + node = node->rb_left;
>> + else if (result > 0)
>> + node = node->rb_right;
>> else
>> - pos = rb_to_kn(node);
>> + break;
>> }
>> +
>> + /* Ensure pos is at or beyond the target position */
>> + if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
>> + pos = kernfs_dir_next(pos);
>> +
>
> Why not trying to find the final target object here?
>
> Looking at the code, I think the operation needed here is to find the
> smallest active object in the same namespace from the objects larger
> than the saved object given as argument. The saved object appears to
> be used as cache. I think dividing the process into kernfs_dir_pos()
> is not necessary.
What you are suggesting below is wrong when restarting readdir.
Ordinarily saved is the next entry we are going to return from readdir.
When dir_emit does not succeed we stop returnning entries and when
we restart readdir we need to attempt dir_emit on saved again.
Always advancing past saved will cause us to skip saved in the
event a single readdir is not enough.
The restart (kernfs_dir_pos) must return saved or if saved is now gone,
the first entry past saved.
Saved dying in the middle is what I believe caused the original issue.
Making all of this clear is part of the reason I moved the
kernfs_dir_skip_pos logic into it's own function.
Eric
>
> I mean like this:
>
> # git diff
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 8148b5f..eeffc8c 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1605,13 +1605,13 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
>
> /* Is the saved position usable? */
> if (saved) {
> + name = saved->name;
> /* Proper parent and hash? */
> if ((parent != saved->parent) || (saved->hash != hash)) {
> saved = NULL;
> - } else {
> - if (kernfs_active(saved))
> - return saved;
> - name = saved->name;
> + } else if (kernfs_active(saved)) {
> + pos = saved;
> + goto skip;
> }
> }
>
> @@ -1631,22 +1631,14 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
> break;
> }
>
> +skip:
> /* Ensure pos is at or beyond the target position */
> - if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
> + if (pos && (kernfs_name_compare(hash, name, ns, pos) <= 0))
> pos = kernfs_dir_next(pos);
>
> return pos;
> }
>
> -static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> - struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
> -{
> - struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
> - if (likely(pos == start))
> - pos = kernfs_dir_next(pos);
> - return pos;
> -}
> -
> static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
> struct kernfs_node *pos)
> {
> @@ -1672,7 +1664,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
>
> for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
> (pos = kernfs_dir_skip_pos(ns, pos));
> - pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
> + pos = kernfs_dir_pos(ns, parent, ctx->pos, pos)) {
> const char *name = pos->name;
> unsigned int type = dt_type(pos);
> int len = strlen(name);
>
>> return pos;
>> }
>>
>> static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
>> - struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
>> + struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
>> {
>> - pos = kernfs_dir_pos(ns, parent, ino, pos);
>> - if (pos) {
>> - do {
>> - struct rb_node *node = rb_next(&pos->rb);
>> - if (!node)
>> - pos = NULL;
>> - else
>> - pos = rb_to_kn(node);
>> - } while (pos && (!kernfs_active(pos) || pos->ns != ns));
>> - }
>> + struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
>> + if (likely(pos == start))
>> + pos = kernfs_dir_next(pos);
>> + return pos;
>> +}
>> +
>> +static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
>> + struct kernfs_node *pos)
>> +{
>> + /* Skip entries we don't want to return to userspace */
>> + while (pos && !(kernfs_active(pos) && (pos->ns == ns)))
>> + pos = kernfs_dir_next(pos);
>> return pos;
>> }
>>
>> @@ -1638,7 +1660,7 @@ static int kernfs_fop_readdir(struct file *file, struct
>> dir_context *ctx)
>> {
>> struct dentry *dentry = file->f_path.dentry;
>> struct kernfs_node *parent = kernfs_dentry_node(dentry);
>> - struct kernfs_node *pos = file->private_data;
>> + struct kernfs_node *pos, *saved = file->private_data;
>> const void *ns = NULL;
>>
>> if (!dir_emit_dots(file, ctx))
>> @@ -1648,23 +1670,26 @@ static int kernfs_fop_readdir(struct file *file,
>> struct dir_context *ctx)
>> if (kernfs_ns_enabled(parent))
>> ns = kernfs_info(dentry->d_sb)->ns;
>>
>> - for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
>> - pos;
>> + for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
>> + (pos = kernfs_dir_skip_pos(ns, pos));
>> pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
>> const char *name = pos->name;
>> unsigned int type = dt_type(pos);
>> int len = strlen(name);
>> ino_t ino = pos->id.ino;
>>
>> - ctx->pos = pos->hash;
>> - file->private_data = pos;
>> - kernfs_get(pos);
>> + kernfs_put(saved);
>> + saved = pos;
>> + ctx->pos = saved->hash;
>> + file->private_data = saved;
>> + kernfs_get(saved);
>>
>> mutex_unlock(&kernfs_mutex);
>> if (!dir_emit(ctx, name, len, ino, type))
>> return 0;
>> mutex_lock(&kernfs_mutex);
>> }
>> + kernfs_put(saved);
>> mutex_unlock(&kernfs_mutex);
>> file->private_data = NULL;
>> ctx->pos = INT_MAX;
>> --
>> 2.14.1
>>
>>