Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

From: Tetsuo Handa
Date: Sat Jul 04 2020 - 02:59:00 EST


On 2020/07/04 7:25, Eric W. Biederman wrote:
> Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> writes:
>
>> On 2020/07/02 22:08, Eric W. Biederman wrote:
>>>> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
>>>> that use of flush_delayed_fput() has to be careful. Al, is it safe to call
>>>> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be
>>>> called from both kernel thread and from process context (e.g. init_module()
>>>> syscall by /sbin/insmod )) ?
>>>
>>> And __fput_sync needs to be even more careful.
>>> umd_load_blob is called in these changes without any locks held.
>>
>> But where is the guarantee that a thread which called flush_delayed_fput() waits for
>> the completion of processing _all_ "struct file" linked into delayed_fput_list ?
>> If some other thread or delayed_fput_work (scheduled by fput_many()) called
>> flush_delayed_fput() between blob_to_mnt()'s fput(file) and flush_delayed_fput()
>> sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" which
>> needs to be processed before execve(), can't it?
>
> As a module the guarantee is we call task_work_run.

No. It is possible that blob_to_mnt() is called by a kernel thread which was
started by init_module() syscall by /sbin/insmod .

> Built into the kernel the guarantee as best I can trace it is that
> kthreadd hasn't started, and as such nothing that is scheduled has run
> yet.

Have you ever checked how early the kthreadd (PID=2) gets started?

----------
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2306,6 +2306,7 @@ static __latent_entropy struct task_struct *copy_process(
trace_task_newtask(p, clone_flags);
uprobe_copy_process(p, clone_flags);

+ printk(KERN_INFO "Created PID: %u Comm: %s\n", p->pid, p->comm);
return p;

bad_fork_cancel_cgroup:
----------

----------
[ 0.090757][ T0] pid_max: default: 65536 minimum: 512
[ 0.090890][ T0] LSM: Security Framework initializing
[ 0.090890][ T0] Mount-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
[ 0.090890][ T0] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
[ 0.090890][ T0] Disabled fast string operations
[ 0.090890][ T0] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 1024
[ 0.090890][ T0] Last level dTLB entries: 4KB 1024, 2MB 1024, 4MB 1024, 1GB 4
[ 0.090890][ T0] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[ 0.090890][ T0] Spectre V2 : Spectre mitigation: kernel not compiled with retpoline; no mitigation available!
[ 0.090890][ T0] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[ 0.090890][ T0] SRBDS: Unknown: Dependent on hypervisor status
[ 0.090890][ T0] MDS: Mitigation: Clear CPU buffers
[ 0.090890][ T0] Freeing SMP alternatives memory: 24K
[ 0.090890][ T0] Created PID: 1 Comm: swapper/0
[ 0.090890][ T0] Created PID: 2 Comm: swapper/0
[ 0.090890][ T1] smpboot: CPU0: Intel(R) Core(TM) i5-4440S CPU @ 2.80GHz (family: 0x6, model: 0x3c, stepping: 0x3)
[ 0.091000][ T2] Created PID: 3 Comm: kthreadd
[ 0.091995][ T2] Created PID: 4 Comm: kthreadd
[ 0.093028][ T2] Created PID: 5 Comm: kthreadd
[ 0.093997][ T2] Created PID: 6 Comm: kthreadd
[ 0.094995][ T2] Created PID: 7 Comm: kthreadd
[ 0.096037][ T2] Created PID: 8 Comm: kthreadd
(...snipped...)
[ 0.135716][ T2] Created PID: 13 Comm: kthreadd
[ 0.135716][ T1] smp: Bringing up secondary CPUs ...
[ 0.135716][ T2] Created PID: 14 Comm: kthreadd
[ 0.135716][ T2] Created PID: 15 Comm: kthreadd
[ 0.135716][ T2] Created PID: 16 Comm: kthreadd
[ 0.135716][ T2] Created PID: 17 Comm: kthreadd
[ 0.135716][ T2] Created PID: 18 Comm: kthreadd
[ 0.135716][ T1] x86: Booting SMP configuration:
(...snipped...)
[ 0.901990][ T1] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
[ 0.902145][ T1] pci 0000:00:0f.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[ 0.902213][ T1] pci 0000:02:00.0: CLS mismatch (32 != 64), using 64 bytes
[ 0.902224][ T1] Trying to unpack rootfs image as initramfs...
[ 1.107993][ T1] Freeing initrd memory: 18876K
[ 1.109049][ T1] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[ 1.111003][ T1] software IO TLB: mapped [mem 0xab000000-0xaf000000] (64MB)
[ 1.112136][ T1] check: Scanning for low memory corruption every 60 seconds
[ 1.115040][ T2] Created PID: 52 Comm: kthreadd
[ 1.116110][ T1] workingset: timestamp_bits=46 max_order=20 bucket_order=0
[ 1.120936][ T1] SGI XFS with ACLs, security attributes, verbose warnings, quota, no debug enabled
[ 1.129626][ T2] Created PID: 53 Comm: kthreadd
[ 1.131403][ T2] Created PID: 54 Comm: kthreadd
----------

kthreadd (PID=2) is created by swapper/0 (PID=0) immediately after init (PID=1) was created by
swapper/0 (PID=0). It is even before secondary CPUs are brought up, and far earlier than unpacking
initramfs.

And how can we prove that blob_to_mnt() is only called by a kernel thread before some kernel
thread that interferes fput() starts running? blob_to_mnt() needs to be prepared for being
called after many processes already started running.

>
>> Also, I don't know how convoluted the dependency of all "struct file" linked into
>> delayed_fput_list might be, for there can be "struct file" which will not be a
>> simple close of tmpfs file created by blob_to_mnt()'s file_open_root() request.
>>
>> On the other hand, although __fput_sync() cannot be called from !PF_KTHREAD threads,
>> there is a guarantee that __fput_sync() waits for the completion of "struct file"
>> which needs to be flushed before execve(), isn't there?
>
> There is really not a good helper or helpers, and this code suggests we
> have something better. Right now I have used the existing helpers to
> the best of my ability. If you or someone else wants to write a better
> version of flushing so that exec can happen be my guest.
>
> As far as I can tell what I have is good enough.

Just saying what you think is not a "review". I'm waiting for answer from Al Viro
because I consider that Al will be the most familiar with fput()'s behavior.
At least I consider that

if (current->flags & PF_KTHREAD) {
__fput_sync(file);
} else {
fput(file);
task_work_run();
}

is a candidate for closing the race window. And depending on Al's answer,
removing

BUG_ON(!(task->flags & PF_KTHREAD));

from __fput_sync() and unconditionally using

__fput_sync(file);

from blob_to_mnt() might be the better choice. Anyway, I consider that
Al's response is important for this "review".

>
>>> We fundamentally AKA in any correct version of this code need to flush
>>> the file descriptor before we call exec or exec can not open it a
>>> read-only denying all writes from any other opens.
>>>
>>> The use case of flush_delayed_fput is exactly the same as that used
>>> when loading the initramfs.
>>
>> When loading the initramfs, the number of threads is quite few (which
>> means that the possibility of hitting the race window and convoluted
>> dependency is small).
>
> But the reality is the code run very early, before the initramfs is
> initialized in practice.

Such expectation is not a reality.

>
>> But like EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s
>> flush_delayed_fput() might be called after many number of threads already
>> started running.
>
> At which point the code probably won't be runnig from a kernel thread
> but instead will be running in a thread where task_work_run is relevant.

No. It is possible that blob_to_mnt() is called by a kernel thread which was
started by init_module() syscall by /sbin/insmod .

>
> At worst it is a very small race, where someone else in another thread
> starts flushing the file. Which means the file could still be
> completely close before exec. Even that is not necessarily fatal,
> as the usermode driver code has a respawn capability.
>
> Code that is used enough that it hits that race sounds like a very
> good problem to have from the perspective of the usermode driver code.

In general, unconditionally retrying call_usermodehelper() when it returned
a negative value (e.g. -ENOENT, -ENOMEM, -EBUSY) is bad. I don't know which
code is an implementation of "a respawn capability"; I'd like to check where
that code is and whether that code is checking -ETXTBSY.