Re: [PATCH] pid: Allow frozen userspace to reboot from non-init pid ns

From: Eric W. Biederman
Date: Wed Oct 11 2023 - 23:58:18 EST


Brian Geffon <bgeffon@xxxxxxxxxx> writes:

> On Fri, Sep 29, 2023 at 4:09 PM Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>
>> On Fri, Sep 29, 2023 at 01:44:42PM -0400, Brian Geffon wrote:
>> > When the system has a frozen userspace, for example, during hibernation
>> > the child reaper task will also be frozen. Attempting to deliver a
>> > signal to it to handle the reboot(2) will ultimately lead to the system
>> > hanging unless userspace is thawed.
>> >
>> > This change checks if the current task is the suspending task and if so
>> > it will allow it to proceed with a reboot from the non-init pid ns.
>>
>> I don't know the code flow too well here, but shouldn't init_pid_ns
>> always be doing the reboot regardless of anything else?
>
> I think the point of this is, normally the reaper is runnable and so
> an appropriate signal will be delivered allowing them to also clean up
> [2]. In our case, they won't be runnable and doing this wouldn't make
> sense.

The entire reboot_pid_ns thing is just a polite way of keeping
applications like /sbin/reboot working inside a pid namespace.

Ordinarily the process calling reboot (inside the container) won't
have the privileges to request an entire system reboot. So I don't
see anything making sense to promote that reboot into a system-wide
reboot.

Which leads me to the question. What is actually happening with
hibernation that we want something inside a pid namespace to somehow
have the permissions to reboot the entire machine?

>> Also how is this syscall running if current is frozen? This feels weird
>> to me... shouldn't the frozen test be against pid_ns->child_reaper
>> instead of current?
>
> The task which froze the system won't be frozen. To make sure this
> happens, it will have the flag PF_SUSPEND_TASK added, so we know that
> if we have this flag we're the only running user space task [1].

Someone has a task inside a container that is successfully suspending
the entire system?

I don't see how that makes sense.

But on the level that it somehow does I would put a test in
kernel/reboot.c something like:

	/*
	 * If the caller can't perform a normal reboot call
	 * reboot_pid_ns
	 */
	if ((pid_ns != &init_pid_ns) &&
	    !((current->flags & PF_SUSPEND_TASK) && capable(CAP_SYS_BOOT))) {
		return reboot_pid_ns(pid_ns, cmd);
	}

Making reboot_pid_ns responsible for the logic that should be bypassing
it is quite confusing.
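
To make the intent of that condition easier to check, here is a rough
userspace model of the same predicate. It is only a sketch, not kernel
code: struct model_task and diverts_to_reboot_pid_ns() are hypothetical
names invented for illustration, and the three booleans merely stand in
for the pid_ns comparison, the PF_SUSPEND_TASK flag test and
capable(CAP_SYS_BOOT).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel state the check looks at. */
struct model_task {
	bool in_init_pid_ns;	/* models pid_ns == &init_pid_ns */
	bool pf_suspend_task;	/* models current->flags & PF_SUSPEND_TASK */
	bool cap_sys_boot;	/* models capable(CAP_SYS_BOOT) */
};

/*
 * Returns true when the modelled caller would be diverted to
 * reboot_pid_ns() rather than continuing with a normal reboot.
 */
static bool diverts_to_reboot_pid_ns(const struct model_task *t)
{
	assert(t != NULL);

	return !t->in_init_pid_ns &&
	       !(t->pf_suspend_task && t->cap_sys_boot);
}
```

Under this model a caller in the initial pid namespace is never
diverted, an ordinary caller in a child pid namespace always is, and a
caller in a child pid namespace skips reboot_pid_ns only when it both
carries PF_SUSPEND_TASK and has CAP_SYS_BOOT.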

> I hope my understanding is correct and it makes sense. Thanks for
> taking the time to review.
>
> Brian
>
> 1. https://elixir.bootlin.com/linux/latest/source/kernel/power/process.c#L130
> 2. https://elixir.bootlin.com/linux/latest/source/kernel/pid_namespace.c#L327


I really don't know if allowing PF_SUSPEND_TASK so that hibernation and
the like can work from inside a container makes any sense at all.

But the above is roughly how I would make it work.

Eric