Re: [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN

From: Casey Schaufler
Date: Mon May 22 2023 - 15:14:10 EST


On 5/21/2023 7:53 PM, Tianjia Zhang wrote:
> Hi Casey,
>
> On 5/18/23 8:01 AM, Casey Schaufler wrote:
>> On 5/16/2023 5:05 AM, Tianjia Zhang wrote:
>>> Hi Casey,
>>>
>>> On 5/12/23 12:17 AM, Casey Schaufler wrote:
>>>> On 5/11/2023 12:05 AM, Tianjia Zhang wrote:
>>>>> Separate the fine-grained capability CAP_BLOCK_ADMIN from CAP_SYS_ADMIN.
>>>>> For backward compatibility, the CAP_BLOCK_ADMIN capability is included
>>>>> within CAP_SYS_ADMIN.
>>>>>
>>>>> Some database products rely on shared storage to complete the
>>>>> write-once-read-multiple and write-multiple-read-multiple functions.
>>>>> When HA occurs, they rely on the PR (Persistent Reservations) protocol
>>>>> provided by the storage layer to manage block device permissions to
>>>>> ensure data correctness.
>>>>>
>>>>> The PR implementation for existing block devices in the Linux kernel
>>>>> requires CAP_SYS_ADMIN, which carries too many sensitive permissions
>>>>> and may lead to risks such as container escape. The kernel needs to
>>>>> provide more fine-grained permission management, like CAP_NET_ADMIN,
>>>>> so that online products do not have to rely directly on root to run.
>>>>>
>>>>> CAP_BLOCK_ADMIN can also provide support for other block device
>>>>> operations that require CAP_SYS_ADMIN capabilities in the future,
>>>>> ensuring that applications run with least privilege.
>>>>
>>>> Can you demonstrate that there are cases where a program that needs
>>>> CAP_BLOCK_ADMIN does not also require CAP_SYS_ADMIN for other
>>>> operations?
>>>> How much of what's allowed by CAP_SYS_ADMIN would be allowed by
>>>> CAP_BLOCK_ADMIN? If use of a new capability is rare it's difficult to
>>>> justify.
>>>>
>>>
>>> In earlier, non-container scenarios, the block device is a shared
>>> device, because the business system generally operates on a file system
>>> on the block device. Operating on the block device directly therefore
>>> has a high probability of affecting other processes on the same host,
>>> so requiring the CAP_SYS_ADMIN capability is reasonable.
>>>
>>> But for a database running in a container scenario, especially a
>>> container scenario on the cloud, it is likely that a container
>>> exclusively occupies a block device. That is to say, a container's
>>> access to the block device will not affect other processes, so there is
>>> no need to obtain the broader CAP_SYS_ADMIN capability.
>>
>> If I understand correctly, you're saying that the process that requires
>> CAP_BLOCK_ADMIN in the container won't also require CAP_SYS_ADMIN for
>> other operations.
>>
>> That's good, but it isn't clear how a process on bare metal would
>> require CAP_SYS_ADMIN while the same process in a container wouldn't.
>>
>>>
>>> For a distributed file system of the write-once-read-many kind,
>>> recovery must be correct: when recovery occurs, it is necessary to
>>> ensure that no in-flight I/O completes after recovery.
>>>
>>> This can be guaranteed by performing operations such as SCSI/NVMe
>>> Persistent Reservations on the block devices used by the distributed
>>> file system.
>>
>> Does your cloud based system always run "real" devices? My
>> understanding is that cloud based deployment usually uses
>> virtual machines and virtio or other simulated devices.
>> A container deployment in the cloud seems unlikely to be able
>> to take advantage of block administration. But I can't say
>> I know the specifics of your environment.
>>
>>> Therefore, at present, we only need permission support for the
>>> control commands of such container-exclusive block devices.
>>
>> This looks like an extremely special case in which breaking out
>> block management would make sense.
>>
> Our scenario is like this. In simple terms, a distributed database has
> a read-write instance and one or more read-only instances. Each instance
> runs in an isolated container. All containers share the same block
> device.
>
> In addition to the database instance, there is also a control program
> running on the control plane in the container. The database ensures
> the correctness of the data through the PR (Persistent Reservations)
> of the block device. This operation is also the only operation in the
> container that requires CAP_SYS_ADMIN privileges.
>
> For the system as a whole, it makes little difference whether it runs
> on a VM or on bare metal.
>
> In order to support the PR of block devices, we need to grant
> CAP_SYS_ADMIN permissions to the container, which not only greatly
> increases the risk of container escape, but also forces us to
> configure the permissions of the container carefully. Many of the
> container escapes that have occurred were caused by this.
>
> This is essentially a problem of permission isolation. We hope to
> split the smallest possible set of permissions out of CAP_SYS_ADMIN to
> support the necessary operations, and to avoid granting CAP_SYS_ADMIN
> to containers as much as possible.

Your use case is interesting, but not compelling. While you may have
come up with a specific case where you can completely break CAP_BLOCK_ADMIN
out from CAP_SYS_ADMIN, it's hardly general.

>
> Kind regards,
> Tianjia
>