Re: [RFC v2 1/2] kernel/sysctl: support setting sysctl parameters from kernel command line
From: Vlastimil Babka
Date: Thu Mar 26 2020 - 09:29:55 EST
On 3/25/20 11:20 PM, Eric W. Biederman wrote:
> Vlastimil Babka <vbabka@xxxxxxx> writes:
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -1980,6 +1980,68 @@ int __init sysctl_init(void)
>> return 0;
>> }
>>
>> +/* Set sysctl value passed on kernel command line. */
>> +int process_sysctl_arg(char *param, char *val,
>> + const char *unused, void *arg)
>> +{
>> + size_t count;
>> + char *remaining;
>> + int err;
>> + loff_t ppos = 0;
>> + struct ctl_table *ctl, *found = NULL;
>> +
>> + if (strncmp(param, "sysctl.", sizeof("sysctl.") - 1))
>> + return 0;
>
> Is there any way we can use a slash separated path. I know
We could, but as others explained, people and tools are used to the dot
separation, so I think the only sensible options are supporting only dot, or
both dot and slash.
> in practice there are not any sysctl names that don't have
> a '.' in them but why should we artifically limit ourselves?
Existing tools would probably break (or perhaps sysctl(8) is smarter than I
think, dunno).
> I guess as long as we don't mind not being able to set sysctls
> that have a '.' in them it doesn't matter.
Right.
>> +
>> + param += sizeof("sysctl.") - 1;
>> +
>> + remaining = param;
>> + ctl = &sysctl_base_table[0];
>> +
>> + while(ctl->procname != 0) {
> ^^^^^^^^^^^^^^^^^^
>
> Please either test "while(ctl->procname)" or
> "while(ctl->procname != NULL)" testing against 0 makes it look like
> procname is an integer. The style in the kernel is to test against
> NULL, to make it clear when something is a pointer.
OK
>> + int len = strlen(ctl->procname);
>
> You should have done "strchr(remaining)" and figured out if there is
> another '.' and only compared up to that dot. Probably skipping this
> entry entirely if the two lengths don't match.
That's also possible, but AFAICS my code works as intended, as I explained in a
reply to Kees, and also below:
>> + if (strncmp(remaining, ctl->procname, len)) {
>> + ctl++;
>> + continue;
>> + }
>> + if (ctl->child) {
>> + if (remaining[len] == '.') {
>> + remaining += len + 1;
>> + ctl = ctl->child;
>> + continue;
>> + }
>> + } else {
>> + if (remaining[len] == '\0') {
>> + found = ctl;
>> + break;
>> + }
>> + }
>> + ctl++;
>
> There should be exactly one match for a name a table.
> If you get here the code should break, not continue on.
If there existed e.g. both "vm.swap" and "vm.swappiness" options and user passed
"vm.swappiness=10", but the "swap" ctl entry was encountered first, it will
succeed the strncmp(), but then realize "swap" was just a prefix of what user
specified (remaining[len] is not '\0') and hence continue serching for other
matches.
>> + }
>> +
>> + if (!found) {
>> + pr_warn("Unknown sysctl param '%s' on command line", param);
>> + return 0;
>> + }
>> +
>> + if (!(found->mode & 0200)) {
>> + pr_warn("Cannot set sysctl '%s=%s' from command line - not writable",
>> + param, val);
>> + return 0;
>> + }
>> +
>> + count = strlen(val);
>> + err = found->proc_handler(found, 1, val, &count, &ppos);
>> +
>> + if (err)
>> + pr_warn("Error %d setting sysctl '%s=%s' from command line",
>> + err, param, val);
>> +
>> + pr_debug("Set sysctl '%s=%s' from command line", param, val);
>> +
>> + return 0;
>> +}
>
> You really should be able to have this code live in
> fs/proc/proc_sysctl.c and utilize lookup_entry.
>
> That should give you the ability to lookup any sysctl. If
> kernel/sysctl.c is compiled into the kernel proc_sysctl.c is compiled
> into the kernel. Systems that don't select CONFIG_PROC_SYSCTL won't
> have any sysctl tables installed at all so they do not make sense to
> consider or design for.
I see. In fact one reason why I tried to avoid the proc stuff was your commit
61a47c1ad3a4 ("sysctl: Remove the sysctl system call") and this part:
> As this removes one of the few uses of the internal kernel mount
> of proc I hope this allows for even more simplifications of the
> proc filesystem.
But if you now suggest using the kernel mount then sure, it I don't object make
the code simpler and handle all sysctls.
> Further it will be faster to lookup the sysctls using the code from
> proc_sysctl.c as it constructs an rbtree of all of the entries in
> a directory. The code might as well take advantage of that for large
> directories.
>
> Arguably the main sysctl tables in kernel/sysctl.c should be split up so
> that things are more localized and there is less global state exported
> throughout the kernel. I certainly don't want to discourage anyone from
> doing that just so their sysctl can be used on the command line.
Fair point.
> Hmm. There is a big gotcha in here and I think it should be mentioned.
> This code only works because no one has done set_fs(KERNEL_DS). Which
> means this only works with strings that are kernel addresses essentially
> by mistake. A big fat comment documenting why it is safe to pass in
> kernel addresses to a function that takes a "char __user*" pointer
> would be very good.
Thanks, didn't realize that.
> Eric
>