Re: sscanf: implement basic character sets

From: Jessica Yu
Date: Wed Feb 24 2016 - 00:13:56 EST


+++ Andrew Morton [23/02/16 14:05 -0800]:
On Tue, 23 Feb 2016 15:38:22 -0500 Jessica Yu <jeyu@xxxxxxxxxx> wrote:

Implement basic character sets for the '%[]' conversion specifier.

The '%[]' conversion specifier matches a nonempty sequence of characters
from the specified set of accepted (or with '^', rejected) characters
between the brackets. The substring matched is to be made up of characters
in (or not in) the set. This implementation differs from its glibc
counterpart in that it does not support character ranges (e.g., 'a-z' or
'0-9'), the hyphen '-' is *not* a special character, and the brackets
themselves cannot be matched.

Signed-off-by: Jessica Yu <jeyu@xxxxxxxxxx>
---

This patch adds support for the '%[' conversion specifier for sscanf().
This is useful in cases where we'd like to match substrings delimited by
something other than spaces. The original motivation for this patch
actually came from a livepatch discussion (See: https://lkml.org/lkml/2016/2/8/790),
where we were trying to come up with a clean way to parse symbol names with
substrings delimited by periods and commas.

It would be better to include the justification right here in the
changelog please. Not via some link-to-discussion and definitely not
below the ^--- marker! It's very important.

Thanks for the corrections Andrew. I am however slightly confused, are
you suggesting that I should provide a much more thorough explanation
about the motivation here in the changelog (below the ^--- marker), or
would this be better suited for a (separate) cover letter?

The deviation from the glibc behaviour is a bit of a worry,
particularly as it is done in a non-back-compat manner: code which
assumes "-" is non-magic might break if someone later adds range
support.

Presumably we can live with that - there won't be many callsites and
they can be grepped for. But please, let's get a description of all
these considerations into the code as a comment. Probably it would be
helpful to include a little usage example in that comment.

Hm, that is a very good point. At the moment we can be sure there
aren't any users of sscanf() using the %[ conversion specifier, as it
doesn't exist yet :-) But yes, this behavior should be documented
clearly in a comment, so future users will be aware..

--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -2714,6 +2714,47 @@ int vsscanf(const char *buf, const char *fmt, va_list args)
num++;
}
continue;
+ case '[':
+ {
+ char *s = (char *)va_arg(args, char *);
+ char *set;
+ size_t (*op)(const char *str, const char *set);
+ size_t len = 0;
+ bool negate = (*(fmt) == '^');
+
+ if (field_width == -1)
+ field_width = SHRT_MAX;
+
+ op = negate ? &strcspn : &strspn;
+ if (negate)
+ fmt++;
+
+ len = strcspn(fmt, "]");
+ /* invalid format; stop here */
+ if (!len)
+ return num;
+
+ set = kstrndup(fmt, len, GFP_KERNEL);

Embedding a GFP_KERNEL allocation into vsscanf is problematic - it
limits the situations in which this functionality can be used.

afaict the allocation is there merely so we can null-terminate the
string so we can use existing library functions (strcspn, strspn). Is
that compromise really worth it? We could pretty easily convert
strcspn() into

strcnspn(const char *s, const char *reject, size_t len)

and convert strcspn() to call that (ifndef __HAVE_ARCH_STRCSPN)

In fact I think we could still use strspn() and strcspn() on `fmt'
directly? We just need to check for the return value exceeding `len'
and if so, treat that as a no-match?


Perhaps we can use Rasmus' bitmap solution, as it avoids the
allocation altogether and it doesn't need to use strspn()/strcspn().

+ if (!set)
+ return num;
+
+ /* advance fmt past ']' */
+ fmt += len + 1;
+
+ len = op(str, set);
+ /* no matches */
+ if (!len) {
+ kfree(set);
+ return num;
+ }
+
+ while (len-- && field_width--)
+ *s++ = *str++;
+ *s = '\0';
+ kfree(set);
+ num++;
+ }
+ continue;
case 'o':
base = 8;
break;