Re: [REGRESSION] RLIMIT_DATA crashes named

From: Sam Varshavchik
Date: Fri Sep 16 2016 - 18:36:16 EST


Linus Torvalds writes:

On Fri, Sep 16, 2016 at 1:10 PM, Laura Abbott <labbott@xxxxxxxxxx> wrote:
>
> As far as I can tell this isn't Fedora specific.

Some googling does seem to say that "datalimit 20M" and "named.conf"
ends up being some really old default that just gets endlessly copied.

So no, it's not Fedora-specific per se.

I'll confirm that.

It's been sitting in my named.conf for at least ten years. I don't remember where it came from. The sources Google turns up are a very likely origin; I probably copied it from some tutorial.

But I suspect most people with a named.conf did either

(a) get it from their distro and didn't change it and so if the
distro just updates theirs, things will automatically "just work"

(b) actually did write their own (or at least edited it), and knows
what they are doing, and have absolutely no problem removing or
updating that datalimit thing.

(b) in my case. Now that the root cause is mostly known, I'll just bump it up.

The really annoying thing seems to be that the kernel message has been
hidden too much. IOW, Sam in his bugzilla report clearly found the
system messages with

Sep 10 07:38:23 shorty systemd-coredump: Process 1651 (named) of
user 25 dumped core.

but for some reason never noticed the kernel saying (quoting Jason):

mmap: named (593): VmData 27566080 exceed data ulimit 20971520.
Update limits or use boot option ignore_rlimit_data

at the same time.

Ok, the kernel only says it *once*. Maybe Sam had it in his logs, but
didn't notice the initial failure (which would have had the kernel
message too), and he then looked at the logs for when he tried to
re-start.

I still have this log file. Looking over it, this is indeed what happened.

Or maybe the system logs don't have those kernel messages, which would
be a disaster.

So maybe we should just change the "pr_warn_once()" into
"pr_warn_ratelimited()", except the default rate limits for that are
wrong (we'd perhaps want something like "at most once every minute" or
similar, while the default rate limits are along the lines of "max 10
lines every 5 _seconds_").
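
To make the difference between those two policies concrete, here is a minimal user-space sketch in Python. It is not the kernel's ratelimit code: the RateLimiter class and the count_printed helper are hypothetical names made up for illustration, and the fixed-window behaviour is a simplifying assumption. The (10, 5) setting mirrors the "max 10 lines every 5 seconds" default described above, and (1, 60) is the "at most once every minute" alternative.

```python
class RateLimiter:
    """Hypothetical fixed-window limiter: at most `burst` events per `interval` seconds."""

    def __init__(self, burst, interval):
        self.burst = burst
        self.interval = interval
        self.window_start = None
        self.count = 0

    def allow(self, now):
        # Open a new window on first use, or once the current window has expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.count = 0
        if self.count < self.burst:
            self.count += 1
            return True
        return False


def count_printed(limiter, timestamps):
    """Count how many events at the given timestamps would actually be logged."""
    return sum(1 for t in timestamps if limiter.allow(t))


# Assumed workload: one failing mmap per second for two minutes.
events = list(range(120))

default_like = count_printed(RateLimiter(10, 5), events)
per_minute = count_printed(RateLimiter(1, 60), events)

print(default_like)  # 120: one event per second never exceeds 10 per 5 seconds
print(per_minute)    # 2: one line at t=0 and one at t=60
```

Under this assumed workload the default-style limit suppresses nothing, while a once-per-minute limit would still leave a fresh warning close to each later failure.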

Sam, do you end up seeing the kernel warning in your logs if you just
go back earlier in the boot?

Yes, I found it.

Sep 10 07:36:29 shorty kernel: mmap: named (1108): VmData 52588544 exceed data ulimit 20971520. Update limits or use boot option ignore_rlimit_data.

Now that I know what to search for: this appeared about 300 lines earlier in /var/log/messages.
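
For anyone else digging through their logs, a rough Python sketch of that search follows. The regular expression is modelled only on the two kernel lines quoted in this thread, so treat the pattern as an assumption and not as a stable kernel log format; find_rlimit_warnings is a hypothetical helper name.

```python
import re

# Pattern modelled on the kernel lines quoted in this thread (an assumption,
# not a documented or stable format).
WARNING_RE = re.compile(
    r"mmap: (?P<comm>.+?) \((?P<pid>\d+)\): "
    r"VmData (?P<vmdata>\d+) exceed data ulimit (?P<limit>\d+)"
)


def find_rlimit_warnings(lines):
    """Return one dict per log line that carries the RLIMIT_DATA warning."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        m = WARNING_RE.search(line)
        if m:
            hits.append({
                "lineno": lineno,
                "comm": m.group("comm"),
                "pid": int(m.group("pid")),
                "vmdata": int(m.group("vmdata")),
                "limit": int(m.group("limit")),
            })
    return hits


# The two log lines quoted in this thread.
sample = [
    "Sep 10 07:36:29 shorty kernel: mmap: named (1108): VmData 52588544 "
    "exceed data ulimit 20971520. Update limits or use boot option "
    "ignore_rlimit_data.",
    "Sep 10 07:38:23 shorty systemd-coredump: Process 1651 (named) of "
    "user 25 dumped core.",
]

for hit in find_rlimit_warnings(sample):
    # 20971520 bytes is exactly 20 MiB, i.e. the "datalimit 20M" setting.
    print(hit["comm"], hit["pid"], hit["limit"] // 2**20, "MiB limit")
```

Any iterable of lines works as input, so an open log file object can be passed in place of the sample list.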

When trying to figure out what was going on with named, I searched backwards in time and found the logged segfault at 07:38:23. IIRC I only looked as far back as where the 07:38:23 timestamps started, and did not see anything other than the apparent segfault. Before that, /var/log/messages was full of other noise. The original named that was launched two minutes earlier was ancient history by then.

All I saw was that named was apparently segfaulting after booting a new kernel. Ok, boot back to the previous kernel, search bugzilla to see if it was reported already, and, if not, create it yourself. That's what happened.