Re: kdbus: to merge or not to merge?

From: Andy Lutomirski
Date: Mon Aug 03 2015 - 19:03:00 EST


On Mon, Jun 22, 2015 at 11:06 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> 2. Kdbus introduces a novel buffering model. Receivers allocate a big
> chunk of what's essentially tmpfs space. Assuming that space is
> available (in a virtual memory sense), senders synchronously write to
> the receivers' tmpfs space. Broadcast senders synchronously write to
> *all* receivers' tmpfs space. I think that, regardless of
> implementation, this is problematic if the sender and the receiver are
> in different memcgs. Suppose that the message is to be written to a
> page in the receiver's tmpfs space that is not currently resident. If
> the write happens in the sender's memcg context, then a receiver can
> effectively allocate an unlimited number of pages in the sender's
> memcg, which will, in practice, be the init memcg if the sender is
> systemd. This breaks the memcg model. If, on the other hand, the
> sender writes to the receiver's tmpfs space in the receiver's memcg
> context, then the sender will block (or fail? presumably
> unpredictable failures are a bad thing) if the receiver's memcg is at
> capacity.
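
As a concrete (if rough) userspace analogy for that charging question
-- this is plain shm/tmpfs, not kdbus code, and the names and sizes
are made up -- one process below creates a 16MiB tmpfs-backed "pool"
and never touches it, while a second process maps it and dirties
every page. The dirtied pages get charged to the memcg of the task
that faults them in, i.e. the writer, which is the first of the two
scenarios described above:

/* Not kdbus code: a plain tmpfs-backed region shared between two
 * processes, to show where the page charges land. Older glibc may
 * need -lrt for shm_open(). */
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <err.h>

#define POOL_SIZE (16UL << 20)  /* 16MiB, like a kdbus receiver pool */

int main(void)
{
        /* "Receiver": create the tmpfs-backed pool, never touch it. */
        int fd = shm_open("/fake-pool", O_CREAT | O_EXCL | O_RDWR, 0600);
        if (fd < 0)
                err(1, "shm_open");
        shm_unlink("/fake-pool");
        if (ftruncate(fd, POOL_SIZE) < 0)
                err(1, "ftruncate");

        pid_t child = fork();
        if (child < 0)
                err(1, "fork");

        if (child == 0) {
                /* "Sender": map the pool and dirty every page. The
                 * tmpfs pages allocated by these faults are charged
                 * to *this* task's memcg, not to the creator's. */
                char *p = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
                if (p == MAP_FAILED)
                        err(1, "mmap");
                memset(p, 0x55, POOL_SIZE);
                _exit(0);
        }

        waitpid(child, NULL, 0);
        return 0;
}

Put the parent and the child in different memcgs and
memory.usage_in_bytes should show essentially the whole 16MiB charged
to the child -- the writer -- even though the parent owns the object.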

I realize that everyone is sick of this thread. Nonetheless, I should
emphasize that I'm actually serious about this issue. I got Fedora
Rawhide working under kdbus (thanks, everyone!), and I ran this little
program:

#include <systemd/sd-bus.h>
#include <err.h>

int main(int argc, char *argv[])
{
        /* Open and close system bus connections as fast as possible. */
        while (1) {
                sd_bus *bus;

                if (sd_bus_open_system(&bus) < 0) {
                        /* warn("sd_bus_open_system"); */
                        continue;
                }
                sd_bus_close(bus);
                sd_bus_unref(bus);  /* drop the reference so the bus object is freed */
        }
}

under both userspace dbus and kdbus. Userspace dbus burns some
CPU -- no big deal. I expected kdbus to fail to scale and burn a
disproportionate amount of CPU (because I don't see how it /can/
scale). Instead it fell over completely. I didn't bother debugging
it, but offhand I'd guess that the system OOMed and didn't come back.
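
(For anyone who wants to reproduce this or check the OOM guess: the
loop above builds with something along the lines of

  gcc -o sd-spin sd-spin.c $(pkg-config --cflags --libs libsystemd)

where sd-spin.c is whatever you named the file, and running it
confined to a small memcg, e.g.

  systemd-run --scope -p MemoryLimit=256M ./sd-spin

makes it easy to see whether the damage stays inside the test's own
memcg or, per the charging concern quoted above, lands somewhere else
entirely.)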

On very brief inspection, Rawhide seems to have a lot of kdbus
connections with 16MiB of mapped tmpfs stuff each (53 of them mapped,
and I don't know how many exist with tmpfs backing but aren't
mapped). Presumably the number only goes up as the degree of reliance
on the userspace proxy goes down. As it stands, that's over 3GB of
uncommitted backing store that my test is likely to forcibly commit
very quickly.

Frankly, I don't understand how it's possible to cleanly implement
kdbus' broadcast or lifetime semantics* in an environment with bounded
CPU or bounded memory. (And unbounded memory just changes the
problem, since the message backlog can just get worse and worse.)

I work in an industry in which lots of parties broadcast lots of data
to lots of people. If you try to drink from the firehose and you
can't swallow fast enough, either you need to throw something out (and
test your recovery code!) or you fail. At least in finance, no one
pretends that a global order of events in different cities is
practical.
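
To spell out what "throw something out" has to look like, here is a
minimal sketch -- my own illustration, not kdbus's design, and all
names and sizes are made up -- of a bounded per-receiver buffer: when
a broadcast doesn't fit, the sender drops it for that receiver
instead of buffering without bound, and the receiver detects the
resulting sequence gap and knows it must re-query current state
rather than trust that it saw every event:

/* Illustration only (not kdbus code): a bounded per-receiver buffer
 * where overflow is dropped and the receiver detects the gap. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SLOTS 64        /* fixed per-receiver capacity */

struct msg { uint64_t seq; int payload; };

struct receiver {
        uint64_t head, tail;    /* ring indices; head - tail <= SLOTS */
        uint64_t last_seen;     /* last sequence number consumed */
        struct msg ring[SLOTS];
};

/* Sender side: if a receiver's ring is full, drop the message for
 * that receiver rather than blocking on the slowest peer or
 * allocating more memory. */
static void broadcast(struct receiver *rcvs, int n, uint64_t seq, int payload)
{
        for (int i = 0; i < n; i++) {
                struct receiver *r = &rcvs[i];

                if (r->head - r->tail >= SLOTS)
                        continue;       /* full: drop */
                r->ring[r->head % SLOTS] = (struct msg){ seq, payload };
                r->head++;
        }
}

/* Receiver side: a jump in sequence numbers means broadcasts were
 * dropped, so any state derived from them (e.g. which peers exist)
 * must be re-queried -- this is the recovery path that has to exist
 * and be tested. */
static bool consume(struct receiver *r, int *payload, bool *gap)
{
        struct msg m;

        if (r->tail == r->head)
                return false;           /* nothing pending */
        m = r->ring[r->tail % SLOTS];
        r->tail++;
        *gap = (m.seq != r->last_seen + 1);
        r->last_seen = m.seq;
        *payload = m.payload;
        return true;
}

int main(void)
{
        struct receiver r = { 0 };
        uint64_t seq = 0;
        int payload;
        bool gap;

        /* A fast sender and a receiver that only drains half of each
         * burst: the overflow is dropped, not buffered forever. */
        for (int round = 0; round < 3; round++) {
                for (int i = 0; i < SLOTS; i++)
                        broadcast(&r, 1, ++seq, (int)seq);
                for (int i = 0; i < SLOTS / 2; i++)
                        if (consume(&r, &payload, &gap) && gap)
                                printf("gap before %d: resync\n", payload);
        }

        /* Drain the rest; the dropped broadcasts show up as a gap. */
        while (consume(&r, &payload, &gap))
                if (gap)
                        printf("gap before %d: resync\n", payload);
        return 0;
}

That gap-handling path is the recovery code that has to exist and get
exercised; the alternative is guaranteeing that every broadcast
reaches every peer, which is the part I don't see working under
bounded memory.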

* Detecting when your peer goes away is, of course, a widely
encountered and widely solved problem. I don't know of any deployed
systems that solve it by broadcasting the lifetime of everything to
everyone and relying on those broadcasts going through, though.

--Andy