Re: [PATCH 0/2] pipe: Fixes [ver #2]
From: Vincent Guittot
Date: Mon Dec 09 2019 - 04:54:03 EST
On Sat, 7 Dec 2019 at 23:48, Linus Torvalds
> On Fri, Dec 6, 2019 at 7:50 PM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > The "make goes slow" problem bisects down to b667b8673443 ("pipe:
> > Advance tail pointer inside of wait spinlock in pipe_read()").
> I'm not entirely sure that ends up being 100% true. It did bisect to
> that, but the behavior wasn't entirely stable. There definitely is
> some nasty timing trigger.
> But I did finally figure out what seems to have been going on with at
> least the biggest part of the build performance regression. It's seems
> to be a nasty interaction with the scheduler and the GNU make
> jobserver, and in particular the pipe wakeups really _really_ do seem
> to want to be synchronous both for the readers and the writers.
> When a writer wakes up a reader, we want the reader to react quickly
> and vice versa. The most obvious case was for the GNU make jobserver,
> where sub-makes would do a single-byte write to the jobserver pipe,
> and we want to wake up the reader *immediatly*, because the reader is
> actually a lot more important than the writer. The reader is what gets
> the next job going, the writer just got done with the last one.
> And when a reader empties a full pipe, it's because the writer is
> generating data, and you want to just get the writer going again asap.
> Anyway, I've spent way too much time looking at this and wondering
> about odd performance patterns. It seems to be mostly back up to
> I say "mostly", because I still see times of "not as many concurrent
> compiles going as I'd expect". It might be a kbuild problem, it might
> be an issue with GNU make (I've seen problems with the make jobserver
> wanting many more tokens than expected before and the kernel makefiles
> - it migth be about deep subdirectories etc), and it might be some
> remaining pipe issue. But my allmodconfig builds aren't _enormously_
> slower than they used to be.
> But there's definitely some unhappy interaction with the jobserver. I
> have 16 threads (8 cores with HT), and I generally use "make -j32" to
> keep them busy because the jobserver isn't great. The pipe rework made
> even that 2x slop not work all that well. Something held on to tokens
> too long, and there was definitely some interaction with the pipe
> wakeup code. Using "-j64" hid the problem, but it was a problem.
> It might be the new scheduler balancing changes that are interacting
> with the pipe thing. I'm adding PeterZ, Ingo and Vincent to the cc,
> because I hadn't realized just how important the sync wakeup seems to
> be for pipe performance even at a big level.
Which version of make should I use to reproduce the problem ?
My setup is not the same and my make is a bit old but I haven't been
able to reproduce the problem described above on my arm64 octa cores
system and v5.5-rc1.
All cores are busy with -j16. And even -j8 keeps the cores almost always busy
> I've pushed out my pipe changes. I really didn't want to do that kind
> of stuff at the end of the merge window, but I spent a lot more time
> than I wanted looking at this code, because I was getting to the point
> where the alternative was to just revert it all.
> DavidH, give these a look:
> 85190d15f4ea pipe: don't use 'pipe_wait() for basic pipe IO
> a28c8b9db8a1 pipe: remove 'waiting_writers' merging logic
> f467a6a66419 pipe: fix and clarify pipe read wakeup logic
> 1b6b26ae7053 pipe: fix and clarify pipe write wakeup logic
> ad910e36da4c pipe: fix poll/select race introduced by the pipe rework
> the top two of which are purely "I'm fed up looking at this code, this
> needs to go" kind of changes.
> In particular, that last change is because I think the GNU jobserver
> problem is partly a thundering herd issue: when a job token becomes
> free (ie somebody does a one-byte write to an empty jobserver pipe),
> it wakes up *everybody* who is waiting for a token. One of them will
> get it, and the others will go to sleep again. And then it repeats all
> over. I didn't fix it, but it _could_ be fixed with exclusive waits
> for readers/writers, but that means more smarts than pipe_wait() can
> do. And because the jobserver isn't great at keeping everybody happy,
> I'm using a much bigger "make -jX" value than the number of CPU's I
> have, which makes the herd bigger. And I suspect none of this helps
> the scheduler pick the _right_ process to run, which just makes
> scheduling an even bigger problem.