I completely agree that we need to increase the memory dedicated to the
hash, but I fear that vmallocing a 1 megabyte table per domain (effectively
per mount) is going overboard. I will assume that this was a straw man patch
:)
Of course the Right Thing To Do is kmalloc a vector of pages and mask the
hash into it, but as a lazy person I would rather sink the time into
writing this deathless prose.
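To be slightly less lazy about it, roughly this -- untested, uncompiled,
and with invented names (dlm_alloc_pagevec(), DLM_HASH_PAGES), purely to
show what I mean by masking the hash into a vector of pages:

#include <linux/slab.h>
#include <linux/gfp.h>
#include <linux/list.h>

#define DLM_HASH_PAGES		8
#define DLM_BUCKETS_PER_PAGE	(PAGE_SIZE / sizeof(struct hlist_head))
#define DLM_HASH_BUCKETS	(DLM_HASH_PAGES * DLM_BUCKETS_PER_PAGE)

/* Allocate the per-domain hash as an array of individual pages. */
static struct hlist_head **dlm_alloc_pagevec(int pages)
{
	struct hlist_head **vec;
	int i;

	vec = kmalloc(pages * sizeof(struct hlist_head *), GFP_KERNEL);
	if (!vec)
		return NULL;

	for (i = 0; i < pages; i++) {
		/* get_zeroed_page() leaves every hlist_head empty (NULL). */
		vec[i] = (struct hlist_head *)get_zeroed_page(GFP_KERNEL);
		if (!vec[i])
			goto out_free;
	}
	return vec;

out_free:
	while (--i >= 0)
		free_page((unsigned long)vec[i]);
	kfree(vec);
	return NULL;
}

/* Mask the hash value down to a page and a slot within that page. */
static inline struct hlist_head *dlm_lockres_bucket(struct hlist_head **vec,
						    unsigned int hash)
{
	unsigned int b = hash % DLM_HASH_BUCKETS;

	return &vec[b / DLM_BUCKETS_PER_PAGE][b % DLM_BUCKETS_PER_PAGE];
}

No single allocation bigger than a page, and growing the table is just a
matter of bumping DLM_HASH_PAGES.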
Right. That's one possible approach - I was originally thinking of
allocating a large global lockres hash table at module init time.
This way we don't have to make large allocs for each domain.
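Something roughly along these lines is what I had in mind -- one table
allocated when the module loads, with the domain pointer folded into the
hash so resources from different domains can share buckets. Just a sketch;
dlm_global_lockres_hash and DLM_GLOBAL_HASH_BITS are made-up names and the
sizes are placeholders:

#include <linux/vmalloc.h>
#include <linux/list.h>
#include <linux/cache.h>

#define DLM_GLOBAL_HASH_BITS	14
#define DLM_GLOBAL_HASH_SIZE	(1UL << DLM_GLOBAL_HASH_BITS)

static struct hlist_head *dlm_global_lockres_hash;

static int __init dlm_hash_init(void)
{
	unsigned long i;

	/* One allocation for the life of the module, shared by all domains. */
	dlm_global_lockres_hash = vmalloc(DLM_GLOBAL_HASH_SIZE *
					  sizeof(struct hlist_head));
	if (!dlm_global_lockres_hash)
		return -ENOMEM;

	for (i = 0; i < DLM_GLOBAL_HASH_SIZE; i++)
		INIT_HLIST_HEAD(&dlm_global_lockres_hash[i]);

	return 0;
}

/* Fold the domain pointer into the hash so domains spread across buckets. */
static inline struct hlist_head *dlm_global_bucket(struct dlm_ctxt *dlm,
						   unsigned int hash)
{
	hash ^= (unsigned int)((unsigned long)dlm >> L1_CACHE_SHIFT);
	return &dlm_global_lockres_hash[hash & (DLM_GLOBAL_HASH_SIZE - 1)];
}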
Before anything else though, I'd really like to get an idea of how large we
want things. This might very well dictate the severity of the solution.
I failed to quantify the improvement precisely because there are other
glitchy things going on that interfere with accurate measurement. So now
we get to...
Definitely if you get a chance to show how much just the lookup optimization
helps, I'd like to know. I'll also try to gather some numbers.
...look at real time for the untar and sync!
Indeed. The real time numbers are certainly confusing. I actually saw real
time decrease on most of my tests (usually I hack things to increase the
hash allocation to something like a few pages). I want to do some more
consecutive untars though.
Ocfs2 sometimes sits and gazes at its navel for minutes at a time, doing
nothing at all. Timer bug? A quick glance at SysRq-t shows ocfs2 waiting
in io_schedule. Waiting for io that never happens? This needs more
investigation.
Did you notice this during the untar? If not, do you have any reproducible
test case?
Delete performance remains horrible, even with a 256 meg journal[4] which
is unconscionably big anyway. Compare to ext3, which deletes kernel trees
at a steady 2 seconds per, with a much smaller journal. Ocfs2 takes more
like a minute.
This doesn't surprise me - our unlink performance leaves much to be desired
at the moment. How many nodes did you have mounted when you ran that test?
Off the top of my head, the two things which I would guess are hurting delete
the most right now are node messaging and lack of directory read ahead. The
first will be solved by more intelligent use of the DLM so that the network
will only be hit for those nodes that actually care about a given inode --
unlink, rename, and mount/unmount are the only things left that still use the
slower 'vote' mechanism. Directory readahead is much easier to solve; it's
just that nobody has gotten around to fixing it yet :/ I bet there's more
to figure out with respect to unlink performance.
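For what it's worth, the readahead side could be as dumb as the sketch
below: kick off asynchronous READA requests for the next few directory
blocks while walking the directory, so the later synchronous reads hit the
buffer cache. ocfs2_dir_readahead() and the 4-block window are invented for
illustration, and the logical-to-physical mapping is glossed over entirely:

#include <linux/fs.h>
#include <linux/buffer_head.h>

/*
 * Invented helper, not real ocfs2 code.  Assumes the caller already knows
 * the physical block and that the next few blocks are contiguous; a real
 * version would go through the extent map first.
 */
static void ocfs2_dir_readahead(struct inode *dir, sector_t phys_block)
{
	struct super_block *sb = dir->i_sb;
	struct buffer_head *bh;
	int i;

	for (i = 1; i <= 4; i++) {	/* 4-block window picked arbitrarily */
		bh = sb_getblk(sb, phys_block + i);
		if (!bh)
			break;
		if (!buffer_uptodate(bh))
			ll_rw_block(READA, 1, &bh);
		brelse(bh);
	}
}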
[2] It is high time we pried loose the ocfs2 design process from secret
irc channels and dark tunnels running deep beneath Oracle headquarters,
and started sharing the squishy goodness of filesystem clustering
development with some more of the smart people lurking here.
We're not trying to hide anything from anyone :)
I'm always happy to talk about design. We've been in bugfix (and more
recently, performance fix) mode for a while now, so there hasn't been much
new design work to talk about lately.
+ if (likely(res->lockname.name[0] != name[0]))
Is the likely() here necessary? It actually surprises me that the check even
helps so much - if you check the way OCFS2 lock names are built, the first
few bytes are very regular - the first char is almost always going to be one
of M, D, or W. Hmm, I guess if you're hitting it 1/3 fewer times...
+ continue;
+ if (likely(res->lockname.len != len))
Here too, the lengths of OCFS2 lock names are actually fairly regular, so
perhaps the likely() isn't right?
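For comparison, keeping the cheap early-exit checks but dropping the static
hints might look something like this. Just a sketch of the ordering (the
plain integer length compare pulled in front of the byte compares), not a
patch against the current __dlm_lookup_lockres() loop:

/*
 * Sketch only: same cheap checks as the patch, no likely()/unlikely().
 * Compare the length first, then the first byte, then fall through to
 * the full memcmp.
 */
static struct dlm_lock_resource *lookup_sketch(struct hlist_head *bucket,
					       const char *name,
					       unsigned int len)
{
	struct dlm_lock_resource *res;
	struct hlist_node *iter;

	hlist_for_each_entry(res, iter, bucket, hash_node) {
		if (res->lockname.len != len)
			continue;
		if (res->lockname.name[0] != name[0])
			continue;
		if (memcmp(res->lockname.name + 1, name + 1, len - 1))
			continue;
		dlm_lockres_get(res);	/* caller drops the ref */
		return res;
	}

	return NULL;
}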