Re: [RFC 3/3] SGI Altix cross partition memory (XPMEM)

From: Dean Nelson
Date: Wed Aug 22 2007 - 13:00:33 EST


On Thu, Aug 09, 2007 at 11:15:42PM -0700, Andrew Morton wrote:
> On Thu, 9 Aug 2007 20:14:35 -0500 Dean Nelson <dcn@xxxxxxx> wrote:
>
> > This patch provides cross partition access to user memory (XPMEM) when
> > running multiple partitions on a single SGI Altix.
> >
>
> - Please don't send multiple patches all with the same title.

Sorry, I'd somehow gotten the impression it was to be done this way.

> - Feed the diff through checkpatch.pl, ponder the result.

I did, and made the changes needed to resolve the issues it raised, except
for:

1) ERROR: Don't use kernel_thread(): see Documentation/feature-removal-schedule.txt

As mentioned in the intro to this set of patches, XPMEM is not currently
using the kthread API because it was in the process of being changed to
require a kthread_stop() be done for every kthread_create(), and the
kthread_stop() couldn't be called for a thread that had already exited.
In talking with Eric Biederman, the idea of creating a kthread_orphan()
came up, which would remove the requirement to call kthread_stop() for
such threads.

2) WARNING: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt

I eliminated all but two usages of volatile in XPMEM. One in
xpmem_flush_cpu_cache() and the other in xpmem_recall_cache_lines_phys().
Without the volatile the compiler eliminates the necessary instruction in
each. Any suggestions as to how to eliminate the need for volatile in these
two cases?
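
For anyone not familiar with the general problem, here is a minimal
user-space sketch. It is not the XPMEM code: touch_line() is a hypothetical
name, and I am assuming for illustration that the instruction in question
is a load whose result is never used. Without the volatile-qualified access
the compiler is free to drop such a load entirely.

```c
#include <stddef.h>

/*
 * Hypothetical illustration only: read one byte at addr and discard it.
 * The volatile qualifier is applied at the access, via a cast, rather
 * than on any variable declaration, so the compiler must emit the load
 * even though its result is unused.
 */
static inline void touch_line(const void *addr)
{
	(void)*(const volatile char *)addr;
}
```

The memory itself is not modified; only the load is forced.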

3) WARNING: declaring multiple variables together should be avoided

checkpatch.pl is erroneously complaining about the following found in five
different functions in arch/ia64/sn/kernel/xpmem_pfn.c.

int n_pgs = xpmem_num_of_pages(vaddr, size);

> - Avoid needless casts to and from void* (eg, vm_private_data)

Done.

> - The test for PF_DUMPCORE in xpmem_fault_handler() is mysterious and
> merits a comment.

Done.

> - xpmem_fault_handler() is scary.

You are very right about that, but consider that this code has been running
at our customer sites on very large systems (18 partitions with 512 CPUs in
each partition). The avoid_deadlock_1 and avoid_deadlock_2 issues were
discovered at such sites and have been resolved (to the best of our
knowledge).

> - xpmem_fault_handler() appears to have imposed a kernel-wide rule that
> when taking multiple mmap_sems, one should take the lowest-addressed one
> first? If so, that probably wants a mention in that locking comment in
> filemap.c

Sure. After looking at the lock ordering comment block in mm/filemap.c, it
wasn't clear to me how best to document this. Any suggestions/help would
be most appreciated.
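
To make the rule concrete, here is a minimal user-space sketch of
"take the lowest-addressed lock first". It is not the XPMEM code and does
not use mmap_sem; struct toy_lock, lock_pair() and unlock_pair() are
hypothetical names, and the lock is a simple C11 atomic_flag spinlock.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a lock such as mmap_sem. */
struct toy_lock {
	atomic_flag held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&l->held,
						 memory_order_acquire))
		;
}

static void toy_lock_release(struct toy_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Acquire both locks, lowest address first, so that two callers passing
 * the same pair in opposite order cannot deadlock against each other.
 * Returns the lock that was taken first.
 */
static struct toy_lock *lock_pair(struct toy_lock *a, struct toy_lock *b)
{
	struct toy_lock *first = a, *second = b;

	if (a == b) {
		toy_lock_acquire(a);
		return a;
	}
	if ((uintptr_t)a > (uintptr_t)b) {
		first = b;
		second = a;
	}
	toy_lock_acquire(first);
	toy_lock_acquire(second);
	return first;
}

static void unlock_pair(struct toy_lock *a, struct toy_lock *b)
{
	toy_lock_release(a);
	if (b != a)
		toy_lock_release(b);
}
```

Whatever order the caller names the two locks in, the acquisition order
is the same, which is the property the comment in filemap.c would need to
state.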

> - xpmem_fault_handler() does atomic_dec(&seg_tg->mm->mm_users). What
> happens if that was the last reference?

When /dev/xpmem is opened by a user process, xpmem_open() incs mm_users
and when it is flushed, xpmem_flush() decs it (via mmput()) after having
ensured that no XPMEM attachments to this mm exist. Thus the dec in
xpmem_fault_handler() will never take it to 0.
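
A toy model of that reasoning, as a user-space sketch only. The names
(struct toy_mm, toy_open(), toy_fault(), toy_flush()) are hypothetical,
and it assumes the handler's decrement pairs with an earlier increment in
the same path.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the mm_users count of an mm. */
struct toy_mm {
	atomic_int users;
};

/* Opening the device takes a long-lived reference. */
static void toy_open(struct toy_mm *mm)
{
	atomic_fetch_add(&mm->users, 1);
}

/*
 * The fault path takes a temporary reference and drops it again.
 * Returns true if the count is still above zero after the drop.
 */
static bool toy_fault(struct toy_mm *mm)
{
	int before_dec;

	atomic_fetch_add(&mm->users, 1);
	/* ... fault handling would happen here ... */
	before_dec = atomic_fetch_sub(&mm->users, 1);
	return before_dec - 1 > 0;
}

/*
 * Flush drops the reference taken at open time.
 * Returns true if this was the last reference.
 */
static bool toy_flush(struct toy_mm *mm)
{
	return atomic_fetch_sub(&mm->users, 1) == 1;
}
```

As long as the open-time reference is held, the drop in the fault path
leaves the count at 1 or more; only the flush can bring it to 0.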

> - Has it all been tested with lockdep enabled? Judging from all the use
> of SPIN_LOCK_UNLOCKED, it has not.
>
> Oh, ia64 doesn't implement lockdep. For this code, that is deeply
> regrettable.

No, it hasn't been tested with lockdep. But I have switched it from using
SPIN_LOCK_UNLOCKED to spin_lock_init().

> - This code all predates the nopage->fault conversion and won't work in
> current kernels.

I've switched from using nopage to using fault. I read that it is intended
that nopfn also goes away. If this is the case, then the BUG_ON if VM_PFNMAP
is set would make __do_fault() a rather unfriendly replacement for do_no_pfn().

> - xpmem_attach() does smp_processor_id() in preemptible code. Lucky that
> ia64 doesn't do preempt?

Actually, the code is fine as is even with preemption configured on. All it's
doing is ensuring that the thread was previously pinned to the CPU it's
currently running on. If it is, it can't be moved to another CPU via
preemption, and if it isn't, the check will fail and we'll return -EINVAL
and all is well.
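
The logic of that check, reduced to a user-space sketch. check_pinned()
is a hypothetical name, and representing the allowed CPUs as a single
unsigned long bitmask is an assumption made for illustration; the real
code works on the task's CPU mask.

```c
#include <errno.h>
#include <limits.h>

/*
 * Hypothetical illustration: succeed only if the set of CPUs the thread
 * may run on is exactly the CPU it reported running on. If the thread is
 * pinned to a single CPU, it cannot have migrated since reading cur_cpu.
 * If it is not pinned, the comparison fails no matter which CPU was read.
 */
static int check_pinned(unsigned long allowed_mask, unsigned int cur_cpu)
{
	if (cur_cpu >= sizeof(allowed_mask) * CHAR_BIT)
		return -EINVAL;
	if (allowed_mask != (1UL << cur_cpu))
		return -EINVAL;
	return 0;
}
```

Either outcome is safe, which is why a possibly stale CPU number does no
harm here.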

> - Stuff like this:
>
> + ap_tg = xpmem_tg_ref_by_apid(apid);
> + if (IS_ERR(ap_tg))
> + return PTR_ERR(ap_tg);
> +
> + ap = xpmem_ap_ref_by_apid(ap_tg, apid);
> + if (IS_ERR(ap)) {
> + xpmem_tg_deref(ap_tg);
> + return PTR_ERR(ap);
> + }
>
> is fragile. It is easy to introduce leaks and locking errors as the
> code evolves. The code has a lot of these deeply-embedded `return'
> statements preceded by duplicated unwinding. Kernel code generally
> prefers to do the `goto out' thing.

Done (assuming I correctly interpreted your intent).
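
The shape I took this to mean, as a user-space sketch rather than the
actual patch. struct ref, ref_get(), ref_put() and do_attach() are
hypothetical names, and NULL stands in for the IS_ERR()/PTR_ERR() checks
to keep the example self-contained.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical reference-counted object. */
struct ref {
	int count;
};

static struct ref *ref_get(struct ref *r)
{
	if (r == NULL)
		return NULL;
	r->count++;
	return r;
}

static void ref_put(struct ref *r)
{
	r->count--;
}

/*
 * Take two references, do some work, and release in reverse order.
 * Every failure after the first acquisition jumps to a label, so each
 * release is written exactly once.
 */
static int do_attach(struct ref *tg, struct ref *ap, int fail_late)
{
	int ret;

	if (ref_get(tg) == NULL)
		return -EINVAL;		/* nothing to unwind yet */

	if (ref_get(ap) == NULL) {
		ret = -EINVAL;
		goto out_tg;
	}

	if (fail_late) {
		ret = -ENOMEM;
		goto out_ap;
	}

	ret = 0;
out_ap:
	ref_put(ap);
out_tg:
	ref_put(tg);
	return ret;
}
```

The success path falls through the same labels, so adding a new resource
means adding one acquisition and one label rather than touching every
early return.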

> - "XPMEM_FLAG_VALIDPTEs"? Someone's pinky got tired at the end ;)

Yeah, but it's better now (and so is the code).

> Attention span expired at 19%, sorry. It's a large patch.

I've included the new patch as an attachment.

Thanks,
Dean

Attachment: xpmem-module.v002.bz2
Description: BZip2 compressed data