Re: [NFS] [PATCH 010 of 11] knfsd: make rpc threads pools numaaware

From: Greg Banks
Date: Tue Aug 01 2006 - 01:20:36 EST


On Tue, 2006-08-01 at 14:43, Andrew Morton wrote:
> On Mon, 31 Jul 2006 15:54:57 +1000
> Greg Banks <gnb@xxxxxxxxxxxxxxxxx> wrote:
>
> > On Mon, 2006-07-31 at 14:42, Greg Banks wrote:
> > > On Mon, 2006-07-31 at 14:36, Neil Brown wrote:
> > > > On Sunday July 30, akpm@xxxxxxxx wrote:
> > > > > On Mon, 31 Jul 2006 10:42:34 +1000
> > > > > NeilBrown <neilb@xxxxxxx> wrote:
> > > > >
> > > > > > +static int
> > > > > > +svc_pool_map_init_percpu(struct svc_pool_map *m)
> > > > > > +{
> > > > > > + unsigned int maxpools = num_possible_cpus();
> > > > > > + unsigned int pidx = 0;
> > > > > > + unsigned int cpu;
> > > > > > + int err;
> > > > > > +
> > > > > > +
> > > > >
> > > > > That isn't right - it assumes that cpu_possible_map is not sparse. If it
> > > > > is sparse, we allocate undersized pools and then overindex them.
> > >
> > > Umm, I think Andrew's right, num_possible_cpus() should be NR_CPUS.
> >
> > How about this version of the patch? It replaces num_possible_cpus()
> > with highest_possible_processor_id()+1 and similarly for nodes.
> > --
> >
> > knfsd: Actually implement multiple pools. On NUMA machines, allocate
> > a svc_pool per NUMA node; on SMP a svc_pool per CPU; otherwise a single
> > global pool. Enqueue sockets on the svc_pool corresponding to the CPU
> > on which the socket bh is run (i.e. the NIC interrupt CPU). Threads
> > have their cpu mask set to limit them to the CPUs in the svc_pool that
> > owns them.
> >
> > This is the patch that allows an Altix to scale NFS traffic linearly
> > beyond 4 CPUs and 4 NICs.
> >
> > Incorporates changes and feedback from Neil Brown, Trond Myklebust,
> > Christoph Hellwig and Andrew Morton.
> >
>
> Something has gone rather wrong here.
>
> > - serv = __svc_create(prog, bufsize, shutdown, /*npools*/1);
> > + serv = __svc_create(prog, bufsize, shutdown, npools);
>
> __svc_create() is:
>
> __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> void (*shutdown)(struct svc_serv *serv))
>
> so heaven knows what tree you're patching.

Hmm. Neil introduced the `shutdown' argument a few days ago, and I
thought I had his patches in my quilt tree, but apparently I screwed
something up. Sorry about that.

> Incremental patches really are preferred. So we can see what people are
> monkeying with ;)

Righto.

> After fixing the rejects and cleaning a few things up, your proposed change
> amounts to:
>
> --- a/net/sunrpc/svc.c~knfsd-make-rpc-threads-pools-numa-aware-fix
> +++ a/net/sunrpc/svc.c
> @@ -116,7 +116,7 @@ fail:
> static int
> svc_pool_map_init_percpu(struct svc_pool_map *m)
> {
> - unsigned int maxpools = num_possible_cpus();
> + unsigned int maxpools = highest_possible_processor_id() + 1;
> unsigned int pidx = 0;
> unsigned int cpu;
> int err;
> @@ -136,6 +136,18 @@ svc_pool_map_init_percpu(struct svc_pool
> return pidx;
> };
>
> +static int
> +highest_possible_node_id(void)
> +{
> + unsigned int node;
> + unsigned int highest = 0;
> +
> + for_each_node(node)
> + highest = node;
> +
> + return highest;
> +}
> +
>
> /*
> * Initialise the pool map for SVC_POOL_PERNODE mode.
> @@ -144,7 +156,7 @@ svc_pool_map_init_percpu(struct svc_pool
> static int
> svc_pool_map_init_pernode(struct svc_pool_map *m)
> {
> - unsigned int maxpools = num_possible_nodes();
> + unsigned int maxpools = highest_possible_node_id() + 1;
> unsigned int pidx = 0;
> unsigned int node;
> int err;
> _
>
>

Yes.

> Which shouldn't have compiled, due to the missing forward declaration.

? It compiled, booted, and ran traffic.

gnb@chook 1965> quilt push 2
Applying patch knfsd-make-pools-numa-aware-5
patching file net/sunrpc/svc.c
patching file net/sunrpc/svcsock.c
patching file include/linux/sunrpc/svc.h

Applying patch knfsd-allow-admin-to-set-nthreads-per-node-2
patching file fs/nfsd/nfsctl.c
patching file fs/nfsd/nfssvc.c

Now at patch knfsd-allow-admin-to-set-nthreads-per-node-2

gnb@chook 1966> make compressed modules
CHK include/linux/version.h
CHK include/linux/utsrelease.h
CHK include/linux/compile.h
CC arch/ia64/ia32/sys_ia32.o
LD arch/ia64/ia32/built-in.o
CC fs/compat.o
CC fs/nfsctl.o
CC [M] fs/lockd/clntlock.o
CC [M] fs/lockd/clntproc.o
CC [M] fs/lockd/host.o
CC [M] fs/lockd/svc.o
CC [M] fs/lockd/svclock.o
CC [M] fs/lockd/svcshare.o
CC [M] fs/lockd/svcproc.o
CC [M] fs/lockd/svcsubs.o
CC [M] fs/lockd/mon.o
CC [M] fs/lockd/xdr.o
CC [M] fs/lockd/xdr4.o
CC [M] fs/lockd/svc4proc.o
LD [M] fs/lockd/lockd.o
CC [M] fs/nfs/callback.o
CC [M] fs/nfs/callback_xdr.o
LD [M] fs/nfs/nfs.o
CC [M] fs/nfsd/nfssvc.o
CC [M] fs/nfsd/nfsctl.o
CC [M] fs/nfsd/nfsproc.o
CC [M] fs/nfsd/nfsfh.o
CC [M] fs/nfsd/vfs.o
CC [M] fs/nfsd/export.o
CC [M] fs/nfsd/auth.o
CC [M] fs/nfsd/lockd.o
CC [M] fs/nfsd/nfscache.o
CC [M] fs/nfsd/nfsxdr.o
CC [M] fs/nfsd/stats.o
CC [M] fs/nfsd/nfs2acl.o
CC [M] fs/nfsd/nfs3proc.o
CC [M] fs/nfsd/nfs3xdr.o
CC [M] fs/nfsd/nfs3acl.o
CC [M] fs/nfsd/nfs4proc.o
CC [M] fs/nfsd/nfs4xdr.o
CC [M] fs/nfsd/nfs4state.o
CC [M] fs/nfsd/nfs4idmap.o
CC [M] fs/nfsd/nfs4callback.o
CC [M] fs/nfsd/nfs4recover.o
LD [M] fs/nfsd/nfsd.o
LD fs/built-in.o
CC [M] net/sunrpc/svc.o
CC [M] net/sunrpc/svcsock.o
CC [M] net/sunrpc/svcauth.o
CC [M] net/sunrpc/svcauth_unix.o
CC [M] net/sunrpc/sunrpc_syms.o
CC [M] net/sunrpc/stats.o
LD [M] net/sunrpc/sunrpc.o
CC [M] net/sunrpc/auth_gss/auth_gss.o
CC [M] net/sunrpc/auth_gss/gss_mech_switch.o
CC [M] net/sunrpc/auth_gss/svcauth_gss.o
CC [M] net/sunrpc/auth_gss/gss_krb5_crypto.o
CC [M] net/sunrpc/auth_gss/gss_krb5_mech.o
CC [M] net/sunrpc/auth_gss/gss_krb5_seal.o
CC [M] net/sunrpc/auth_gss/gss_krb5_unseal.o
CC [M] net/sunrpc/auth_gss/gss_krb5_seqnum.o
CC [M] net/sunrpc/auth_gss/gss_krb5_wrap.o
CC [M] net/sunrpc/auth_gss/gss_spkm3_mech.o
CC [M] net/sunrpc/auth_gss/gss_spkm3_seal.o
CC [M] net/sunrpc/auth_gss/gss_spkm3_unseal.o
CC [M] net/sunrpc/auth_gss/gss_spkm3_token.o
LD [M] net/sunrpc/auth_gss/auth_rpcgss.o
LD [M] net/sunrpc/auth_gss/rpcsec_gss_krb5.o
LD [M] net/sunrpc/auth_gss/rpcsec_gss_spkm3.o
GEN .version
CHK include/linux/compile.h
UPD include/linux/compile.h
CC init/version.o
LD init/built-in.o
LD .tmp_vmlinux1
KSYM .tmp_kallsyms1.S
AS .tmp_kallsyms1.o
LD .tmp_vmlinux2
KSYM .tmp_kallsyms2.S
AS .tmp_kallsyms2.o
LD vmlinux
SYSMAP System.map
SYSMAP .tmp_System.map
OBJCOPY arch/ia64/hp/sim/boot/vmlinux.bin
GZIP arch/ia64/hp/sim/boot/vmlinux.gz
LN vmlinux.gz
Kernel: vmlinux.gz is ready
Building modules, stage 2.
MODPOST
WARNING: drivers/atm/fore_200e.o - Section mismatch: reference to .init.data:_fore200e_pca_fw_data from .data.rel.ro between 'fore200e_bus' (at offset 0x20) and 'fore200e_ops'
WARNING: drivers/atm/fore_200e.o - Section mismatch: reference to .init.data:_fore200e_pca_fw_size from .data.rel.ro between 'fore200e_bus' (at offset 0x28) and 'fore200e_ops'
WARNING: drivers/atm/fore_200e.o - Section mismatch: reference to .init.text:fore200e_pca_prom_read from .data.rel.ro between 'fore200e_bus' (at offset 0x90) and 'fore200e_ops'
WARNING: drivers/atm/he.o - Section mismatch: reference to .init.text:he_init_irq from .text between 'he_start' (at offset 0x2ca2) and 'he_irq_handler'
WARNING: drivers/atm/he.o - Section mismatch: reference to .init.text:he_init_cs_block_rcm from .text between 'he_start' (at offset 0x81c2) and 'he_irq_handler'WARNING: drivers/atm/he.o - Section mismatch: reference to .init.text: from .text between 'he_start' (at offset 0x81e2) and 'he_irq_handler'
WARNING: drivers/atm/he.o - Section mismatch: reference to .init.text:he_init_group from .text between 'he_start' (at offset 0x82a2) and 'he_irq_handler'
WARNING: drivers/block/cpqarray.o - Section mismatch: reference to .init.text:cpqarray_init_one from .data.rel.local between 'cpqarray_pci_driver' (at offset 0x20) and 'smart1_access'
WARNING: drivers/net/dgrs.o - Section mismatch: reference to .init.text:dgrs_pci_probe from .data.rel.local after 'dgrs_pci_driver' (at offset 0x20)
WARNING: drivers/net/tulip/de2104x.o - Section mismatch: reference to .init.text:de_init_one from .data.rel.local after 'de_driver' (at offset 0x20)
WARNING: drivers/net/tulip/de2104x.o - Section mismatch: reference to .exit.text:de_remove_one from .data.rel.local after 'de_driver' (at offset 0x28)
WARNING: drivers/usb/host/isp116x-hcd.o - Section mismatch: reference to .init.text:isp116x_probe from .data.rel.local between 'isp116x_driver' (at offset 0x0) and 'isp116x_hc_driver'
WARNING: drivers/video/aty/atyfb.o - Section mismatch: reference to .init.text:aty_get_pll_ct from .data.rel.ro after 'aty_pll_ct' (at offset 0x18)
WARNING: drivers/video/aty/atyfb.o - Section mismatch: reference to .init.text:aty_init_pll_ct from .data.rel.ro after 'aty_pll_ct' (at offset 0x20)
WARNING: drivers/video/nvidia/nvidiafb.o - Section mismatch: reference to .exit.text:nvidiafb_remove from .data.rel.local after 'nvidiafb_driver' (at offset 0x28)
WARNING: drivers/video/riva/rivafb.o - Section mismatch: reference to .exit.text:rivafb_remove from .data.rel.local after 'rivafb_driver' (at offset 0x28)
WARNING: sound/drivers/snd-dummy.o - Section mismatch: reference to .init.text:snd_dummy_probe from .data.rel.local between 'snd_dummy_driver' (at offset 0x0) and 'snd_dummy_controls'
WARNING: sound/drivers/snd-mtpav.o - Section mismatch: reference to .init.text:snd_mtpav_probe from .data.rel.local between 'snd_mtpav_driver' (at offset 0x0) and 'snd_mtpav_input'
WARNING: sound/drivers/snd-serial-u16550.o - Section mismatch: reference to .init.text:snd_serial_probe from .data.rel.local between 'snd_serial_driver' (at offset 0x0) and 'ops.15504'
WARNING: sound/drivers/snd-virmidi.o - Section mismatch: reference to .init.text:snd_virmidi_probe from .data.rel.local after 'snd_virmidi_driver' (at offset 0x0)
LD [M] fs/lockd/lockd.ko
LD [M] fs/nfs/nfs.ko
LD [M] fs/nfsd/nfsd.ko
LD [M] net/sunrpc/auth_gss/auth_rpcgss.ko
LD [M] net/sunrpc/auth_gss/rpcsec_gss_krb5.ko
LD [M] net/sunrpc/auth_gss/rpcsec_gss_spkm3.ko
CC net/sunrpc/sunrpc.mod.o
LD [M] net/sunrpc/sunrpc.ko


gnb@chook 1967> gcc --version
gcc (GCC) 4.1.0 (SUSE Linux)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

> And
> I'd be surprised if it worked very well with CONFIG_NUMA=n.

Why? The only new code which is sensitive to CONFIG_NUMA is
for_each_node(node) and that has a definition for !NUMA that
appears (by inspection) to do the right thing.

> And it's
> naughty to be sneaking general library functions into the sunrpc code
> anyway.

Fair enough, my bad.

>
> Please,
>
> - Write a standalone patch which adds highest_possible_node_id() to
> lib/cpumask.c(?)
>
> Make sure it's inside #ifdef CONFIG_NUMA
>
> Remember to export it to modules.
>
> Provide a !CONFIG_NUMA version in include/linux/nodemask.h which just
> returns constant zero.

Will do.

> Consider doing something more efficient than the for_each_node() loop.
> Although I'm not sure what that would be, given that we don't have
> find_last_bit().

That code is a copy-n-paste of highest_possible_processor_id().

> - Provide an incremental patch against
> knfsd-make-rpc-threads-pools-numa-aware.patch which utilises
> highest_possible_node_id().
>
> A replacement patch will be grudgingly accepted, but I'll only go and
> turn it into an incremental one, so you can't hide ;)

Ok, an incremental patch it is then. Let it not be said I
don't take a hint the fourth time.

> - Test it real good. Modular, non-modular, NUMA, non-NUMA, !SMP.
>

Right.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/