Re: 2.6.17-mm6

From: Andi Kleen
Date: Thu Jul 06 2006 - 13:16:06 EST


ebiederm@xxxxxxxxxxxx (Eric W. Biederman) writes:

> Andrew Morton <akpm@xxxxxxxx> writes:
>
> > On Wed, 5 Jul 2006 17:05:49 -0700
> > "Keith Mannthey" <kmannth@xxxxxxxxx> wrote:
> >
> >> On 7/5/06, Andrew Morton <akpm@xxxxxxxx> wrote:
> >> > On Wed, 5 Jul 2006 16:44:57 -0700
> >> > Andrew Morton <akpm@xxxxxxxx> wrote:
> >> >
> >> > > I guess a medium-term fix would be to add a boot parameter to override
> >> > > PERCPU_ENOUGH_ROOM - it's hard to justify increasing it permanently just
> >> > > for the benefit of the tiny minority of kernels which are hand-built with
> >> > > lots of drivers in vmlinux.
> >>
> >> I am not really loading alot of drivers. I am building with a ton of driver.
> >> >
> >> > That's not right, is it. PERCPU_ENOUGH_ROOM covers vmlinux and all loaded
> >> > modules, so if vmlinux blows it all then `modprobe the-same-stuff' will
> >> > blow it as well.
> >> >
> >> > > But first let's find out where it all went.
> >> >
> >> > I agree with that person.
> >> :)
> >>
> >> This is what I get it is diffrent that yours for sure. I am a little
> >> confused by the large offset change near the start.....?
> >
> > Yes, I had an unexplained 8k hole. That was i386. Your x86_64 output
> > looks OK though.
> >
> >> elm3a153:/home/keith/linux-2.6.17-mm6-orig # nm -n vmlinux | grep per_cpu
> >>...
> >> ffffffff80658000 A __per_cpu_start It starts here
> >> ffffffff80658000 D per_cpu__init_tss 8k
> >> ffffffff8065a080 d per_cpu__idle_state
> >> ffffffff8065a084 d per_cpu__cpu_idle_state
> >> ffffffff8065a0a0 D per_cpu__vector_irq
> >> ffffffff8065a4a0 D per_cpu__device_mce
> >> ffffffff8065a518 d per_cpu__next_check
> >> ffffffff8065a520 d per_cpu__threshold_banks
> >> ffffffff8065a550 d per_cpu__bank_map
> >> ffffffff8065a580 d per_cpu__flush_state
> >> ffffffff8065a600 D per_cpu__cpu_state
> >> ffffffff8065a620 d per_cpu__perfctr_nmi_owner
> >> ffffffff8065a624 d per_cpu__evntsel_nmi_owner
> >> ffffffff8065a640 d per_cpu__nmi_watchdog_ctlblk
> >> ffffffff8065a660 d per_cpu__last_irq_sum
> >> ffffffff8065a668 d per_cpu__alert_counter
> >> ffffffff8065a670 d per_cpu__nmi_touch
> >> ffffffff8065a680 D per_cpu__current_kprobe
> >> ffffffff8065a6a0 D per_cpu__kprobe_ctlblk
> >> ffffffff8065a7e0 D per_cpu__mmu_gathers 4k
> >> ffffffff8065b7e0 d per_cpu__runqueues 5k
> >> ffffffff8065ca60 d per_cpu__cpu_domains
> >> ffffffff8065cae0 d per_cpu__core_domains
> >> ffffffff8065cb60 d per_cpu__phys_domains
> >> ffffffff8065cbe0 d per_cpu__node_domains
> >> ffffffff8065cc60 d per_cpu__allnodes_domains
> >> ffffffff8065cce0 D per_cpu__kstat wham - 17.5k
> >> ffffffff80661120 D per_cpu__process_counts
> >> ffffffff80661130 d per_cpu__cpu_profile_hits
> >> ffffffff80661140 d per_cpu__cpu_profile_flip
> >> ffffffff80661148 d per_cpu__tasklet_vec
> >> ffffffff80661150 d per_cpu__tasklet_hi_vec
> >> ffffffff80661158 d per_cpu__ksoftirqd
> >> ffffffff80661160 d per_cpu__tvec_bases
> >> ffffffff80661180 D per_cpu__rcu_data
> >> ffffffff80661200 D per_cpu__rcu_bh_data
> >> ffffffff80661280 d per_cpu__rcu_tasklet
> >> ffffffff806612c0 d per_cpu__hrtimer_bases
> >> ffffffff80661340 d per_cpu__kprobe_instance
> >> ffffffff80661348 d per_cpu__taskstats_seqnum
> >> ffffffff80661360 d per_cpu__ratelimits.18857
> >> ffffffff80661380 d per_cpu__committed_space
> >> ffffffff806613a0 d per_cpu__lru_add_pvecs
> >> ffffffff80661420 d per_cpu__lru_add_active_pvecs
> >> ffffffff806614a0 d per_cpu__lru_add_tail_pvecs
> >> ffffffff80661520 D per_cpu__vm_event_states
> >> ffffffff80661640 d per_cpu__reap_work
> >> ffffffff806616a0 d per_cpu__reap_node
> >> ffffffff806616c0 d per_cpu__bh_lrus
> >> ffffffff80661700 d per_cpu__bh_accounting
> >> ffffffff80661720 d per_cpu__fdtable_defer_list
> >> ffffffff806617c0 d per_cpu__blk_cpu_done
> >> ffffffff806617e0 D per_cpu__radix_tree_preloads
> >> ffffffff80661860 d per_cpu__trickle_count
> >> ffffffff80661864 d per_cpu__proc_event_counts
> >> ffffffff80661880 d per_cpu__loopback_stats
> >> ffffffff80661980 d per_cpu__sockets_in_use
> >> ffffffff80661a00 D per_cpu__softnet_data
> >> ffffffff80662000 D per_cpu__netdev_rx_stat
> >> ffffffff80662010 d per_cpu__net_rand_state
> >> ffffffff80662080 d per_cpu__flow_tables
> >> ffffffff80662100 d per_cpu__flow_hash_info
> >> ffffffff80662180 d per_cpu__flow_flush_tasklets
> >> ffffffff806621c0 d per_cpu__rt_cache_stat
> >> ffffffff80662200 d per_cpu____icmp_socket
> >> ffffffff80662208 A __per_cpu_end
> >
> > So you've been hit by the expansion of NR_IRQS which bloats kernel_stat
> > which gobbles per-cpu data.
> >
> > In 2.6.17 NR_IRQS is 244. In -mm (due to the x86_64 genirq conversion)
> > NR_IRQS became (256 + 32 * NR_CPUS). Hence the kstat "array" became
> > two-dimensional. It's now O(NR_CPUS^2).
>
> Hmm. Perhaps it is a good thing I didn't push it to it's limit
> of 244*NR_CPUS.

Clearly it just needs to be dynamically allocated and only
for interrupts that are actually used.

e.g. a 2 level array scheme?

Just need to make sure the memory overhead for small systems stays reasonable.

>
> > I don't know what's a sane max for NR_CPUS on x86_64, but that'll sure be a
> > showstopper if the ia64 guys try the same trick.
>
> I think the unisys boxes are about 128 sockets.

64 I thought. Newisys also plans 64 socket boxes.

> So 128 or 256 is the
> max at the moment. I'm not certain where things are heading. Big
> specialized boxes seem to be giving way to smaller systems. Yet the
> amount that you can do with commodity hardware is getting larger,
> so I suspect it will likely be a wash. The worst case scenario
> is that sgi will build a chipset supporting 32K cpus that works with
> both Xeons and Itaniums.

Currently x86 has a 8 bit APIC limit. So 255 is max.

But at some point I expect it will be broken, not with sockets
but with multicores and threads.

On the other hand it might be reasonable to size the NR_CPUs
only by sockets. The would lower the memory requirements.
>
> I'm tempted to suggest adding the following to irq_desc.
> unsigned int last_cpu_count;
> unsigned int last_cpu;

Add all the statistics to irq_desc?

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/