Re: [PATCH 1/1] via-rhine: Fix hanging with high CPU load on low-end boards.

From: Bjarke Istrup Pedersen
Date: Wed Dec 28 2011 - 12:30:11 EST


2011/12/28 Ben Hutchings <bhutchings@xxxxxxxxxxxxxx>:
> On Wed, 2011-12-28 at 16:14 +0100, Bjarke Istrup Pedersen wrote:
>> 2011/12/28 Ben Hutchings <bhutchings@xxxxxxxxxxxxxx>:
>> > On Wed, 2011-12-28 at 12:28 +0000, Bjarke Istrup Pedersen wrote:
>> >> Work around a problem that causes high CPU load and hangs the system
>> >> when there is a lot of network traffic.
>> >>
>> >> It is kind of an ugly way to work around it, but it allows the Soekris
>> >> net5501 to pass traffic between two of its NICs without hanging so
>> >> long that the watchdog kicks in and does a hard reboot of the system.
>> >>
>> >> There is more info on the problem here:
>> >> http://lists.soekris.com/pipermail/soekris-tech/2010-October/016889.html
>> >>
>> >> Tested with positive results on two Soekris net5501-70 boxes.
>> >
>> > This is completely wrong.  In a UP configuration the extra spinlock
>> > calls have no effect (except perhaps a small delay).  In an SMP
>> > configuration they will cause rhine_tx() to deadlock when it also tries
>> > to acquire the spinlock.
>> >
>> > Ben.
>>
>> Okay, the Soekris net5501-70 boxes are single-core, and I haven't got
>> any SMP boxes with that NIC.
>> Is there a better solution for the problem then, to avoid it hanging
>> the box on a non-SMP machine with a slow (500 MHz) CPU?
>
> If the system actually hangs then I assume there is some bug in the
> driver.  I would guess the actual problem is that the interrupt and NAPI
> handlers are running constantly so that user processes never run (which
> I think counts as soft lockup).
>
> If the hardware supports it, interrupt moderation may help a little by
> slightly reducing the per-packet processing cost, but it isn't a full
> solution.  Or you can use a real-time kernel, which schedules interrupt
> and NAPI handlers as tasks, and adjust priorities so that user processes
> can still run.  But that brings its own problems (including generally
> lower throughput).
>
> Ben.

That would be an option, but I don't think the hardware supports it,
and it doesn't fix the problems in the driver, just hides them.
From what I can read in the thread I linked to in the patch, the
problem only exists in the Linux driver - *BSD isn't affected by this,
on the same hardware.

Since the hack here fixes the problem on non-SMP machines, it seems
like there are some race conditions in the interrupt code in the
driver, like you mention.
I don't know this driver well enough to pinpoint what's causing it,
but if somebody else does and has some ideas, I'll be more than
willing to test :)
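
To illustrate the SMP deadlock Ben describes above, here is a minimal
user-space sketch in C11. It is not driver code: toy_spinlock,
inner_tx() and outer_with_extra_lock() are hypothetical names made up
for this example, and a try-lock stands in for the blocking
spin_lock() so that the failure can be observed instead of hanging.
The assumption, taken from Ben's reply, is that the inner routine
(rhine_tx() in his description) acquires the same lock the caller
already holds.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a non-recursive spinlock (NOT the kernel's spinlock_t). */
struct toy_spinlock {
    atomic_flag held;
};

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once to take the lock. Returns false if it is already held,
 * no matter who holds it, including the caller itself. */
static bool toy_trylock(struct toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void toy_unlock(struct toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Hypothetical inner routine that takes the lock on its own. */
static bool inner_tx(struct toy_spinlock *l)
{
    if (!toy_trylock(l))
        return false;   /* a blocking lock would spin here forever */
    /* ... per-packet work would go here ... */
    toy_unlock(l);
    return true;
}

/* Hypothetical caller that wraps the inner routine in an extra
 * acquisition of the same lock. */
static bool outer_with_extra_lock(struct toy_spinlock *l)
{
    bool ok;

    if (!toy_trylock(l))
        return false;
    ok = inner_tx(l);   /* fails: the lock is already held by us */
    toy_unlock(l);
    return ok;
}
```

Calling inner_tx() on a free lock succeeds, while
outer_with_extra_lock() reports failure because the nested acquisition
cannot succeed. With a blocking lock in place of the try-lock, that
failure would be a CPU spinning on a lock it already holds.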

/Bjarke

> --
> Ben Hutchings, Staff Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/