Re: Issues with AMD microcode updates

From: Henrique de Moraes Holschuh
Date: Fri Sep 27 2013 - 15:37:11 EST


On Thu, 26 Sep 2013, Sherry Hurwitz wrote:
> We have failed to reproduce a hang while loading microcode.

I got an offer from a Debian user to test it over the weekend, let's hope
he will have more luck(?) at hitting the issue. If he does, it should give
us sysrq+t dumps of the hung system.
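
For whoever ends up testing: the same dump can be requested from a second root shell if the keyboard combination is not available. A minimal sketch, assuming the standard procfs sysrq files; the function name and the two override variables are hypothetical and exist only to allow a dry run against scratch files:

```shell
# Sketch: request a task-state dump (the sysrq+t equivalent) through procfs.
# Needs root. SYSRQ_CTL and SYSRQ_TRIGGER are hypothetical overrides for
# dry runs; by default the real procfs files are used.
dump_task_states() {
    ctl=${SYSRQ_CTL:-/proc/sys/kernel/sysrq}
    trigger=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}
    echo 1 > "$ctl" || return 1       # enable all sysrq functions
    echo t > "$trigger" || return 1   # ask the kernel to log all task states
}
```

The dump itself lands in the kernel log, so it still has to be collected afterwards (dmesg, serial console, or netconsole).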

> We have tested with kernel and AMD family combinations with
> normal and error condition so error paths were taken. Obviously
> there are factors we are missing that the users are hitting.

Yeah, and it is not likely to be caused by a distro kernel patch, as the
users hit the issue using non-distro kernels :-(

Maybe it is on the firmware-loader side, but one user did wait 1 hour for
the thing to get unstuck, and that would have taken care of any possible
firmware-loader timeouts.

> Any suggestions on how we improve the test matrix would be
> helpful. We will continue the investigation but any insights are appreciated.
>
> NOTE: kernels before 3.0 only load 1 (2k) size of microcode patch and
> therefore do not support microcode loading of family 14h, 15h, and 16h.
> Also, in a test request on another thread you suggested someone with
> family 15h revC0 to load microcode twice with an earlier patch and then
> the latest, but there has only been 1 microcode patch level published for revB2
> so that test won't work.

Well, it is the only thing I could think of, other than some nasty race
condition...
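
As for the pre-3.0 limitation in the note, a test script could at least check for it before attempting a load. A minimal sketch, taking the 3.0 cutoff from the note above as given; the helper name is hypothetical:

```shell
# Hypothetical helper: succeeds if the given kernel release (default: the
# running kernel) is 3.0 or later, i.e. per the note above it is not
# limited to 2k microcode patches.
kernel_handles_large_patches() {
    release=${1:-$(uname -r)}
    major=${release%%.*}
    case "$major" in
        ''|*[!0-9]*) return 2 ;;   # unparseable release string
    esac
    [ "$major" -ge 3 ]
}
```

It only looks at the major version number, which is enough for a 3.0 cutoff.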

> kernel  cpu family    results      conditions
> ------  ------------  -----------  --------------------------------
> 2.6.38  fam10h        load passed  normal
> 2.6.38  fam15h revC0  load failed  2.6.38 can not handle 4k patches
> 3.5.2   fam10h        load passed  normal
> 3.5.2   fam15h revB2  load passed  loaded 637 then second load 63d
> 3.5.2   fam15h revC0  load passed  normal
> 3.5.2   fam15h revC0  load failed  used a corrupted bin file

I just looked: the 2.6.38 hang happened on i686 with an unidentified
3-core AMD processor, and the 3.5.2 hang on x86-64 PREEMPT with a fam15h
model 2 stepping 0, 32-core AMD processor (Linux 3.5.2 (SMP w/32 CPU cores;
PREEMPT)). No patterns there.

BTW, the userspace script that users reported to have hung is this:

grep -q "^vendor_id[[:blank:]]*:[[:blank:]]*.*AuthenticAMD" /proc/cpuinfo && {
    if modprobe -q --first-time microcode ; then
        echo "Updating microcode on all online processors..." >&2
    else
        # we have to trigger the microcode update manually
        if [ -e /sys/devices/system/cpu/microcode/reload ] ; then
            echo "Updating microcode on all online processors..." >&2
            echo 1 > /sys/devices/system/cpu/microcode/reload || {
                echo "Kernel reported failure while updating microcode!" >&2
            }
        else
            # Try all online processors, broken kernels need this,
            # fixed kernels will accept it only on the BSP and update
            # all processors anyway, and -EINVAL all others... but we
            # don't know which one is the BSP, so we try all of them
            # and hide errors, the kernel will log any real problem.
            echo "Using per-core interface to update microcode on online processors..." >&2
            find /sys/devices/system/cpu -noleaf -type f -path '/sys/devices/system/cpu/cpu*/microcode/reload' | \
                while read i ; do echo -n 1 2>/dev/null >"$i" || true ; done
        fi
    fi
}


This was with the microcode driver already loaded, so the modprobe line
fails and the script takes the else branch.
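
When collecting reports it may help to know which of the two sysfs branches a given system takes. A minimal sketch, assuming the same sysfs paths the script above tests; `reload_interface` is a hypothetical helper and its optional root argument exists only so it can be tried against a scratch directory:

```shell
# Hypothetical helper: print which microcode reload interface is present.
#   global    -> <root>/microcode/reload exists
#   per-core  -> at least one <root>/cpu*/microcode/reload exists
#   none      -> neither exists (returns non-zero)
reload_interface() {
    root=${1:-/sys/devices/system/cpu}
    if [ -e "$root/microcode/reload" ]; then
        echo global
        return 0
    fi
    for f in "$root"/cpu*/microcode/reload; do
        if [ -e "$f" ]; then
            echo per-core
            return 0
        fi
    done
    echo none
    return 1
}
```

It only reads sysfs and never writes to a reload file, so it should be safe to run on an affected machine.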

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh