Re: On disabling AGP without working alternative (PCI fallback is broken for years)

From: Christian König
Date: Mon Nov 09 2020 - 08:57:57 EST


Hi Thomas,

On 09.11.20 at 12:40, Thomas “illwieckz” Debesse wrote:
Hi, on May 12 2020, a commit (ba806f9) disabling AGP in the default
build was merged.

It was signed off by Christian König and reviewed by Alex Deucher.
Distributions started to backport this commit, and it seems to have
happened with 5.4.0-48-generic on the Ubuntu 20.04 LTS side, which was
built on Sep 10 2020.

Around that time I noticed AGP computers experiencing lock-ups and
other problems making them unusable after the upgrade. After
investigating by bisecting Linux versions, I reverted the commit
and those computers were working again.

The commit message was:

  This means a performance regression for some GPUs,
  but also a bug fix for some others.

Unfortunately, this commit not only introduces a performance
regression but also makes some computers unusable, maybe all computers
with AMD CPUs.

One of the root causes may be that PCI GPUs have been broken for years
on AMD platforms; this was tested and verified on:

- K8-based computer with AGP
- K8-based computer with PCI Express
- K10-based computer with AGP
- Piledriver-based computer with PCI Express

That is interesting but doesn't make much sense from a technical perspective.

See, AGP is built on top of PCI; if PCI doesn't work, AGP won't work either. So why should AGP work while PCI doesn't?

If I'm not completely mistaken I should have a system from that time somewhere here.

The breakage was tested and reproduced from Linux 4.4 to Linux
5.10-rc2 (I have not tried older than 4.4).

PCI GPUs may be broken on some other platforms, but I have found
that testing on an Intel PC (with PCI Express) does not reproduce
the issue when the PCI GPU hardware is plugged in.

There are two patches I'm requesting comments on:

## drm/radeon: make all PCI GPUs use 32 bits DMA bit mask

https://lkml.org/lkml/2020/11/5/307

This one is not enough to fix PCI GPUs, but it is enough to prevent
r600_ring_test from failing on ATI PCI devices. Note that Nvidia PCI
GPUs can't be fixed by this, and that it uncovers another bug with AGP
GPUs when AGP is disabled at build time. Also, this patch may make PCI
GPUs work in a non-optimal way on platforms that accept them with a
40-bit DMA bit mask (like Intel-based computers that already work
without any patch).
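
As a rough illustration of the trade-off described above, here is a minimal, self-contained sketch of what narrowing the DMA mask means. It is a standalone model, not the radeon driver code: `bus_type`, `pick_dma_bits`, `dma_bit_mask` and `dma_addr_reachable` are hypothetical names, and the 40-bit default for everything except patched PCI GPUs is an assumption taken only from the description of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bus classification, for illustration only. */
enum bus_type { BUS_PCI, BUS_PCIE };

/* Mask with the low `bits` bits set, for 1 <= bits <= 64. */
uint64_t dma_bit_mask(unsigned int bits)
{
    assert(bits >= 1 && bits <= 64);
    return bits == 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}

/*
 * DMA address width requested for a device.
 * Assumption: 40 bits is the default, and the patch narrows
 * every PCI GPU to 32 bits.
 */
unsigned int pick_dma_bits(enum bus_type bus, bool patched)
{
    if (patched && bus == BUS_PCI)
        return 32;
    return 40;
}

/* True when the device can reach `addr` with a `bits`-wide mask. */
bool dma_addr_reachable(uint64_t addr, unsigned int bits)
{
    return (addr & ~dma_bit_mask(bits)) == 0;
}
```

In this model, a buffer at the 4 GiB boundary (0x100000000) is out of reach with a 32-bit mask but reachable with a 40-bit one, which is one way to see why forcing 32 bits may be non-optimal on platforms that already work with 40 bits.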

This patch is inspired by the patch made to solve this issue from
2012 on kernel 3.5: https://bugzilla.redhat.com/show_bug.cgi?id=785375

At the time, such a change may have been enough to fix the issue; that
is no longer true. More breakage may have been introduced since.

Also, maybe this patch becomes useless when other PCI bugs are fixed,
who knows? At least, this is an entry point for investigations.

## Revert "drm/radeon: disable AGP by default"

https://lkml.org/lkml/2020/11/5/308

This is the simple fix, but currently the only solution for AMD hosts
with an AGP port to get a display again, as without this revert those
computers do not have any alternative to run a display (not even PCI
GPUs).

Well, you can still use the agpmode parameter to override the default setting.

We simply don't have the time to support those older GPUs, and disabling AGP fixed quite a number of them.

I'm asking for comments on those patches. I may have reached my own
skill cap on kernel development anyway. I can repurpose hardware to
test any other patch and can contribute time for such testing. Unlike
AGP GPUs, PCI GPUs are hard to find, so you may appreciate the time and
availability offered.

The PCI GPU on AMD CPU issue was verified with both Nvidia
(GS 8400GS rev.2) and ATI (Radeon HD 4350) PCI GPUs. These sample
GPUs are not old cards from the previous millennium but capable
ones: TeraScale RV710 architecture on the ATI side and Tesla 1.0 NV98
on the Nvidia side. Both support OpenGL 3.3 and have 512M of VRAM.
The ATI one has an HDMI port, and it is known that some variants
of the Nvidia one (not the one I own but same specification) had an
HDMI port too.

To be honest, I think we will completely drop AGP support in the next 5 years or so; this includes both Nouveau and Radeon based GPUs.

We simply can't invest time maintaining a technology which has been deprecated for nearly 15 years now.

Regards,
Christian.


Also, fixing PCI GPUs may not be enough to fix AGP GPUs running
as PCI ones, since fixing some issues (not all) on PCI side raises
new issues with AGP GPUs running as PCI ones but not on native PCI
GPUs (see below).

Bugs aside, one thing that is important to consider against the AGP
disablement is that there is hardware out there that is very capable
and not that old. For example the ATI Radeon HD 4670 AGP
(RV730 XT) was still sold brand new after 2010 and is a powerful
and featureful GPU with 1GB of VRAM and an HDMI port. Performance with
it is still pretty decent on competitive games. To compare with other
open source drivers mainlined in Linux, to outperform this GPU a
user has to get an Intel UHD 600 or an Nvidia GTX 1060 from 2016.

Also, yet another thing that is important to consider against AGP
disablement is that although PCI Express was introduced in 2004,
AGP-compatible hardware was still being designed, produced and sold
until very late, especially on the AMD side. Computers with quad core
64-bit CPUs with virtualisation, 16GB of RAM and AGP exist, and this is
widely distributed consumer hardware, not specific esoteric hardware.

So, not only were powerful AGP GPUs still sold brand new in the current
decade, but there were also very capable computers to host them. Because
of those AGP computers, fixing the PCI GPU fallback is not a solution:
PCI fallback itself is not a solution.

All that range of hardware became unusable with that commit disabling
AGP, without alternative.

Not only do those AGP GPUs not work with the kernel's PCI fallback, but
unplugging those AGP GPUs and plugging in physical PCI-native GPUs
instead does not work either.

You'll find more details about the various issues in these bug reports.
I've invested multiple full-time days testing and reproducing bugs on a
wide range of hardware, and I've attached, quoted and commented on a lot
of logs:

- https://bugs.launchpad.net/bugs/1899304
  AGP disablement leaves GPUs without working alternative
  (PCI fallback is broken), makes very-capable ATI TeraScale GPUs
  unusable
- https://bugs.launchpad.net/bugs/1902981
  AGP GPUs driven as PCI ones (when AGP is disabled at kernel build
  time) are known to fail on K8 and K10 platforms
- https://bugs.launchpad.net/bugs/1902795
  PCI graphics broken on AMD K8/K10/Piledriver platform (while it
  works on Intel) verified from Linux 4.4 to 5.10-rc2

I wish to be personally CC'ed on the answers/comments posted to the
list in response to my posting.

Thank you for your attention.

-- 
Thomas “illwieckz” Debesse