Troubleshooting errorless system hang on kernel 6.7 (AMDGPU)

0x0@social.rocketsfall.net · edit-2 5 months ago

Rockslide0482@discuss.tchncs.de · 5 months ago

TLDR: do memtest on your RAM

For quite some time I had an issue where my computer would occasionally just hard crash. When it first started happening I tried many of the common tests, including a memory check, but found nothing. For a while it wasn’t super common, so I just lived with it. I thought it was an OS thing, but it occurred on a different Linux distro and even on the ancient Windows 10 install I have but rarely use. I was just about to pull the trigger on replacing the mobo and maybe even the CPU and RAM. Before I did that, I followed someone’s suggestion to run a memory test. I could have sworn I had already done that and it came back clean, but it was an easy enough test to run, so why not.

Sure enough, it found an error. I isolated the faulty DIMM, pulled it out, and I haven’t had a crash since. Crazy, since I’m all but certain I ran both memtest from a Linux live ISO and the Windows memory checking utility.

In short, test your RAM. Do multiple passes. Maybe even try swapping out one DIMM at a time and running like that for a reasonable amount of time to see if you can isolate a culprit. Bad RAM was my first thought when the issue first occurred, because it’s usually what causes stuff like that. When the tests originally came up clean, I assumed it had to be something else. I was wrong.
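To illustrate what "multiple passes" means, here is a rough Python sketch of a pattern test: fill a buffer with a known byte pattern, read it back, and repeat with several patterns over several passes. The function names, the pattern set, and the buffer size are my own choices for illustration. This is a toy that only touches whatever memory the OS hands to the process, so it is no substitute for booting a real tester like memtest.

```python
# Toy illustration of a multi-pass pattern test. NOT a real RAM tester:
# it only exercises memory the OS gives this process.
# All names and pattern choices here are illustrative assumptions.

def fill(buf: bytearray, pattern: int) -> None:
    """Overwrite every byte of buf with the given byte pattern."""
    buf[:] = bytes([pattern]) * len(buf)


def verify(buf: bytearray, pattern: int) -> list[int]:
    """Return the offsets whose byte no longer matches the pattern."""
    return [i for i, b in enumerate(buf) if b != pattern]


def run_passes(size_bytes: int, passes: int,
               patterns=(0x00, 0xFF, 0xAA, 0x55)) -> list[tuple[int, int, int]]:
    """Run several passes over one buffer.

    Returns a list of (pass_number, pattern, offset) for every mismatch.
    """
    buf = bytearray(size_bytes)
    errors = []
    for n in range(passes):
        for p in patterns:
            fill(buf, p)
            for off in verify(buf, p):
                errors.append((n, p, off))
    return errors


if __name__ == "__main__":
    # 16 MiB, 3 passes; an empty list means no mismatch was seen.
    print(run_passes(16 * 1024 * 1024, passes=3))
```

The point of repeating passes is that marginal memory can survive one pass and fail a later one, which is probably why my earlier tests came up clean.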

0x0@social.rocketsfall.net · 5 months ago

This is what I’ll try next. I do think memory is the problem now that I’ve had a few more hours of research. Kernel 6.7 has issues with elevated RAM usage, so it’s absolutely doing something funky with memory that might be exposing underlying hardware issues. I also realized my stable kernel (6.6.10) was a few point releases behind 6.6.13, so I’m running 6.6.13 now to see if the issue was introduced late in the 6.6 release cycle, which would be easier to bisect than 6.7.
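For anyone unfamiliar with bisecting, the idea is a binary search between a known-good and a known-bad version. A minimal Python sketch of the logic follows; `first_bad` and the `hangs` callback are hypothetical names, and in practice "testing" a version means booting it and waiting to see whether it hangs. On a kernel git tree, `git bisect` automates the same search over commits.

```python
from typing import Callable, Sequence


def first_bad(candidates: Sequence[str], hangs: Callable[[str], bool]) -> str:
    """Find the first version that hangs.

    Assumes candidates are ordered oldest to newest, candidates[0] is
    known good, candidates[-1] is known bad, and every version after
    the first bad one is also bad.
    """
    lo, hi = 0, len(candidates) - 1  # lo: last known good, hi: first known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if hangs(candidates[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return candidates[hi]


if __name__ == "__main__":
    # The version list and the "bad" set below are made up for the demo;
    # they are not test results.
    versions = ["6.6.10", "6.6.11", "6.6.12", "6.6.13", "6.7"]
    pretend_bad = {"6.6.13", "6.7"}
    print(first_bad(versions, lambda v: v in pretend_bad))
```

Each step halves the remaining range, which is why a short range of 6.6 point releases is much quicker to narrow down than the full span from 6.6 to 6.7.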

Update (01-27-2024)

List of similar issues

Patched/Unpatched 6.8rc1 attempts

Bisecting 6.6 to 6.7

The state of AMDGPU in general