Taylor Swope describes finding the cause of a rare bug in the game The Outer Worlds in this Twitter thread. Since games are visual simulations, sometimes you get to see the bug, like you can here. In the software I work on, success and failure are usually text written to a log somewhere. Despite having to imagine the consequences, the principles are the same, and so this story felt familiar. It demonstrates that for really hard bugs, you need to collect as much "nearby" information as you can, so you can find the "unexpected" interactions that are causing the problem. For example, in this story, without investigating cases where characters might be falling, the author may not have realized that "climbing nothing" was a related problem.
This is a great latency debugging story, with a good demonstration of how to attack a really hard problem: first, reproduce the issue. Then, reduce the reproduction down as much as you can. Finally, you'll have to dig into what is left until you understand the root cause. In this particular case, it starts with mysterious slow network requests, and ends with a bug in the Linux kernel. This article discusses a ton of low-level details I didn't really follow, but in the middle is an excellent description of how the Linux kernel delivers packets to applications, with some helpful diagrams of how it can go wrong.
The author, Frank McSherry, sent this to me in response to my post about ordered versus unordered indexes. This post is fun, but I think the real insight is buried way at the bottom, after all the stuff about graph processing: sequential IO is substantially faster than random IO, even in memory. The post includes an example where it is faster to sort the data and then process it, rather than process it directly, even if the data fits in memory. The other interesting point is that sorting can transform random access into sequential access in some cases. This is a great point, and it made me realize that I did not mention locality in my post about why databases prefer sorted indexes.
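A minimal sketch of the "sort first" idea, not taken from the post itself: sum array elements at a set of indices, once in random order and once after sorting the indices. Sorting turns scattered cache misses into a mostly sequential scan, so the second pass tends to be faster even after paying for the sort (the magnitude of the effect depends heavily on the data size, the language, and the hardware).

```python
import random
import time

N = 1_000_000
data = list(range(N))
indices = [random.randrange(N) for _ in range(N)]  # random access pattern

def total(idxs):
    # sum the elements of `data` at the given index positions
    return sum(data[i] for i in idxs)

start = time.perf_counter()
random_sum = total(indices)  # jump around memory unpredictably
random_time = time.perf_counter() - start

sorted_indices = sorted(indices)  # pay the sort cost once...
start = time.perf_counter()
sorted_sum = total(sorted_indices)  # ...then walk memory in order
sorted_time = time.perf_counter() - start

# Same elements are summed either way; only the access pattern differs.
assert random_sum == sorted_sum
print(f"random order: {random_time:.3f}s, sorted order: {sorted_time:.3f}s")
```

In Python the interpreter overhead dilutes the effect, but the same experiment in C or Rust over a large enough array shows the locality win clearly.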
This article divides tests into two categories: correctness and engineering productivity. I think this is a fairly novel and useful model for thinking about tests. It helps explain why very primitive "change detector" tests, which just verify that the output does not change, can be effective: they help eliminate some variables or uncertainty when changing a large piece of software. They are not effective at verifying that the software actually "works", because that is not their goal. Nelson previously wrote two other articles about testing that I refer to every now and then: How I Write Tests and Design for Testability. This reminds me that I should revisit my idea for a "Philosophy of Software Testing" book, since I have yet to find a book on testing I can recommend to people who want to improve at this critical software engineering skill.
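To make the "change detector" idea concrete, here is a hypothetical example (the `render` function and the recorded value are my own invention, not from the article): the test does not check that the output is correct, only that it matches a previously recorded "golden" value, so any behavior change during a refactor gets flagged for human review.

```python
import json

def render(user):
    # Stand-in for real application code whose exact output we want frozen.
    return {"name": user["name"].title(), "active": user.get("active", True)}

# Output recorded from an earlier, trusted run of render().
GOLDEN = '{"active": true, "name": "Ada Lovelace"}'

def test_output_unchanged():
    # sort_keys makes the serialization stable across runs
    out = json.dumps(render({"name": "ada lovelace"}), sort_keys=True)
    assert out == GOLDEN, f"output changed: {out}"

test_output_unchanged()
```

If a refactor changes the output, the test fails and you decide whether the change was intended, and if so, re-record the golden value. That is the productivity role: it narrows down what could have changed, without claiming the output was ever right.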