[That's like] Me, with zero C/C++ experience, being asked to figure out why the newer version of the Linux kernel is randomly crash-panicking after getting cross-compiled for a custom hardware box.
("He's familiar with the the build-system scripts, so he can see what changed.")
-----
I spent weeks testing slightly different code versions, different compile settings, different kconfig options, knocking out particular drivers, waiting for recompiles and walking back and forth to reboot the machine, and generally puzzling over extremely obscure and shifting error traces... And guess what? The new kernel was fine.
What was not fine were some long-standing hexadecimal arguments to the hypervisor, which had been corrupting a spot of memory in every kernel we'd ever loaded. It just so happened that the newer builds shifted bytes around so that something very important landed in the blast zone.
Anyway, that's how 3 weeks of frustrating work can turn into a 2-character change.
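A toy sketch of that failure mode, with entirely invented names and layouts (nothing here is the actual kernel or hypervisor interface): the same stray write to a fixed offset is harmless under one build's layout and fatal under another.

```cpp
// Toy illustration only -- names and layouts are made up.
// A bad hypervisor argument aims a stray 8-byte write at a fixed offset
// into the loaded image; whether that matters depends on what the build
// happened to place there.
#include <cstdint>
#include <cstring>
#include <iostream>

// "Old" build: the corrupted offset lands in a scratch buffer nobody reads.
struct OldLayout {
    char     scratch[64];   // blast zone falls here -- harmless
    uint64_t critical_ptr;  // survives untouched
};

// "New" build: things got reordered; the same offset now hits
// something very important.
struct NewLayout {
    uint64_t critical_ptr;  // blast zone now clobbers this -- random panics
    char     scratch[64];
};

template <typename Layout>
void boot(uintptr_t corrupt_offset) {
    Layout image{};
    image.critical_ptr = 0xC0FFEE;  // stand-in for "important kernel state"
    std::memset(reinterpret_cast<char*>(&image) + corrupt_offset, 0xFF, 8);
    std::cout << (image.critical_ptr == 0xC0FFEE ? "boots fine\n" : "panic\n");
}

int main() {
    const uintptr_t kCorruptOffset = 0;  // the fixed spot the bad hex argument hits
    boot<OldLayout>(kCorruptOffset);     // "boots fine" -- corruption goes unnoticed
    boot<NewLayout>(kCorruptOffset);     // "panic" -- same corruption, new layout
}
```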
Combination PIT/serial interrupt issue involving a microsecond-resolution system programmable interval timer and a multi-port serial driver. Would crash every day or so.
Had to create a stress test to reproduce in minutes, not days. Then trace code paths through timers and serial events to find the problematic path. Turned out there were many: the timer interrupt callback could cancel the interrupt, reschedule the timer, change the interval, or cancel and then reschedule. All in the presence of other channel interrupts occurring and overlapping unpredictably. Timers got rescheduled for intervals that had already passed by the time the callback completed. And on and on.
Took a weekend alone with the code and a set of machines, desk-time getting my head around it all, then coding bullet-proof paths for all calls and callbacks for every related system call.
Once it worked, it worked for days, then months, under test. Nothing is so hard that it can resist a methodical approach.
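A minimal user-space sketch of the defensive pattern described above, assuming one lock around timer state plus a generation counter to detect "cancelled or rescheduled while the callback ran". All names are invented, and the real thing was interrupt-context driver code, not std::mutex.

```cpp
// Sketch only: one lock over timer state, a generation counter, and
// deadlines always computed from "now" so a late callback can never
// reschedule into the past.
#include <chrono>
#include <functional>
#include <mutex>

using Clock = std::chrono::steady_clock;

struct Timer {
    std::mutex            lock;
    Clock::time_point     deadline{};
    Clock::duration       interval{};
    std::function<void()> callback;
    unsigned              generation = 0;   // bumped on every cancel/reschedule
    bool                  armed = false;
};

// Arm (or re-arm) the timer. Safe to call from inside the callback.
void schedule(Timer& t, Clock::duration interval) {
    std::lock_guard<std::mutex> g(t.lock);
    ++t.generation;
    t.interval = interval;
    t.deadline = Clock::now() + interval;   // from "now", never from the old deadline
    t.armed = true;
}

void cancel(Timer& t) {
    std::lock_guard<std::mutex> g(t.lock);
    ++t.generation;
    t.armed = false;
}

// Called from the dispatch path when the deadline may have been reached.
void fire(Timer& t) {
    std::function<void()> cb;
    unsigned gen = 0;
    {
        std::lock_guard<std::mutex> g(t.lock);
        if (!t.armed || Clock::now() < t.deadline) return;
        gen = t.generation;
        cb  = t.callback;
        t.armed = false;                    // one-shot until explicitly re-armed
    }
    if (cb) cb();                           // run the callback without the lock held
    std::lock_guard<std::mutex> g(t.lock);
    // Only auto-rearm if the callback didn't cancel or reschedule us meanwhile.
    if (gen == t.generation) {
        t.deadline = Clock::now() + t.interval;
        t.armed = true;
    }
}
```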
Upgrading from Qt 4 to Qt 5 broke appending QStrings to QByteArrays such that only half the data from a QString got stored (some wonkiness with UTF-8 and UTF-16, IIRC). Took a rewrite of the RTMP/AMF layer in the codebase to figure it out.
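If the culprit was the implicit QString overload of QByteArray::append(), the defensive workaround is to make the conversion explicit. A hedged sketch: appendAmfString is a made-up helper name, while QString::toUtf8() and QByteArray::append(const QByteArray&) are real Qt API.

```cpp
// Sketch of making the QString -> QByteArray conversion explicit.
#include <QByteArray>
#include <QString>

void appendAmfString(QByteArray& out, const QString& value) {
    // Avoid the implicit QString overload of QByteArray::append(), whose
    // encoding behaviour reportedly changed between Qt 4 and Qt 5 (see above);
    // converting explicitly pins down the encoding on both versions.
    const QByteArray utf8 = value.toUtf8();
    out.append(utf8);
}
```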
Recently, this one which I'm still investigating -- if you want to help :) https://github.com/anza-xyz/agave/pull/4585
Recycling a comment, where part of the annoyance came from the feeling that they should have been asking someone else to solve it: https://news.ycombinator.com/item?id=37859771
_____
Any flaky Selenium test.
A rendering-corruption or perf issue in Wayland that involves 100 processes.