🧨 Engineering War Stories
Tales from the trenches of software development - the disasters, the heroes, and the lessons that stick with you forever.
🎯 The Killer Script
In the early days of IBM RS/6000 systems, I wrote two scripts: monitor to gather system stats and merrimack to collect and summarize them. A mistake in the vmstat and iostat parameters caused thousands of zombie monitors to spawn across the system.
This was before modern process management, before containers, before graceful shutdowns were standard practice. Sometimes the most direct solution was the right solution, even if it wasn't elegant.
Lessons for Modern Development:
- Monitor your monitoring - recursive problems compound quickly
- Simple, direct solutions often work better than complex ones
- Always have a cleanup strategy for runaway processes
- Name your scripts descriptively - killer was unambiguous
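One way to keep copies of a collector from piling up is a single-instance guard. The sketch below is a minimal POSIX shell illustration, not the original monitor script: LOCK_DIR, acquire_lock, release_lock and run_once are hypothetical names, and it assumes the collector is started through run_once.

```sh
#!/bin/sh
# Sketch of a single-instance guard for a periodic collector script.
# LOCK_DIR, acquire_lock, release_lock and run_once are hypothetical names.

LOCK_DIR="${LOCK_DIR:-${TMPDIR:-/tmp}/monitor.lock}"

# mkdir is atomic, so only one caller can create the lock directory.
acquire_lock() {
    mkdir "$LOCK_DIR" 2>/dev/null || return 1
    echo "$$" > "$LOCK_DIR/pid"
}

# Remove the lock; safe to call even if the lock is already gone.
release_lock() {
    rm -f "$LOCK_DIR/pid"
    rmdir "$LOCK_DIR" 2>/dev/null || true
}

# Run a command only if no other instance holds the lock.
# Returns 75 (EX_TEMPFAIL in sysexits.h) when another instance is active.
run_once() {
    if ! acquire_lock; then
        echo "already running (lock: $LOCK_DIR), refusing to start" >&2
        return 75
    fi
    rc=0
    "$@" || rc=$?
    release_lock
    return "$rc"
}
```

A collector would then be launched as run_once vmstat 5 12 rather than directly. The guard does not handle a holder that is killed before it releases the lock; a real script would likely add a trap on EXIT and a check of the recorded PID for stale locks.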
🌀 Hoover Eats Everything
At ADP, I built a CD-ROM mastering system for client mainframe reports. Year-end processing was buckling under leftover working files. I wrote a cleanup tool named Hoover to remove temp files based on CD status. It worked perfectly on my development box.
But on Christmas Eve, Hoover ran wild at a live site because of a missing environment variable: it executed rm -rf . and deleted its way up the directory tree until it crashed the system with the infamous flashing 888 code.
I had to ship a fresh mksysb tape overnight to recover the system. Christmas was spent rebuilding production infrastructure and writing much more paranoid shell scripts.
The Birth of Defensive Programming:
- Never trust environment variables to exist in all environments
- Always validate paths before destructive operations
- Test in environments that mirror production exactly
- Build in multiple safety checks, not just one
- Holiday deployments are never "quick fixes"
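The checks in this list can be layered in front of any destructive command. The following is a minimal sketch under stated assumptions, not the original Hoover code: HOOVER_WORK_DIR, safe_clean and the *.tmp file pattern are hypothetical, chosen only to illustrate the idea.

```sh
#!/bin/sh
# Sketch of a guarded temp-file cleanup. HOOVER_WORK_DIR, safe_clean and
# the *.tmp pattern are hypothetical; the original tool is not shown here.

safe_clean() {
    dir="${HOOVER_WORK_DIR:-}"

    # 1. The variable must be set and non-empty.
    if [ -z "$dir" ]; then
        echo "safe_clean: HOOVER_WORK_DIR is not set, refusing to run" >&2
        return 2
    fi

    # 2. Only absolute paths are accepted, never "." or other relative names.
    case "$dir" in
        /*) ;;
        *)  echo "safe_clean: '$dir' is not an absolute path" >&2
            return 2 ;;
    esac

    # 3. The directory must exist; resolve symlinks and ".." components.
    real=$(cd "$dir" 2>/dev/null && pwd -P) || {
        echo "safe_clean: cannot enter '$dir'" >&2
        return 2
    }

    # 4. Require at least two path components, so "/" and "/usr" are refused.
    case "$real" in
        /*/*) ;;
        *)  echo "safe_clean: '$real' is too close to the root" >&2
            return 2 ;;
    esac

    # 5. Remove only files matching the expected temp pattern.
    find "$real" -type f -name '*.tmp' -exec rm -f {} +
}
```

Each check fails with a message on stderr and a non-zero status, so a missing variable stops the run instead of silently redirecting the delete to the current directory.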
🧱 Prodigy and the Bad Block
At 2 AM, I drove 90 minutes to a Prodigy data center. Their NetView box (RS/6000) had failed. Unlike AS/400s with journaling, RS/6000s couldn't easily recover from failed drives. The restore kept failing with a flashing 888 at different files each time.
Diagnostics said the drive was fine. But I insisted on replacing it anyway—sometimes you have to trust your instincts over the tools.
Hardware Intuition in Software Age:
- Cryptic error messages often hide simple hardware problems
- When debugging gets circular, change the hardware
- Trust patterns over individual diagnostic results
- Sometimes "software errors" are hardware lying to software
🚨 Hoover 2.0 and the Ghost Variables
You'd think I had learned my lesson after the first Hoover incident. Hoover 2.0 had protections: environment variable checks, path validation, dry-run modes. They held until someone ran it without setting the scope variable.
It defaulted to . and wiped production logs before we caught it. From then on, every script had paranoid checks that would make modern security teams proud.
Modern AI Development Parallels:
- AI assistants need the same paranoid validation
- Never trust that context variables exist across sessions
- Build multiple safety nets, not just one
- Make failures loud and obvious, not silent
- Document your assumptions - they will be wrong
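In shell terms, "loud, not silent" means a required variable gets no fallback value and deletion has to be requested explicitly. The sketch below illustrates that pattern only: HOOVER_SCOPE, HOOVER_CONFIRM, purge_logs and the *.log pattern are hypothetical names, and path validation is left out for brevity.

```sh
#!/bin/sh
# Sketch: no silent defaults, and a dry run unless deletion is confirmed.
# HOOVER_SCOPE, HOOVER_CONFIRM and purge_logs are hypothetical names.

purge_logs() {
    # No fallback value: an unset or empty scope is an error, never ".".
    # ${VAR:?msg} makes a non-interactive shell exit, so it runs in a
    # subshell here and only the function call fails.
    ( : "${HOOVER_SCOPE:?HOOVER_SCOPE must be set explicitly}" ) || return 2

    if [ ! -d "$HOOVER_SCOPE" ]; then
        echo "purge_logs: '$HOOVER_SCOPE' is not a directory" >&2
        return 2
    fi

    if [ "${HOOVER_CONFIRM:-no}" = "yes" ]; then
        # Explicitly confirmed: actually delete.
        find "$HOOVER_SCOPE" -type f -name '*.log' -exec rm -f {} +
    else
        # Dry run: list what would be removed and delete nothing.
        find "$HOOVER_SCOPE" -type f -name '*.log' -exec echo "would remove:" {} \;
    fi
}
```

The design choice is that forgetting a variable produces either an error or a harmless listing. Deleting anything takes two deliberate settings, a scope and a confirmation.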
From Shell Scripts to AI Workflows
These early disasters shaped how I approach modern AI-assisted development. The same principles that prevented Hoover 3.0 from destroying production now guide how I structure AI workflows with Claude and CLINE.
Today's .clinerules directories and MCP protocols are direct descendants of those paranoid shell script validations. The technology changes, but the need for defensive programming remains constant.
Submit Your War Story
Have a war story that haunts your dreams or taught you something invaluable? Every experienced developer has at least one story that makes them check their backup procedures and review their monitoring setup.
Stories can be anonymous and will focus on the technical lessons learned rather than blame. The goal is to help other developers avoid similar pitfalls or at least be better prepared when disaster strikes.