Engineering War Stories - Equilateral AI

🎯 The Killer Script

In the early days of IBM RS/6000 systems, I wrote two scripts: `monitor` to gather system stats and `merrimack` to collect and summarize them. A mistake in the `vmstat` and `iostat` parameters caused thousands of zombie monitors to spawn across the system.

My solution? A new script called `killer` that scanned `ps -ef` for rogue processes and `kill -9`'d them by PID. Effective—and very Unix.
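The original `killer` is not reproduced here, so the following is only a minimal sketch of the same idea, assuming the common eight-column `ps -ef` layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD). The function names `find_rogue_pids` and `kill_pids`, the `ROGUE_NAME` variable, and the `DRY_RUN` switch are hypothetical additions for illustration.

```shell
#!/bin/sh
# Hypothetical reconstruction of a "killer"-style cleanup; all names are illustrative.

# Read `ps -ef`-style lines on stdin and print the PID (column 2) of every
# process whose command word (column 8 or 9) has the given basename.
# Column 9 is checked so that "/bin/ksh /path/to/monitor" also matches.
find_rogue_pids() {
    ROGUE_NAME="$1" awk -v self="$$" '
        BEGIN { name = ENVIRON["ROGUE_NAME"] }
        NR == 1 { next }        # skip the header row
        $2 == self { next }     # never target this script itself
        {
            for (i = 8; i <= 9 && i <= NF; i++) {
                n = split($i, parts, "/")
                if (parts[n] == name) { print $2; next }
            }
        }'
}

# Kill each PID read from stdin. With DRY_RUN=1 (the default here, which is
# an assumption of this sketch) the PIDs are only reported.
kill_pids() {
    while read -r pid; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would kill $pid"
        else
            kill -9 "$pid"
        fi
    done
}

# Typical use:
#   ps -ef | find_rogue_pids monitor | kill_pids
#   ps -ef | find_rogue_pids monitor | DRY_RUN=0 kill_pids
```

The target name travels through the environment rather than the awk command line so that the filter does not match its own entry in the `ps` listing. A start time printed as two words (a month and a day, for example) would shift the columns, so a real script would need to account for that.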

This was before modern process management, before containers, before graceful shutdowns were standard practice. Sometimes the most direct solution was the right solution, even if it wasn't elegant.

Lessons for Modern Development:

Monitor your monitoring - recursive problems compound quickly
Simple, direct solutions often work better than complex ones
Always have a cleanup strategy for runaway processes
Name your scripts descriptively - `killer` was unambiguous

🌀 Hoover Eats Everything

At ADP, I built a CD-ROM mastering system for client mainframe reports. Year-end processing was buckling under leftover working files. I wrote a cleanup tool named Hoover to remove temp files based on CD status. It worked perfectly on my development box.

But on Christmas Eve, Hoover ran wild at a live site because of a missing environment variable. It executed `rm -rf .` and kept deleting its way through the directory tree until it crashed the system with the infamous flashing 888 code.

# The dangerous default that taught me everything about defensive programming
CLEANUP_PATH=${CLEANUP_PATH:-.}  # This dot nearly ended my career
rm -rf $CLEANUP_PATH/*
          

I had to ship a fresh mksysb tape overnight to recover the system. Christmas was spent rebuilding production infrastructure and writing much more paranoid shell scripts.

The Birth of Defensive Programming:

Never trust environment variables to exist in all environments
Always validate paths before destructive operations
Test in environments that mirror production exactly
Build in multiple safety checks, not just one
Holiday deployments are never "quick fixes"
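
One way to act on the first two lessons is to let the shell itself refuse to continue when a required variable is missing. The sketch below uses the standard `${VAR:?message}` expansion instead of a `:-` default; `cleanup_target` is a hypothetical helper name, and `CLEANUP_PATH` is the variable from the story.

```shell
#!/bin/sh
# Sketch: fail fast when a required variable is missing instead of defaulting.
# cleanup_target is a hypothetical helper, not part of the original Hoover.

cleanup_target() {
    # ${VAR:?message} writes the message to stderr and aborts the current
    # (sub)shell when VAR is unset or empty, so there is no silent
    # fallback to "." as in the original default.
    target=${CLEANUP_PATH:?must be set to the directory to clean}
    printf '%s\n' "$target"
}
```

A destructive command can use the same expansion inline, for example `rm -rf -- "${CLEANUP_PATH:?}"/*`, which stops before `rm` runs if the variable is unset or empty.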

🧱 Prodigy and the Bad Block

At 2 AM, I drove 90 minutes to a Prodigy data center. Their NetView box (RS/6000) had failed. Unlike AS/400s with journaling, RS/6000s couldn't easily recover from failed drives. The restore kept failing with a flashing 888 at different files each time.

Diagnostics said the drive was fine. But I insisted on replacing it anyway—sometimes you have to trust your instincts over the tools.

The next restore worked perfectly. The real issue? Bad blocks and an unimplemented call to the relocation routine—a "software error" that was technically true, but deeply unhelpful.

Hardware Intuition in a Software Age:

Cryptic error messages often hide simple hardware problems
When debugging gets circular, change the hardware
Trust patterns over individual diagnostic results
Sometimes "software errors" are hardware lying to software

🚨 Hoover 2.0 and the Ghost Variables

You'd think I had learned my lesson after the first Hoover incident. Hoover 2.0 had protections: environment variable checks, path validation, dry-run modes. They held until someone ran it without setting the scope variable.

It defaulted to `.` and wiped production logs before we caught it. From then on, every script had paranoid checks that would make modern security teams proud.

# The paranoid validation that saved my career
if [ -z "$CLEANUP_PATH" ] || [ "$CLEANUP_PATH" = "." ] || [ "$CLEANUP_PATH" = "/" ]; then
    # Send errors to stderr so they are not lost in redirected output
    echo "FATAL: Invalid or missing CLEANUP_PATH. Exiting." >&2
    exit 1
fi

if [ ! -d "$CLEANUP_PATH" ]; then
    echo "FATAL: CLEANUP_PATH does not exist. Exiting." >&2
    exit 1
fi

# Triple-check we're not in a system directory
case "$CLEANUP_PATH" in
    /bin*|/usr*|/etc*|/var/log|/)
        echo "FATAL: Refusing to clean system directory. Exiting." >&2
        exit 1
        ;;
esac
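
String comparisons like the ones above can still be sidestepped by spellings such as `./` or `..`, and the text mentions dry-run modes without showing one. As a hedged sketch of both ideas (not the original Hoover code), the hypothetical helpers `resolve_dir` and `hoover_sweep` below canonicalize the path with `cd` and `pwd -P` before checking it, and only list files unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: resolve_dir, hoover_sweep and DRY_RUN are hypothetical names.

# Print the canonical absolute path of a directory, or fail if it cannot
# be entered. The subshell keeps the caller's working directory unchanged.
resolve_dir() {
    ( CDPATH= cd -- "$1" 2>/dev/null && pwd -P )
}

# Remove regular files under the given directory. With DRY_RUN=1 (the
# default in this sketch) the files are listed and nothing is deleted.
hoover_sweep() {
    if [ -z "$1" ]; then
        echo "FATAL: no directory given" >&2
        return 1
    fi
    dir=$(resolve_dir "$1") || {
        echo "FATAL: cannot resolve $1" >&2
        return 1
    }
    # Check the canonical path, so "." or ".." cannot hide a system directory
    case "$dir" in
        /|/bin|/bin/*|/usr|/usr/*|/etc|/etc/*|/var/log)
            echo "FATAL: refusing to clean $dir" >&2
            return 1
            ;;
    esac
    if [ "${DRY_RUN:-1}" = "1" ]; then
        find "$dir" -type f -print
    else
        find "$dir" -type f -exec rm -f -- {} +
    fi
}
```

Defaulting to the dry run means that forgetting a variable produces a listing rather than a deletion, which inverts the failure mode described in the story.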
          

Modern AI Development Parallels:

AI assistants need the same paranoid validation
Never trust that context variables exist across sessions
Build multiple safety nets, not just one
Make failures loud and obvious, not silent
Document your assumptions - they will be wrong

From Shell Scripts to AI Workflows

These early disasters shaped how I approach modern AI-assisted development. The same principles that prevented Hoover 3.0 from destroying production now guide how I structure AI workflows with Claude and CLINE.

Whether it's a shell script or an AI assistant, the fundamental rule remains: assume everything will go wrong, and build your safeguards accordingly.

Today's .clinerules directories and MCP protocols are direct descendants of those paranoid shell script validations. The technology changes, but the need for defensive programming remains constant.

Submit Your War Story

Have a war story that haunts your dreams or taught you something invaluable? Every experienced developer has at least one story that makes them check their backup procedures and review their monitoring setup.

Stories can be anonymous and will focus on the technical lessons learned rather than blame. The goal is to help other developers avoid similar pitfalls or at least be better prepared when disaster strikes.
