Only he can work on that

Only Phil can work on that section of the code. It’s fragile, there’s no test automation, and our product owners don’t know how it’s supposed to work in its entirety, so specifications are often written in the form “Make it do what it does today, except for this one change.”

Naturally, this module that only Phil can work on is Mission Critical, and this is a drag on the business.

What if Phil finally quits? What if Phil is hit by a bus? What if Phil just wants to work on something else for a change? I have seen so many companies refuse to address risks to their Business Continuity that I’m convinced this utterly insane and irresponsible way of operating is in fact good and normal and I’m the crazy one. Maybe if I got an MBA I’d understand.

I have two stories to tell, and it seems to me that you don’t want to find yourself in a similar story.

Which Button?

One company I worked for had a very small IT staff given our extremely specialized production hardware and the needs of two offices plus a nationwide distributed workforce. As I took over management for most of the software delivery engine, folks quit. (I’m told this was a long time coming and not related to me.) My final IT guy rage-quit in an email to a customer: I was CC’d on his reply, which said “I don’t know, because I don’t work here anymore.”

Luckily, we had just gone through the exercise of getting me fingerprinted, authorized, etc. for our hosting facility, because shortly thereafter a very large enterprise client called to say that a QA server we hosted for them was hung and needed to be rebooted. It was unresponsive to remote admin.

So I went to the data center.

I entered the cage.

I looked at the blinking lights.

I realized I did not know which server to reboot. Naturally very little was labeled.

If you don’t want to find yourself in a data center staring at racks of machines and wondering which button turns off Production and which one is someone’s unauthorized Quake server, it behooves you to have a business continuity plan.

Call Bob

Another company I worked for accidentally lost the source code to a mission critical software module. This is a very large company with thousands of employees and a lot of money. Despite one of this company’s core competencies being “Risk assessment”, they had never turned that lens inward. Whenever something went wrong, their only recourse was to call Bob. Bob was retired, but for an eye-watering hourly rate he would come in and debug the Mainframe Assembler code, which was the only way to determine why an unexpected number came out the other end of a process. Sure, the company has a lot of money, but Bob isn’t going to live forever. Maybe betting the farm that a single person in their mid-70s will always be around to save the day isn’t a great idea, but then again we’ve already established that I am the crazy one here.

Why?

In court, smart prosecutors will tell you they don’t ask a witness any question they don’t already know the answer to. I have to admit to being stumped on this one: why don’t businesses fix it? Yes, it’s easy to keep going without addressing these issues now, but the risk is phenomenal.

Countless times, I find myself sitting in planning events with the same problem: Bob and Steve are busy, and other teams are starved for work. Everyone is frustrated. My question is always the same:

What are we doing today to ensure we are not in this situation a year from now?

All solutions require investment: Is it a gigantic test automation effort so that non-expert teams can make changes with confidence? Is it requiring product to write up a comprehensive spec? Is it having Bob and Steve be shadowed by other teams?

One thing that is certainly not the answer is putting it off to an imaginary future state where you can “Catch your breath”, when there’s “Time to think”. These times never come, unless your business is in decline. Repeat: stop dreaming of a time when you’re “caught up”; it will never happen. Investing now is difficult, and it will delay other things, but it’s better than being in a server room staring at blinking lights and wondering which machine is the right one to reboot.
