It’s an Ecosystem

Companies regularly attempt to customize a tool, strategy, technique, or process without understanding how its parts complement each other.

Suppose you visit a friend with some land, and you discover that this friend has the most peaceful and amazing koi pond you’ve ever seen. It’s a reasonably large pond, surrounded by beautiful trees and vegetation, and the water is so clear you can see the koi perfectly. Best of all, there are hardly any insects bothering you. You resolve to recreate this enviable spot, but with one exception: the noise all the frogs make annoys you.

So, you dig a similarly sized pond. You plant the same vegetation, and you stock the same number and size of koi. You come back after a season to enjoy your paradise, but you’re horrified to find that you’re being eaten alive by insects. The same insects are ravaging the vegetation you planted, and so many mosquitoes are breeding in the still water that you can’t see your beloved koi.

You neglected to understand how the ecosystem worked before you modified the formula. The frogs kept the insects under control, and now the ecosystem is out of balance.

So many times, I see companies try this approach. “I like Scrum, but I can’t seriously let teams determine how far to go in each 30-day increment…” “We like SAFe, but there’s no way we can separate value streams like that…” “We do DDD, but we don’t like aggregate roots…” The implementer is then disappointed when they don’t observe the value that the technique, process, or tool promises. Why are teams not committing to work the way Scrum says they will? Did you take away part of their self-determination? Did you demand fixed scope on a fixed timeline, or not allow teams to self-organize?

Scrum is an ecosystem of ideas. SAFe is an ecosystem of ideas. The GitHub-centric way of doing open source is an ecosystem of ideas, practices, and norms.

You may have adopted part of a system that only works because of the harmony of all its constituent parts. Beware modifying an ecosystem you don’t understand, lest you find yourself disappointed after a great deal of effort.

Only he can work on that

Only Phil can work on that section of the code. It’s fragile, there’s no test automation, and our product owners don’t know how it’s supposed to work in its entirety, so specifications are often written in the form “Make it do what it does today, except for this one change.”

Naturally, this module that only Phil can work on is Mission Critical, and this is a drag on the business.

What if Phil finally quits? What if Phil is hit by a bus? What if Phil just wants to work on something else for a change? I have seen so many companies refuse to address risks to their Business Continuity that I’m convinced this utterly insane and irresponsible way of operating is in fact good and normal and I’m the crazy one. Maybe if I got an MBA I’d understand.

I have two stories to tell, and it seems to me that you don’t want to find yourself in a similar story.

Which Button?

One company I worked for had a very small IT staff, given our extremely specialized production hardware and the needs of two offices plus a nationwide distributed workforce. As I took over management of most of the software delivery engine, folks quit. (I’m told this was a long time coming and not related to me.) My last IT guy rage-quit to a customer over email: I was CC’d on a reply that said, “I don’t know, because I don’t work here anymore.”

Luckily, we had just gone through the exercise of getting me fingerprinted, authorized, and so on for our hosting facility, because shortly thereafter a very large enterprise client called to report that a QA server we hosted for them was hung and needed to be rebooted. It was unresponsive to remote administration.

So I went to the data center.

I entered the cage.

I looked at the blinking lights.

I realized I did not know which server to reboot. Naturally, very little was labeled.

If you don’t want to find yourself in a data center staring at racks of machines and wondering which button turns off Production and which one is someone’s unauthorized Quake server, it behooves you to have a business continuity plan.

Call Bob

Another company I worked for accidentally lost the source code to a mission-critical software module. This was a very large company with thousands of employees and a lot of money. Despite one of the company’s core competencies being risk assessment, they had never turned that lens inward. Whenever something went wrong, their only recourse was to call Bob. Bob was retired, but for an eye-watering hourly rate he would come in and debug the Mainframe Assembler code that was the only way to determine why an unexpected number came out the other end of a process. Sure, the company has a lot of money, but Bob isn’t going to live forever. Maybe betting the farm that a single person in his mid-70s will always be around to save the day isn’t a great idea, but then again, we’ve already established that I’m the crazy one here.


In court, smart prosecutors will tell you they never ask a witness a question they don’t already know the answer to. I have to admit to being stumped on this one: why don’t businesses fix it? Yes, it’s easy to keep going without addressing these issues now, but the risk is phenomenal.

Countless times, I find myself sitting in planning events with the same problem: Bob and Steve are busy, and other teams are starved for work. Everyone is frustrated. My question is always the same:

What are we doing today to ensure we are not in this situation a year from now?

All solutions require investment: Is it a gigantic test automation effort, so that non-expert teams can make changes with confidence? Is it requiring product to write a comprehensive spec? Is it having other teams shadow Bob and Steve?

One thing that is certainly not the answer is putting it off to an imaginary future state where you can “catch your breath,” when there’s “time to think.” Those times never come, unless your business is in decline. Repeat: stop dreaming of a time when you’re “caught up”; it will never happen. Fixing it is difficult, and it will delay other things, but it’s better than standing in a server room staring at blinking lights and wondering which machine is the right one to reboot.