I’m convinced Humpty Dumpty is a story of DevOps gone wrong.
Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king's horses and all the king's men
Couldn't put Humpty together again.
First, who asks a horse to do surgery? Hoofs can’t hold scalpels. Second, either the king’s men are inept or they’re not communicating. Two kindergarteners with some Elmer’s could have done the job.
You see, Humpty is a deploy. He was fine in staging but shit the bed in production. Now the site’s down and your boss is threatening everyone’s jobs. IT is saying the code is broken. The developers are saying it’s a server issue.
Meanwhile, Humpty is bleeding out. And your customers are complaining on Twitter. Which means a customer service rep has entered the #incident channel to tell you the site’s down. Yeah, no shit, Tom.
We’ve all been there. A deploy goes awry and the entire department is up in arms, defending themselves and blaming each other.
All The King’s Horses and All The King’s Men
So there we are. In chaos. The site’s down. The boss is pissed.
I don’t know about you, but I always think best when my boss has morphed into Al Capone and I’m staring down the barrel of a metaphorical Tommy Gun.
At first glance, you might think we’ve assembled the best team to handle the crisis. There are engineers who know the code inside and out and an ops team that can handle any systems fire.
Yet, it doesn’t work out that way. It always devolves into a blame game. You know the scene.
Developers say…
- “It’s a server issue.”
- “My code worked in staging.”
- “A configuration must have changed.”
Operations people say…
- “Has to be a code change from the last deploy.”
- “What was just deployed?...”
- “Why are we the only ones that know what is broken?...”
All the king’s horses and all the king’s men aren’t working together.
Let’s Stop Fighting
Ok, first, I’m not going to address all-out fist fights. Because honestly, if your department hosts a weekly version of Fight Club, you should really change jobs.
I’m talking more about what could be described as friction, attitude, or a general inability to tolerate each other without eye rolls and audible sighs. What I like to call good ol’ Southern-style passive aggressiveness.
I’ll give you an example.
I recently had some mild conflict with one of our DevOps guys. A day after a deploy, a feature on one of our sites — the ability for admins to upload new photos — wasn’t working. The user didn’t receive an error message after uploading. The new photo just didn’t show up.
A project manager (PM) messaged me and an ops guy.
I had 20 minutes before another meeting — why are there so many meetings?! — so I felt a little pressure to locate the issue quickly. I should have recognized that I was on edge and ill prepared to deal with the situation at that moment.
But I didn’t. I’m human.
To make matters worse, an eerily similar issue had come up during testing in QA. During that code hunt, the ops team lovingly implied that it was my problem. And after an hour or two of log reading and double-checking my work, I discovered it was in fact an ops issue.
Not that it isn’t my fault sometimes. I make plenty of mistakes. And then I obsess about them for months...
Fast-forward two weeks and here we are again. So I’m primed and have plenty of attitude. My bad.
I holler across the room to see what the logs say.
“I don’t know.”
Um, wanna go look it up?!
OK, I didn’t actually say that. But I’m 80% sure I got the message across with my eyes.
So I track down the production logs while coordinating with the PM so I don’t have to test the issue in production.
Minutes pass and now I’m in my meeting trying to do both. Because multi-tasking is proven to be so effective.
The production logs have nothing but 200s and everything looks good.
Finally, the ops guy checks the S3 logs. Surprise, surprise. The image is there. Pff! Not my fault. (My inner dialogue may or may not be an eight-year-old.)
Yep, you guessed it. Another ops issue.
Now it’s not that I think operations issues are easy. They scare the shit out of me. But I get a little huffy puffy when I’m constantly met with “it must be the code.” And I’m sure it’s beyond irritating that ops teams constantly get “it must be a server issue.”
Which brings me to my core point: we need to work together, guys.
Change is hard. I dislike it as much as the next person. But I think this cultural shift is worth the struggle.
By now you’ve probably heard something about DevOps. It’s all the rage these days.
But if you aren’t an expert in what exactly the term DevOps means, here’s a quick history.
The idea traces back to Patrick Debois and Andrew Clay Shafer, who met at a conference in 2008. Hilariously, Andrew had planned to lead a session on agile infrastructure at the event, but received such negative feedback that he decided to skip his own session. Patrick showed up anyway and tracked him down afterward. The name itself caught on in 2009, when Patrick organized the first DevOpsDays. (TIL don’t give up on ideas just because you get a poor response.)
John Allspaw and Paul Hammond joined the #devops conversation with a talk called 10+ Deploys per Day: Dev and Ops Cooperation at Flickr. The talk is 40 minutes but very much worth your time. Just play it during dinner tonight. Your kids are gonna love it. Promise.
Since then, DevOps has become a term that encompasses a company culture where developers and operations people work together.
Before we continue toward DevOps nirvana, it’s important to recognize your development team has a fundamentally different priority than your operations team.
Like it or not, developers are measured by the number of features they release. No CEO has ever cracked open code to review your thorough test suite or pondered the glorious variable name you picked out. (I appreciate it, though. So you have that going for you.)
If all of us decided to tackle our growing mound of tech debt this month instead of working on the latest and greatest idea your sales team came up with, you better believe we’d be hauled into someone’s office and chided.
But operations people are measured on an entirely different aspect of the business: site reliability and uptime. And you better believe keeping a site up 99.999% of the time is no easy feat.
I’ll spare you the math. That’s a little over 5 minutes of downtime per year. FIVE. MINUTES. PER. YEAR.
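For anyone who does want the math, here is a minimal Python sketch of the calculation. It assumes a 365-day year, and the function name `allowed_downtime_minutes` is my own invention, not part of any library.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a non-leap year


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Return the downtime budget, in minutes per year, for an uptime target."""
    downtime_fraction = (100.0 - uptime_percent) / 100.0
    return MINUTES_PER_YEAR * downtime_fraction


if __name__ == "__main__":
    # Compare a few common uptime targets side by side.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows {minutes:.2f} minutes of downtime per year")
```

Five nines works out to roughly 5.26 minutes. Each nine you drop multiplies the budget by ten, which is why the target you pick matters so much.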
So, to break this down, developers must deploy new code to release new features. But deploys are the most frequent cause of downtime.
No wonder we’re natural enemies.
Be The Change
What we need is operations teams that think like developers and developers that think like operations people.
It’s not easy. But it is simple.
Operations: Empower Your Developers
Trust your team
You’re on the same side. If a developer says the code works, trust them. They’re not lying to you. And they don’t want to make your life a living hell. They honestly believe the code works. Which brings me to...
Give read-only access to all developers
To what? To EVERYTHING. I’m not saying to hand out root access like candy. But you are not the gatekeeper of information. Do you like being interrupted every 5 minutes so you can copy and paste an error message? I didn’t think so.
Developers are writing the code that runs on your systems. It’s not a reach to think they should be able to get some feedback about whether it works. And you can’t expect developers to jump in and help when they don’t have access to your machines.
Create consistent platforms
Integrated platforms are easier to develop and support. Pay attention to the parity between environments. Staging and production should be identical. That means the same allocated resources and the same data. Otherwise deploying will always be a roll of the dice.
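One way to keep an eye on parity is to diff the settings of the two environments. The following is a minimal sketch, assuming each environment’s configuration can be loaded into a flat dictionary. The function name, keys and values are all invented for illustration.

```python
def parity_drift(staging: dict, production: dict) -> dict:
    """Return {key: (staging_value, production_value)} for every setting that differs.

    A key missing from one environment is reported with None on that side.
    """
    drift = {}
    for key in sorted(staging.keys() | production.keys()):
        staging_value = staging.get(key)
        production_value = production.get(key)
        if staging_value != production_value:
            drift[key] = (staging_value, production_value)
    return drift


if __name__ == "__main__":
    # Hypothetical settings; real values would come from your own config tooling.
    staging = {"instance_type": "small", "workers": 2, "runtime_version": "2.3"}
    production = {"instance_type": "large", "workers": 8, "runtime_version": "2.3", "cdn": "enabled"}
    for key, (stg, prod) in parity_drift(staging, production).items():
        print(f"{key}: staging={stg!r} production={prod!r}")
```

An empty result means the two environments agree on every setting you fed in. It says nothing about settings you forgot to include, so treat it as a smoke test rather than proof of parity.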
Share source control
Keep your configuration tools on GitHub with the rest of your company’s code. Code is code. It’ll be much easier for operations and developers to solve problems together if everyone knows how to locate the affected code.
Add your devs to the on-call rotation
My friend likes to say, “You build it, you support it.” No one likes to be woken up at 2:00 a.m. And if you’re tired of stumbling through the dark toward your computer in the middle of the night, share the pain. There’s no reason developers shouldn’t be on rotation. Remember, they can access logs and view your configuration tools now. Awesome!
Simplify deploys
Pushing code to production should not be a production. Unnecessary steps increase the opportunity for error and decrease the number of people who can deploy.
Oh, one more thing. Stop preventing developers from deploying their code to the QA and staging environments. Seriously. If I have to ask permission to test my shit anywhere other than dev, you deserve to put out the fire.
Developers: Stop Being Assholes
Make operations part of the planning process
Thinking about a feature? Include operations. Talk about what will change before you write a single line of code. Discuss why this feature is important, who will need to be involved and what the risks are. You can’t deploy mystery code and then get irritated with your operations team when they start asking 100 questions.
Make small changes. Deploy. Repeat.
If your feature requires you to change 30% of your app’s spaghetti code, break the feature into smaller pieces. Not sure if your feature is too big? Apply the same rule you use for method naming. If the method needs an “and” it’s doing too much. Small deploys make it MUCH easier to determine what went wrong in case of failure.
Announce your deploys
You know how you already included operations in your feature planning? Notify the operations team when you deploy too. Whether you use Slack or HipChat, make sure all developers and operations people have a single place to communicate. Lots of companies use an #incident channel. Find what works for you and then use it.
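A deploy announcement can be automated so nobody has to remember to post it. Here is a minimal sketch, assuming a chat tool with Slack-style incoming webhooks, which accept a JSON payload with a `text` field. The function names, the message format and the webhook URL are placeholders I made up for illustration.

```python
import json
import urllib.request


def build_deploy_message(app: str, version: str, deployer: str, summary: str) -> dict:
    """Build the JSON payload for a deploy announcement."""
    text = f"{deployer} is deploying {app} {version}: {summary}"
    return {"text": text}


def announce_deploy(webhook_url: str, payload: dict) -> int:
    """POST the payload to the chat webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    payload = build_deploy_message(
        app="photo-admin",
        version="v1.4.2",
        deployer="alice",
        summary="fixes admin photo uploads",
    )
    print(json.dumps(payload))
    # Placeholder URL: swap in your team's real webhook before uncommenting.
    # announce_deploy("https://example.com/your-webhook-url", payload)
```

Calling this from the last step of your deploy script means the announcement goes out every time, even at 2:00 a.m.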
Say “yes, and…”
There’s a rule in improv that forces participants to say “yes, and…” rather than “yes, but…” Try this next time you’re in a meeting and the results will likely surprise you. That simple language change will make everyone feel heard, validated and a part of the team.
Be open to other options
If someone on operations says there’s going to be a problem, listen to them. That means shutting your mouth and really hearing what they have to say. You’re an engineer, not God. The core competency of operations is site reliability. Let them help you. The solution you come to together will be much better than the one you thought of on your own.
Have some humility
If someone was woken up in the middle of the night because of something you released, say sorry. Buy some coffee. Help ’em out. Own your shit. When you take responsibility for a mistake, your colleagues are much less likely to make a voodoo doll of you and keep it by their bed.
Practice Failing Together
Failure is never a question of if, but when.
You will fail. A deploy will bring down the site. A typo in your configuration will bring users to Twitter fisticuffs.
It happens. We’re human. And until Skynet, we’re all stuck dealing with our occasional mistakes.
Have a healthy attitude around failure
You need to 80/20 your failure preparedness procedures. It’s okay to spend 80% of your time trying to prevent failure, but devote at least 20% to practicing how you will handle failure when it happens. We all half ignore the safety talk given at the start of every flight, but I appreciate that oxygen falls from the ceiling in the event Bane decides to crash my plane.
Stop pointing fingers
Avoid blame. It never feels good to make a mistake. And when 20 people are required to rectify it, it feels even worse. When I screw up, I’m embarrassed. And if I feel attacked, I become defensive. I think most of you would probably say the same. Let’s give everyone a little slack. It could have been your typo.
Leave your egos at the door
When I started powerlifting seriously, I joined a small team of intimidating lifters. The head of the group — a 60-year-old Juggernaut-like man whose traps rose to just under his ears — had one rule: leave your ego at the door. It didn’t matter that we had to strip off 400 pounds every time it was my turn to squat. All that mattered was that I listened, learned and respected the team. We could all learn a lot from that.
Here’s a short list of books you may be interested in:
- The Phoenix Project by Gene Kim, Kevin Behr and George Spafford
- Implementing Lean Software Development by Mary and Tom Poppendieck (I love husband + wife teams!)
- The Lean Startup by Eric Ries
- Web Operations by John Allspaw and Jesse Robbins
- Continuous Delivery by Jez Humble and David Farley
- The Goal by Dr. Eliyahu M. Goldratt
- The Field Guide to Understanding ‘Human Error’ by Sidney Dekker
And no, I haven’t read all those books. I’m convinced people lie about how many books they’ve read. Or I watch too much Netflix. Don’t judge me.