Operation Bootstrap

Web Operations, Culture, Security & Startups.

Observations on Dev / Ops Culture

I am and always will be a student of leadership & design. I like to see things work, but I like it more when things work a little better or a little differently than I’ve seen them work in the past. More than anything I like change, which I think of as progress when the goal is learning and improvement. When things are working well, I wonder if they could work better and I end up tinkering with them anyway. I am not a fan of the mantra “If it ain’t broke don’t fix it” because I think most things can be improved; some people just don’t see how.

I have worked for 3 SaaS companies now and all 3 have had a meaningful influence on the way I think about Operations & Engineering today. Sometimes I learn what works, sometimes I learn what doesn’t work, but I always learn. Of all the things I’ve learned though, I’m coming to view culture – that indescribable, unspoken law of the workplace – as the most meaningful measure of a company and the one that defines whether bad judgement does indeed contribute to experience and, in turn, good judgement.

For some, culture is this frilly thing that companies toss around with expensive parties, expensive computers, beer on tap, scooters flying down the halls, and so forth. Companies do this in hopes of attracting top talent by investing capital. For me though, culture has nothing to do with those things. Culture is what allows someone in Customer Support to become CTO. Culture is what allows bad decisions to be talked about and changed, or not. Culture is what keeps employees working at companies for years and years, or causes them to hop jobs every 2-3 years. Culture is the reality of the way a company operates and it cannot be manufactured.

This is a long post – if you just want my conclusion you can skip to the bottom.

Company “A”

In 2006 tech companies were just starting to recover from the dot-bomb crash. Most companies were buttoned down and while it wasn’t too hard to find a job as a Sysadmin, companies were looking for experience. “A” was an anomaly during this period. They had landed tons of funding and were ready to take over the world of mobile video. They wanted Rock Stars and they wanted a lot of them – because they needed them.

“A” hired by building a front – they had lots of money, a service nobody else had, and a team of really bright folks who wanted to do something new – video to mobile. Their culture was “go fast”, and that was about it. “A” built and deployed software on multi-month release cycles & every time they released it was a mess. They would cram as much as they could into each release, test it in environments that didn’t resemble production and then, starting at 10pm some night – we would deploy. 6-8 hours later, as the sun was beginning to come up, we would all go home… It sucked.

When production problems would arise, Ops took a first crack at fixing things. If it was someone who had been there a while they could sometimes deal with it – but largely Developers had to get involved and that could sometimes be a lengthy process. Multi-hour outages were very common at “A”.

Over the 3.5 years I spent at “A” I saw countless examples of what doesn’t work:

  • It doesn’t work to deliver service-based applications on long release cycles where the requirements literally change as you develop the software. It breaks.
  • It doesn’t work to have Dev design new applications without having any understanding of how production works. They fail.
  • It doesn’t work relying on having top-notch Ops team members to keep your service up. They can’t.
  • It doesn’t work adding more machines instead of investing in Engineering effort. The costs add up.
  • It doesn’t work documenting changes to prevent bad decisions. It just doesn’t.

Over time people began to realize that these things were wrong – but turning this ship was like navigating the Titanic. “A” had developed a culture of analysis paralysis. Because releases & production changes had gone so poorly, more “controls” were put in place – more rigor and more process. It was harder, not easier, to get changes into production. Every production change had to have a document associated with it, and those documents required approval. I’m responsible for putting some of that in place & I will never do it again. The company became focused on establishing deniability in everything you did, a culture of CYA. The company may as well have been hog-tied and hanging from a tree – it couldn’t move.

Eventually they formed a separate organization called the “CTO Office” where design decisions could be made, where Engineers could work in isolation on new ideas. The designs that came out of this organization rarely hit the mark. The team had brilliant members, but they were isolated from the realities of the day-to-day operation. They were a huge help when they came in on a production issue and had to turn around a fast fix: they rarely had tight competing deadlines & they were top-notch engineers. But when it came to designing something from the ground up it was difficult.

What did I learn?

Change control done wrong can be very bad. Long release cycles & big-bang releases are bad. Developers not knowing how production works is bad. Testers not knowing how production works is worse. Relying on Operations to compensate for code & design quality will fail. Isolating engineers who build new components is bound to fail. Money does not buy happiness or a good culture.

Company “B”

In looking for my next role I wanted the opposite – I wanted freedom and I found it at “B”. The company was small, about 30 people, and was on a 6 week release cycle. There were posters on the walls of the office reading “Be Nice, Be Scrappy, Be Creative” and that seemed to be how things worked. “B” had very different problems than “A”. At “B” the Developers were very aware of how production worked, often more so than Ops. Developers had production access & when problems arose in production (which happened much less frequently than at company “A”) the Developers were looking at them right alongside Ops.

The problems with “B” were largely Operational initially. Monitoring coverage was actually much better than at company “A”; however, monitoring was so noisy you couldn’t tell what was good and what was bad. Disks would fill because logs weren’t being rotated properly. Services would die and there were manual processes to recover them. Disks went bad just about every day in their large storage service & keeping up with those alone was a full time job. Operations spent all their time shaving yaks, never having any time to make things better.

One thing Company “B” did reasonably well was configuration management & automated deployments. Overall most things were managed via Puppet or our deployment scripts. If we took an outage, we could deploy to the entire system (>150 systems) in around 10-15 minutes. Rolling releases were more of a pain point but were certainly common and mostly automated.

As with Company “A”, Company “B” had a strong push to move fast, get features out, gain customers. While Dev seemed to focus a lot of time on building features they also did prioritize infrastructure fixes for some things. This would change over time, but it didn’t seem so bad initially. The service ran pretty well and things stayed pretty stable.

Over time I observed that company “B” had a lot of broken windows. The logs were so full of exceptions that you couldn’t tell what was a “normal error” and what was bad. Just about every application ran under a restarter script, an unwieldy Perl script that would restart the app if it died and send an email off to alert people. Often our earliest indicator that there were problems with the service was the frequency of application crashes. It became difficult to know if the service was operating correctly or not & hard to pinpoint problems when you knew there was one. Their technical debt grew with each release & Operations spent much of their time focused on break/fix efforts instead of long-term improvements.
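To make the restarter pattern concrete, here is a minimal sketch in Python. It is not the Perl script from company “B”; the function name `run_with_restarts`, the `max_restarts` limit and the `on_crash` callback (a stand-in for the alert email) are all assumptions made for illustration.

```python
import subprocess
import sys
import time
from typing import Callable, List


def run_with_restarts(
    cmd: List[str],
    max_restarts: int,
    on_crash: Callable[[int, int], None],
    backoff_seconds: float = 0.0,
) -> int:
    """Run cmd and restart it each time it exits with a non-zero status.

    Returns the number of crashes observed. on_crash(crash_number, exit_code)
    is a hypothetical hook standing in for the alert email.
    """
    crashes = 0
    while True:
        exit_code = subprocess.call(cmd)  # blocks until the process exits
        if exit_code == 0:
            return crashes  # clean exit: nothing to restart
        crashes += 1
        on_crash(crashes, exit_code)
        if crashes > max_restarts:
            return crashes  # give up rather than restart forever
        time.sleep(backoff_seconds)
```

Calling `run_with_restarts([sys.executable, "app.py"], max_restarts=5, on_crash=print)` would restart a hypothetical `app.py` up to five times. A wrapper like this keeps the service up, but it also hides the crash itself, which is how crash frequency ends up being the earliest signal that something is wrong.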

Company “B” was also growing subscribers at a very fast pace and wasn’t optimizing their software as fast as they were gaining customers. Provisioning hardware fast enough became a real problem & keeping up with database sharding and storage growth was also a big problem. They were trying, but there was still a heavy focus on new features & growing more customers.

The culture of company “B” would have allowed for a pivot to address these issues if only funding & want for features had allowed for it. Their product had arguably more features than any other in the marketplace and they continued to add more. They were obsessed with building what everyone else had and more. Yet they still weren’t #1; there were other competitors who beat them on the basis of subscription counts & on familiarity in the market. I had to explain to people what my company did by asking if they had heard of our competitor – they always had. “We do what they do”, I would say. I always felt crappy saying that.

What did I learn?

Configuration management & deployment automation are awesome – it was here that I decided these were key capabilities. “Broken windows syndrome” is real and has a dramatic effect on an organization’s ability to address problems. Prioritizing infrastructure fixes & reduction of technical debt is as important as being aware of problems. Developer ownership of functionality of code in production is critical – knowing it works in testing is irrelevant if it doesn’t work in prod. Monitoring & trending & analytics are fundamental to understanding what your system is doing & an opportunity many companies miss.

Company “C”

I have only been at Company “C” for 5 months, but I have been taken to school. When I went out looking for something after Company “B” I was looking for a place that understood what it meant to run a service – somewhere I could stop ringing the bell of continuous delivery and get down to the business of actually doing it. While this company isn’t perfect and I’m sure I’ll have more learnings after a few years, what I’ve observed so far has changed my view on many things.

Company “C” has some key cultural aspects that make it what it is:

  • Decisions are driven through structured collaboration and consensus. Individuals have the ability to drive decisions, but there is a strong culture of sharing, collaborating and adjusting.
  • Individuals are encouraged to find their own place in the company. You are hired for a loosely defined role in the company and there’s an expectation that you will fill that role until someone else does, but you are encouraged to find your passion and put your energy into that area of the company.
  • Leaders of the company are there to help support team decisions. Like any company, there is some steering that comes from management but it is most often done in a shared and collaborative way.
  • They focus more heavily on hiring for personality fit than on technical skill.

This core culture makes the company what it is and has led to some interesting responses to issues. When the service began to fall over and had significant availability issues they formed an architecture council, a group of individuals who are passionate about how the service is built. This group includes Engineering, Operations & Product members (about 20 people total). Any significant change to the service is presented to this group and discussed before being built. Major problems that require architectural change are raised to this group, discussed and prioritized.

Like other things at this company, this group was probably formed by sending an email asking for volunteers who are passionate about fixing these things.

The other thing this company does is make decisions based on data. They have no fewer than 3 people focused entirely on the infrastructure and analysis of test & production metrics. This includes load testing & performance testing environments, logging of metrics for every request made in production & in testing, and detailed analytics down to the amount of DB time, CPU time, heap, etc. of each request made to the system. If there is an outage, this group can typically pinpoint the # of people impacted and who they are. If there is a customer misbehaving, they can typically find them – fast. If there are performance problems they can describe them very precisely and usually tie those problems to a specific commit. This makes conversations around performance tuning much easier.
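As a rough illustration of per-request metrics, the sketch below records wall-clock time, CPU time and a caller-supplied DB time for each request, tagged with the user and the deployed commit. Everything named here (`record_request`, `REQUEST_LOG`, `slow_users`, the field names) is hypothetical; company “C”’s actual tooling is not described in that level of detail.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List, Set

# In-memory stand-in for a real metrics store (an assumption for this sketch).
REQUEST_LOG: List[Dict[str, object]] = []


@contextmanager
def record_request(request_id: str, user: str, commit: str) -> Iterator[Dict[str, float]]:
    """Record timings for one request; the caller adds DB time to the yielded dict."""
    extra: Dict[str, float] = {"db_seconds": 0.0}
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        yield extra
    finally:
        # Always log, even if the request handler raised.
        REQUEST_LOG.append({
            "request_id": request_id,
            "user": user,
            "commit": commit,
            "wall_seconds": time.perf_counter() - wall_start,
            "cpu_seconds": time.process_time() - cpu_start,
            **extra,
        })


def slow_users(log: List[Dict[str, object]], max_wall_seconds: float) -> Set[str]:
    """Return the users who had at least one request slower than the threshold."""
    return {str(r["user"]) for r in log if float(r["wall_seconds"]) > max_wall_seconds}
```

A handler would wrap its work in `with record_request("r1", "alice", "abc123") as timings:` and add to `timings["db_seconds"]` around each query. With the user and commit stored on every record, questions like “who was affected?” and “which commit made this slower?” become simple filters over the log.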

I have observed a consistent strategy to try things and be willing to change them when they aren’t working. Many companies try new things with the argument that they can change if they aren’t working but few companies have the structured process to make sure that change happens. This company does. It requires discipline.

Not coincidentally, this company also shares their experience with their customers and uses their own organization as an experimentation ground for new ideas.

As a result of all of this – despite having the smallest infrastructure of any company I’ve ever worked for they:

  • Use feature flags for every new feature that goes out, often even for changes to existing features.
  • Release weekly, the fastest pace of any of the above companies.
  • Have fully automated & unattended (scheduled) deployments.
  • Have the highest availability of any company I have worked for.
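For readers who haven’t used feature flags, here is a minimal sketch of the idea, assuming a simple on/off switch plus a percentage rollout. The `FeatureFlag` class and its fields are hypothetical and are not the company’s actual implementation.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureFlag:
    name: str
    enabled: bool = False       # master switch; when off, nobody sees the feature
    rollout_percent: int = 100  # share of users who see it when enabled

    def is_on_for(self, user_id: str) -> bool:
        """Decide deterministically whether this user gets the feature."""
        if not self.enabled:
            return False
        # Hash flag name + user so each user lands in a stable bucket from 0 to 99.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < self.rollout_percent
```

Code for a new feature ships dark behind something like `if flag.is_on_for(user_id):`, so deploying and releasing become separate decisions. That separation is a large part of what makes weekly, unattended deployments tolerable: a bad feature can be switched off without rolling back the release.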

What am I learning?

I’m learning that allowing people to do their job is more valuable than telling them how to do it & that allowing them to follow their passion produces incredible results. That collaboration done right can be immensely valuable and while it doesn’t allow rock stars to shine as brightly – it generates much more consistent results and overall has a more positive effect. I’m learning that the companies I worked for previously did collaboration wrong, and that doing it right requires discipline and training. I’m learning that building a great culture has everything to do with the leadership in the company & very little to do with the product or funding.

So what does all this mean?

I’ve come to believe that every company has skilled team members. Every company has challenges growing the business and scaling their systems. Every company comes up with clever ways to solve problems. What separates the companies I have worked for is that some understand that they can’t know everything and have built a structured process to make sure change can happen as new things are learned. The others seem to believe that change will happen organically – that important issues will get fixed if they are important enough. This doesn’t happen though because the business doesn’t allow teams to choose to prioritize those important things. They don’t allow pride of ownership.

I have also observed that adding process needs to be resisted and questioned at every opportunity. Process is good when it makes decisions better, when it makes the organization more effective. Process is bad when it stands in the way of effectiveness, when it stifles agility & when it causes people to avoid good decisions because they are too much work.

I have learned that broken windows syndrome is real and in technical companies it takes a lot of work to keep those windows fixed. Knowing when problems are real and not “normal problems” is important. If something isn’t important enough to build it right then maybe it isn’t important enough in the first place. Leaving broken windows in place means you are ok with a lower quality product and the bar will only move lower – it will never go higher.

The single greatest tool I’ve seen to avoid these issues is to empower individuals to gather teams and act. Encourage them to collaborate and create opportunities for teams to come together and share experiences. Retrospectives are excellent for this. Have Operations go to Engineering meetings, have Engineering come to Operations meetings. Have everyone share their experiences – good and bad. You will spend a lot of time in meetings and it will feel less efficient – because it is. Efficiency is being traded for effectiveness. When you do act, you will know better why you are doing it and what the expected result is.

There’s not much point in efficiency if you are doing things poorly. Cranking out broken things faster makes no sense.

Lastly, I’ve learned that the book “Good to Great” by Jim Collins is right. Period.