Operation Bootstrap

Web Operations, Culture, Security & Startups.

Keep Decisions Low Cost So They Are Easier to Make

Which question is easier to answer?

1: Select the storage platform which will serve all our needs for the next 3 years up to 10PB.

2: Select a standard disk drive to use in the next 5 servers we purchase.

Most people could answer #2 pretty easily with a little bit of information. It might take a day or two to figure this out. And if you get it wrong, what’s the worst thing that happens? You’re replacing 10 disks or so. #1 though, that’s harder – that requires knowing what you are going to be doing for the next 3 years, how much money you can spend, how much flexibility you are going to need, etc.

The cost of being wrong for question #1 could be millions of dollars.

The cost of being wrong for question #2 probably means a few thousand dollars.

You don’t always have the luxury of changing the question, but when faced with a decision that seems to be hard to make, look for ways to make it easier. One of the ways to make a question easier to answer is to lower the cost of being wrong. For question #1, how can you answer that question for the next year without excluding the possibility of expanding to 3 years? How can you get more data about your needs by starting small, gathering data, building understanding? How can you constrain the scope, not solve all your problems but solve the most pressing ones?

Try not to solve all your problems at once because there’s a good chance you aren’t going to succeed. Pick a problem, and do the simplest thing that works to fix that problem.

Work-life Balance Is Personal

I was once (or three times) told that I was working when I should not. Usually this happens in response to a 10pm email, or an email on the weekend, or three of them. The thing is, that’s an important part of my balance and if I don’t do it I get thrown off.

I like the idea that companies try to embrace work-life balance, the ideal that says you should have a life outside of work. What I do not like, however, is that anyone thinks they know what time of day I should be working. Just because someone 50 years ago said that 40 hours a week is ‘normal’ doesn’t mean it’s appropriate for everyone. I get ideas at 10pm & I want to share them while I’m inspired.

Inspiration & opportunity come my way when they do, and when they come I have to decide if I want to make my move and get something done or wait until I’m “on the clock”. Doing the latter almost always means not doing it at all. Inspiration, after all, expires rather quickly.

What if work-life balance meant being honest with yourself and your employer? What if it meant that when there was an opportunity, you jumped on it, and for doing that you were given the flexibility to take that time back when there were personal opportunities outside of work? A number of companies do this already; it’s just informal and largely based on trust. The trouble is that when one individual abuses that trust, the correction gets applied to the whole team. Usually that correction comes in the form of “core hours” or some other lame blanket of consistency placed on the many because of problems with the few.

Part of the reason individuals see a problem in my work habits is because they feel pressure to conform to their interpretation of my actions. If I send a note at 10pm, that doesn’t mean that I am working 8 hours during the day + 2 hours at night. Some days it might mean I worked 10 hours, other days I might put in 6. The problem is that people feel pressure to do the same. They see someone else working “overtime” and feel like they need to. But they don’t know what I’ve been doing – they just see me working off-hours and assume it’s “overtime”. This turns into the ultra-competitive work-all-day-and-night environment that some companies have. That’s not what I’m talking about. That’s not what I do.

I have the good fortune of having work that overlaps with my passion, which is often my hobby. My hobbies contribute to my work, and vice versa. This blurs the distinction between work & life. My wife struggles with this – “I can’t tell if you are working or not when you are on the computer”. To which I ask “Why does it matter?”. Whether I’m working for my own goals, or working to get paid, the parameters are really the same. She doesn’t like that answer because she doesn’t want to interrupt me when I’m working, but if I’m not working it’s ok. The truth is, it’s always ok – and I’ll tell her (nicely, I hope) if it isn’t.

Maybe if we were more honest about how we worked & more open about what we expect from each other it would be easier to work this way. If we managed more based on achievement than hours or individual tasks. Maybe then we could all work better.

Are You Making Blameless, Data Driven Decisions?

On this day of emotion & emotional decisions, this post is about not making decisions using emotions.

When I hear people use words like “could be” or “should be” I start to wonder if there’s enough data on the table to make a good decision. It’s true that sometimes you just don’t have the answer, so you have to ask more questions. How can we learn more to make this decision easier to make? How can we make this decision easier to change if we get it wrong?

Decisions made in the absence of data have another problem: the only collateral is personal responsibility. You are trusting someone’s gut, or someone’s experience, or someone’s opinion. When things don’t go right, you blame that someone.

If you are making a decision that is hard to change, it should be made when the answers to questions are “it is…” or “it is not…”. You use those words because you have confidence: you have tested & have data to back it up. You should be saying “We know” and not saying “We think”. Saying you know without data is just being dishonest. I don’t care if you’ve done this before, the people in the room have only your credibility to trust. Solutions don’t work in different environments for a variety of reasons, but if your credibility was the basis for a decision it will be the only thing that gets blamed.

Try to make decisions easy to change.

Try to make decisions based on fact, not opinion.

Try to make decisions that do not rely on your credibility.

If You Aren’t Using Feature Toggles, Start… Now

Search for ‘feature toggle’ in Google and check out the results. The simple fact is that branching using a revision control system still has its place, but its place is not controlling when you release a feature to your customers. Feature toggles create a distinction between deploying your feature & making that feature available for use. They also remove the requirement that to disable a feature, or to go back to ‘old behavior’, you have to roll back your deployment to an older version of code. There are lots of other benefits too, as well as some challenges.

Bottom line though, if you aren’t using these you need to really seriously consider whether they would be a benefit. If you control your software release & you operate a multi-tenant system, and you want to increase the amount of control you have around the features you release, you need to be using these.
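To make the deploy-versus-release distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `FEATURE_TOGGLES` store, the `is_enabled` helper, the `new_checkout` feature and the tenant names are all hypothetical, and a real system would more likely keep its toggles in a config service or database so they can change without a deploy.

```python
# Hypothetical in-memory toggle store. Both code paths below are deployed;
# the toggle decides which one a given tenant actually sees.
FEATURE_TOGGLES = {
    # "tenants": None would mean "on for every tenant".
    "new_checkout": {"enabled": True, "tenants": {"acme"}},
}


def is_enabled(feature, tenant, toggles=FEATURE_TOGGLES):
    """Return True if the feature is switched on for this tenant."""
    rule = toggles.get(feature)
    if rule is None or not rule["enabled"]:
        return False  # unknown or disabled features fall back to old behavior
    allowed = rule["tenants"]
    return allowed is None or tenant in allowed


def checkout(tenant):
    """Pick the new or old code path at request time."""
    if is_enabled("new_checkout", tenant):
        return "new checkout flow"
    return "old checkout flow"
```

In this sketch, setting `FEATURE_TOGGLES["new_checkout"]["enabled"] = False` sends every tenant back to the old flow with no rollback of the deployed code, which is the whole point.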

Here are some related blog posts:

http://code.flickr.com/blog/2009/12/02/flipping-out/

http://codeascraft.etsy.com/2010/05/20/quantum-of-deployment/

http://www.rallydev.com/engblog/2011/12/09/the-awesomeness-of-feature-toggles/

http://www.rallydev.com/engblog/2011/12/12/the-best-part-of-feature-toggles-removing-them/

http://www.rallydev.com/engblog/2011/12/14/testing-feature-toggled-code/

http://www.rallydev.com/engblog/2011/12/20/feature-toggles-branching-in-code/

Establishing Ownership in Ops Teams

I’ve been having some discussions about this lately so figured I would write something about the topic. Being a member of an Ops team can be pretty challenging at times. The job can be high pressure and often it feels like you spend all your time fighting fires, shaving yaks, etc. One of the difficult parts of being in Ops is that it’s often hard to put your mark on things, to use your skills to leave a lasting impression.

The reason it’s hard to leave a mark isn’t because there’s a lack of work, but because the work changes so frequently that influencing the long term outcome of a project can be hard. This can often be even more difficult in Operations teams following Agile methodologies because the work is broken into smaller stories and those stories may get worked by multiple folks. Even within these teams though, there are individuals with skills in certain areas and often there is more than one person with passion for a particular topic. Someone who’s passionate about a topic is more likely to do a great job, in my experience, and so we should see how we can leverage it.

Roles and Responsibilities Matrix

One successful tool I was shown was a Roles and Responsibilities matrix. The goal of this is to establish some basic ownership of components within an infrastructure so that individuals can focus their work. This often happens naturally in teams, but doing this formally accomplishes a few important goals:

  • Allows individuals with no experience, but with an interest, to raise their hand and work with new things.
  • Allows the team to agree on who is responsible for what infrastructure pieces. This is not sole ownership, but more about establishing expertise & creating less contention over decisions.
  • Helps you, as the manager, formalize who to work with on specific issues.

The matrix is pretty simple: for each component (you can partition this however you want) you define two roles, a “P1” and a “P2”. These are the primary and secondary points of contact for that component. But there’s more to this than just having a primary and secondary:

  • P1: This person is the current “non-expert”, the trainee. All escalations for this component should go to them first. If they don’t know the answer it’s their responsibility to work with the P2 & resolve the issue. In this process, they learn.
  • P2: This person is the current expert, the trainer. They understand that they are P2 and are to work with the P1 on issues where they need help.

I have also observed this setup where there’s only a P1 and they are the expert because there just aren’t enough folks to have a P1/P2 for that component (or it’s not a priority). Another reason for the P1 to be the expert is if the system is going through a lot of changes and you want someone to keep tight reins on what changes are made.

Here is what an example matrix might look like
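As a stand-in, here is a minimal sketch of such a matrix in Python. The component names and the people (Alice, Bob, Carol) are made up for illustration, as are the `escalation_path` and `role_counts` helpers; this is not the matrix from any real team.

```python
from collections import Counter

# Hypothetical components and people, purely for illustration.
MATRIX = {
    "Monitoring": {"P1": "Alice", "P2": "Bob"},
    "Networking": {"P1": "Bob", "P2": "Carol"},
    "Storage": {"P1": "Carol", "P2": "Alice"},
}


def escalation_path(component, matrix=MATRIX):
    """Return who to contact, in order: P1 (trainee) first, then P2 (expert)."""
    roles = matrix[component]
    return [person for person in (roles.get("P1"), roles.get("P2")) if person]


def role_counts(matrix=MATRIX):
    """Count how many components each person holds as P1 and as P2."""
    counts = {"P1": Counter(), "P2": Counter()}
    for roles in matrix.values():
        for role, person in roles.items():
            if person:
                counts[role][person] += 1
    return counts


# Print the matrix as a simple table, one component per row.
for component, roles in MATRIX.items():
    print(f"{component:<12} P1: {roles['P1']:<6} P2: {roles['P2']}")
```

`escalation_path` encodes the rule that escalations go to the P1 first and reach the P2 only through them, and `role_counts` is a quick way to check how evenly the roles are spread.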

Looking over this, each person is a P1 for one component & a P2 for some other. In a perfect world it works out like this, but the world ain’t perfect. Do your best with what you have – but try to set up something like this.

This is usually established during a meeting every quarter or every 6 months. You walk through the list of functional areas and ask for volunteers. This more often than not ends with very little contest, but in the event where there are concerns about who is P1 or P2 you should try to understand why it’s important to each person to have a role in this, what they want to accomplish, and consider what other areas they also want to accomplish things in. Often, after discussing their vision for this component along with the other stuff they are working on, it’ll become more self-evident who is the best P1 & you can get agreement.

Defining cross-functional areas

The matrix above works well, but the first question from folks is usually something like “what about monitoring, if I own that does it mean I have to do all that work for everyone else?”. The answer is “no” in most cases. There are some functional areas which are pretty clear & mostly self contained but there are others which cut across all the other areas. Examples where something intersects with everything else are Monitoring, Networking, Configuration Management and sometimes things like Storage, depending on your architecture.

For areas where your area of expertise is a dependency for others there needs to be shared ownership of those tasks. I generally look at it this way, using Monitoring as an example:

  • The P1 is responsible for overall architecture & infrastructure, training, documentation & escalations for that system. They are responsible for enabling the other team members to use the system effectively & for bringing any major changes to the team for review & consensus.
  • The P1 owners of other components are responsible for integrating their systems with monitoring, for writing any monitors, and for establishing meaningful metrics & thresholds for that system.
  • Both P1 owners work together to make sure any monitoring / metrics are done in a consistent way that is in line with what the team has agreed is the architecture.

In this way you are avoiding making the monitoring owner’s job suck by having to spend all day writing monitors for a million different components, but they have ownership of the overall success of the monitoring infrastructure. Individuals who own other components are making decisions about how best to monitor their own systems within the constraints of the best practices for the monitoring system & they can work with the monitoring owner if they want to break new ground on doing things a different way.

Working outside of Operations

One of the most important roles Operations plays (in my opinion) is in working with Development as closely as possible. This is becoming more and more obvious and more teams are starting to give it names, like DevOps. Some Ops folks are better at this than others and some will go out and find Developers to work with and others need to be prodded a bit.

Defining clear roles for individuals in Ops is a good way to force this collaboration. By assigning one Ops person to an upcoming Dev project & setting clear expectations around that role, you help foster their involvement and empower them to start working with other teams. That Dev team becomes a functional area, and they get a P1 & P2 like any other component.

What I would typically advocate for smaller Dev organizations is integrating one Ops person per Dev team if you can. This means that Ops person attends stand-ups, they go to planning meetings, and they are familiar with all the stuff that Dev team is working on. Should there be a need for Ops related work (or communication, which is always needed), the assigned Ops team member is responsible for that role. They aren’t necessarily responsible for all of the work but they are responsible for making sure the work is communicated & making sure it gets done.

Another approach is to assign Ops team members to individual projects. As projects arise, team members start to attend those meetings & start to get involved with any stand-ups and work around that project. I don’t like this approach as much because it relies on the Dev teams reaching out and saying “Ok, we’re ready for an Ops person now” most of the time – and that often happens late. Having Ops members already in position inside teams gives you much earlier warning and helps shape the end result much earlier.

Tracking & Communicating work

Now that everyone is working on their own projects, there will be a tendency to communicate that work less often & less completely. It takes some work to avoid this but it’s actually not all that hard. The important aspect of this is that each team member is talking about what they worked on each day at stand up & are being clear about their priorities during planning sessions. How you achieve this is up to you – but I’ll throw out some ideas.

Kanban works well as a visualization tool for work in progress. From an Ops perspective, I think that’s where the role of Kanban starts and ends. Operations is an inherently interrupt driven team and while many organizations get out of that mode through lots of practice – if you are at that point you probably don’t need my help in tracking & communicating work. Where I have seen Kanban work really well is in prioritizing work during planning (abc must come before xyz, move the card) and in visually showing what you did, what you are doing, and what you will be doing next.

Daily stand-ups are really, really helpful. Things change day to day in Ops teams and taking 10 minutes each morning to get everyone in sync with what’s going on is a huge help. Identifying blocks and talking about how to clear those is a big part of this. When everyone is there talking about things, saying “I’m blocked waiting for xyz” is an opportunity to get that problem solved today.

Also documenting proposals using a shared document system like Google Docs is a massive improvement. I can write up a proposal for something and instead of asking for feedback, people can add it right to the document – they can make comments, etc. We get together for a 30-60 minute meeting to review the document & the feedback and we take a shot at a final proposal. If there are still open questions we go back and answer those. The key is that much of the work is done asynchronously rather than asking that everyone bring their best, most un-distracted thoughts, into a meeting.

Rotating roles

Lastly, with all of this, there is change. Nobody wants to be stuck in the same role for years – people in Operations want to learn new things, they want an opportunity to take something that needs improvement and leave their mark on it. In every infrastructure there are some cool projects and there are some lame projects. There are also those parts of the system that are just a pain in the ass to maintain & nobody wants to do it. It’s important to rotate these around.

What has worked in my experience is a periodic review of the priorities. You start with a review of work in progress so that folks know what they are signing up for if they want to tackle an area they aren’t working in today. Then you wipe the slate clean & go functional area by functional area asking who wants to be involved.

The trick with this process is to try to allow folks who have projects in flight to maintain that responsibility while giving someone else a shot at learning about the system. This is where the P1/P2 roles can really be leveraged. If you are re-building your network and you really need the same guy to maintain his momentum in that project – he becomes the P2, continuing that work. You assign a new P1 (if someone new wants to be involved) and you have them tackle the day to day interrupts. The two members work together on it and the new person gets to learn while the old one gets to finish their project.

If a functional area has no work in progress and you really want to move something new forward there, find the person who’s passionate about making that change and make them a P1. Find a P2 that can help enable them and let them go for it.

Wrapping up

Ownership is an important part of any job and in Operations it has been the light that keeps me coming back. Giving that ability to every member of your team is important, and hopefully this gives you some ideas about how to do that.