Embedding Ops members in Dev teams - my recent experience

For about 2 months I was sitting with a dev team while we worked through how to build a new service which will be continuously deployed. I wanted to share my experiences here because I’ve read both positive and negative opinions about doing this and I’m not sure there’s a single right answer. It was certainly an interesting experiment and one I may repeat in the future, with some modifications.

This started when I began attending this teams daily stand up. My goal was to get more involved with a single team in dev to get a better idea of how the process worked in general. This team was one of our two Infrastructure teams which focus on scalability, stability & performance enhancing changes. Initially this team was creating a new Web Services API service which I wrote a little bit about here.

Eventually that service was set aside and the team moved on to a new Authorization and Authentication service. For this new service the decision was made to use Continuous Deployment. We were already doing fully automated deploys at least once per week but there was a bit of a jump to giving the developers the tools they needed to deploy every commit including monitoring & deployment automation changes.

I had also noticed, leading up to this, that the few times I had sat over with the team I was immediately more involved in discussions – they asked me questions (because I was there) and I had the option of attending planning sessions. There was literally a 20 foot difference between my own desk & the desk at sat at “with them” but it made a world of difference. As such, I talked to my management about sitting with that team all the time and they agreed to try it.

Now, this team is a bit unique. The team is constructed of a handful of developers working on the code but it also is the home of the Build & Release guy as well as our Sysadmin who manages the testing infrastructure. Sitting with this team gave me an opportunity to not only be involved in the development of this new service but to also become more involved in the Build & Release process, getting familiar with the day to day problems that are dealt with as well as pairing with folks to work on our puppet configurations which are shared between dev & prod. This team structure, along with me, also made them uniquely suited to tackle the Continuous Deployment problem (at least for this service) completely within a single team.

As part of the Continuous Deployment implementation we wanted to make it as easy as possible for developers to get access to the metrics they needed. We already had splunk for log access but our monitoring system required Ops involvement to manage new metrics. So as part of this new service we also had to perform a spike on a new metric collection/trending systems – we looked at Ganglia & Graphite. We weren’t trying to tackle alerting – we just made it a requirement that any system we select be able to expose metrics to Nagios. I worked with the developers to test out a variety of ways for our application to push metrics into each of these systems while also evaluating each system for good Operational fit (ease of management, performance, scalability, etc).

Throughout this process there were also a lot of questions about how to perform deployments. How many previous builds do we keep? When and how do we rollback? What is our criteria for calling a deployment successful? How do we make sure it fails in test before it fails in production? What do we have to build into the service to allow rolling deploys to not interrupt service? The list goes on – these are all things that you should think about with any service but when the Developers are building the deployment tools they become very aware of all of this – it was awesome.

After about 45 days we had the monitoring system selected & running in production and test, we had deployments going to our testing systems and we were just starting to deploy into production. We now had to start our dark launch, sending traffic from our production system to the new service without impacting production traffic so we can see how this backend service performs, whether it is responding correctly to production traffic & generally get a better understanding of behavior with prod traffic. Today this service is still operating dark as we tweak and tune a variety of things to make sure it’s ready for production – again, it’s awesome.

60 days in things started winding down. We had been dark launched for a few weeks and largely the developers had access to everything they needed – they could look at graphs, logs, if they needed new metrics they just added them to the code and they showed up in monitoring as soon as they deployed. We got deploy lines added onto the graphs so we could correlate deployments with trends on the graph – more awesome. However my work was winding down, there were fewer and fewer Operational questions coming up and I was starting to move back toward working on other Ops projects.

As I looked back on the last 60 days working with this team I realized the same 20 feet that kept me from being involved with the development team had now kept me from being involved with the Ops team. I was really conflicted but it felt like the healthy thing to do would be to move back over into Ops now that the work was winding down. I immediately realized the impact it had as people made comments “wow, you’re back!”… seriously folks, I was 20 feet away! You shot me with nerf darts!

So now I’ve been back over in Ops for a few weeks and there has actually been a change – I’m still much more involved with that Dev team than I was from the start. They still include me in planning & they come to me when there are Operational questions or issues that come up around the service. However, that 20 feet is there again and I can’t hear all the conversations and I know there are questions that would get asked if someone didn’t have to stand up and walk over. Our Dev teams tend to do a lot of pairing and as a result aren’t often on IM and email responses are usually delayed – pairing certainly cuts down on the email checking.

Was I happy I did it? Absolutely. Would I do it again? I think I would – but I would constrain it and set expectations. I think the physical proximity to the team helped a lot to move quickly and toss ideas around while the service was being developed and decisions were being made but it did have an impact on my relationship with the Ops team that I wish I could have avoided. I think continuing to move back and forth – spending time with the Ops team would be helpful. I actually did spend my on-call weeks (every 4th week) in Ops instead of sitting with the Dev team, but I would try to find some time during the 3 weeks in-between to be over there too, it was just too much absence.

All that said, I think overall the company and the service is better for the way this turned out and for me personally it was a super insightful experience that I wish every Ops person could try sometimes.

Operation Bootstrap

Web Operations, Culture, Security & Startups.

Embedding Ops Members in Dev Teams - My Recent Experience

Comments