Operation Bootstrap

Web Operations, Culture, Security & Startups.

Monitoring Is Pretty Cool - AssimMon

Yeah, I know Monitoring Sucks, but it can be pretty neat as well. It’s a hard problem, so there are all kinds of interesting approaches to it. A new one I came across recently was the Assimilation Monitoring Project. I met Alan Robertson (Linux-HA founder) at a recent Cloud Computing meetup and in the post-meetup discussions he talked a bit about his project. One of the things he mentioned was that our monitoring systems spend the majority of their time (proportional to the quality of your system I guess) detecting and reporting that everything is ok. His project aims to distribute that task across your systems in a way that scales to quite large infrastructures & is inherently redundant.

I’m not sure how ready it is for prime time, but take a look – it sounds pretty interesting.

I think I’ll always have an interest in monitoring problems – not because I have to as part of my job, but because the problem has such a wide variety of potential solutions with different benefits. It’s also one of those areas that’s very rarely just plug-and-go; you have to architect it like any other service, which keeps it interesting.

Why Do We Insist on Consensus on the Role of Ops?

I’ve seen so many threads over the last few weeks about who should do what, why, and what you should do about it if you don’t conform. I don’t get it. Ops is a team in a company – there are lots of types of companies. Companies typically have a few goals:

1) Make money

2) Change the world, as long as we can do #1.

Lots of companies accomplish these goals doing things wrong. If you want proof, read Good to Great; there are oodles of examples of companies that didn’t qualify as “great” but that you would recognize as successful.

When wagon trains migrated families west across the US, the idea of driving 40mph, of crossing a state in a day, would have been crazy talk. Then came the locomotive.

When locomotives moved people across the country, the idea of a car making an interstate trip would have been crazy. It would have been madness if everyone operated their own car. Then came cars, and roads, and traffic signals, and road signs. This took time, lots of mistakes, lots of retrospectives, and year over year progress.

Progress isn’t made by conforming to the conventions of today, it’s made by pushing for something better. That’s what some folks are doing in Ops today – they are trying to push the limits and do what works for them. Others are observing these patterns and following suit. Still others are sitting back and saying “That ain’t right, my process works just fine”. Perfect.

It wasn’t necessary for automobile manufacturers to convince railroad operators that the car was the future. The car became the future because people adopted it, because it worked, and because over time the infrastructure that supported it became more mature.

As our tools get better, as our patterns become more and more repeatable, as we start to understand what roads & traffic signals & road signs we need for Ops to get out of the way of Developers making changes in production, things will move. In the meantime – talk about what works for you, why it works for you, and don’t bother convincing other people why it should work for them.

Sometimes It Takes 2 Days to Do 2 Hours of Work

I hear all the time:

“I could get that done in a few hours, easy”

“I could whip that up in 2 seconds”

So what? Instead of bragging about how awesome people should think you are because of how fast you say you can get things done, how about you ask some questions?

“When do you need it done by?”

“Do we have time to improve this other widget to make fixing your thing easier?”

We are so focused on trying to crank out as much as we possibly can that we sometimes think it’s better to talk about how quickly we can get things done. Instead, under-commit & over-deliver. If you have 2 days to get something done and you only need ½ day – spend some time improving something that will make your life easier. Understand what the expectations are before you commit & see if you can get in some extra benefit. If you hit a snag & end up taking 2 days to finish, then nobody is disappointed; but if you don’t, you get some extra work done & still meet expectations.

I know folks prefer accurate estimates & like to fill your day with the stuff they want done – but don’t complain that you can’t get your stuff done if you aren’t under-committing once in a while.

Getting Out of the Way - Monitoring

In recent years I’ve come to view Operations as a traditional bottleneck that developers have become comfortable with. I think this is changing rapidly, both to give Developers more visibility into how their application behaves in production & to allow faster delivery of value into production environments through things like continuous deployment.

One of the areas where Operations is often a bottleneck is monitoring. The traditional model is to have Ops ask Dev what metrics they need monitored & then set those up. This often means that monitoring can’t start until the metrics are available in the code, and even then it’s days or weeks later before some Ops person has time to set up the monitoring system to pull those metrics. This is broken and unnecessary.

If you are operating in a service delivery model where you have control over all the systems you monitor, you should be working to get out of the way. You should be working to make the monitoring happen automatically without Ops involvement. This doesn’t mean that Dev does all the work; what it means is that Ops selects monitoring systems that allow for discovery of new metrics & automatic collection of those metrics without additional incremental work each time.

Some of this is technology selection, some of this is architecture, and some of this is just doing the work. This does take work – but I would be hard pressed to find an example where the work required to set this up is not offset by the work saved in the long run by not having to respond to every new metric that gets added. Below are some concrete examples of what I’m talking about – if you aren’t familiar with these tools, take a look.

Metric Collection

The yammer metrics library has made it really easy to expose your application metrics automatically. It also provides hooks into tools like Ganglia and Graphite for pushing metrics to the monitoring system. As you look at how to scale a monitoring system, these are great tools to allow for that. Another popular data collection tool is statsd. The main idea is that you want collection tools that don’t need metrics pre-defined for them: if you give them a value for a metric, they track it – that’s all. The more often you send values, the more data points they store.
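As a minimal sketch of that “no pre-defined metrics” idea, here is roughly what emitting a counter and a timer to statsd could look like from Python using the statsd client package (the host, port, prefix, and metric names are assumptions for illustration):

```python
# Minimal sketch: emitting metrics to a statsd daemon from Python.
# Assumes the statsd client package is installed and a statsd daemon
# is listening on localhost:8125 - both are assumptions for illustration.
import statsd

client = statsd.StatsClient('localhost', 8125, prefix='myapp')

def handle_request():
    client.incr('requests')             # counter: one more request served
    with client.timer('request_time'):  # timer: how long the work took
        do_work()

def do_work():
    pass  # placeholder for the real request handling
```

Nothing is registered ahead of time: the first time a value for myapp.requests arrives, the collection side simply starts tracking it.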

Graph presentation

Ganglia is great for allowing you to programmatically define graphs and manage those via your CM system of choice, like Puppet or Chef. Another approach is something like Graphite, which provides a rich and generic UI for taking whatever metrics you collect & combining them into a graph. Building custom dashboards and the like is where Graphite’s strength lies.
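To make that concrete, Graphite’s render API lets you combine series on the fly without pre-defining a graph anywhere. A hypothetical sketch (the Graphite hostname and metric paths are assumptions for illustration) of pulling a combined series from a script:

```python
# Minimal sketch: asking Graphite's render API for a combined series.
# The hostname and metric paths are assumptions for illustration.
import urllib.parse
import urllib.request

GRAPHITE = "http://graphite.example.com"

# Sum the request counters from every app server into one series.
params = urllib.parse.urlencode({
    "target": "sumSeries(myapp.*.requests)",
    "from": "-1hours",
    "format": "json",
})

with urllib.request.urlopen(GRAPHITE + "/render?" + params) as resp:
    print(resp.read()[:300])  # raw JSON datapoints for the combined series
```

Swap format=json for format=png (or drop the format entirely) and the same target gives you an image you can drop straight into a dashboard.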

Alerting

Nagios. We all dislike it, but it works pretty well. The main advantage Nagios has over more “intelligent” systems is that it can be configured through your CM system of choice. Additionally, Nagios has a massive community behind it. When building out Nagios or whatever you use, do your best to drive your configuration through CM and try to get things to the point where you don’t have to do any incremental monitoring work for each new system you add. New systems that are the same type as a system that’s already defined should just get monitored for “free”.
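The shape of that “free” monitoring is usually just templated Nagios configuration driven from whatever inventory your CM system already knows about. A rough, hypothetical sketch (the inventory, roles, templates, and check commands are assumptions for illustration) of generating host & service definitions per role:

```python
# Rough sketch: generating Nagios host/service definitions from an inventory,
# so a new host of an existing role gets monitored with no extra hand work.
# The inventory, roles, templates, and check commands are assumptions.

INVENTORY = {
    "web01": "webserver",
    "web02": "webserver",
    "db01": "database",
}

CHECKS_BY_ROLE = {
    "webserver": [("HTTP", "check_http")],
    "database": [("MySQL", "check_mysql")],
}

def nagios_config(inventory):
    blocks = []
    for host, role in sorted(inventory.items()):
        blocks.append(
            "define host {\n"
            "    use        generic-host\n"
            "    host_name  " + host + "\n"
            "}\n"
        )
        for description, command in CHECKS_BY_ROLE.get(role, []):
            blocks.append(
                "define service {\n"
                "    use                 generic-service\n"
                "    host_name           " + host + "\n"
                "    service_description " + description + "\n"
                "    check_command       " + command + "\n"
                "}\n"
            )
    return "\n".join(blocks)

if __name__ == "__main__":
    print(nagios_config(INVENTORY))  # in practice your CM run writes this file out
```

Adding web03 to the inventory is the only step required for it to show up with the same checks as every other webserver – no Nagios editing, no ticket to Ops.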

Summary

Think of monitoring as a service, like any other application in your architecture. You want it to discover what’s out there and configure itself as much as possible. Doing this isn’t completely simple yet – but it’s possible, and if you set your mind to it you might even find a way to do it better that you can contribute back to the community. In doing this, try to get out of the way of your Developers and strive to have the metrics they expose in their application automatically show up in your monitoring system of choice. Make it very low cost for them to add new metrics to see new information, and you will probably be surprised at how much monitoring you get when all a Developer has to do is write the code to track the metric & it shows up in prod.

Overcoming Obstacles

Three days a week, unless there’s something seriously preventing it, I scarf down my lunch and head down to the bouldering gym for an hour of climbing. This isn’t a post about rock climbing, but in the short time I’ve been climbing (about 8 months) I’ve found the whole practice to change the way I think about some problems. Rock climbing is a physical sport, but it’s also a very mental sport. Climbing difficult routes is a combination of strength, technique and determination. The process of learning is interesting as well, because you don’t hear a lot of people giving advice in the climbing gym. Folks lead by example.

So, I wanted to draw some parallels between my observations about climbing and working through obstacles in tech. We deal with new challenges every day and some can be pretty intimidating. We also learn and teach a lot, and I think there are some lessons about that as well.

Don’t allow obstacles to defeat you before you start.

When you approach a new route it can be intimidating: you aren’t really sure what to expect or which part of the route is going to be most difficult. This is true of approaching many problems, but just because a problem looks intimidating does not mean it cannot be overcome. There is a lot we don’t understand until we are working our way through a problem, and telling yourself you can’t do this isn’t going to help. Make one move at a time & do your best – rarely is the situation so critical that you cannot afford to adjust as you learn.

When you miss, inspect and adapt.

The bouldering gym is full of big pads. Those aren’t there because nobody ever falls. Everybody falls. This is part of the process of challenging yourself, part of the process of trying new things. You go to the gym to fall because it’s safe to challenge yourself there & learn how to improve.

Too often I hear folks who are afraid to fall, afraid they might choose the wrong path when working through a problem in their life, their career, or some technical issue. It just isn’t possible to know the right path 100% of the time, so don’t bother trying – just do your best. When the inevitable fall happens, take another look at your moves, try to understand what went wrong, and try again a different way.

Inspecting and adapting to what you learn is one of the greatest skills you can learn. Freeing yourself to make mistakes removes a lot of barriers that you thought were there when they actually weren’t.

Watch others

In the gym this means literally watching other climbers. Some climb with such grace that they make things look easy. This is true of a lot of things – so look at what others are doing. We all experience problems in different ways and we all solve them in different ways; learning from each other is key to progress. But keep in mind that what works for one may not work for another. A tall person will climb a route much differently than a short person: they have longer reach, and they have a different center of gravity. Use the ideas you see, but don’t get too upset if those same ideas don’t work for you.

Be patient

When you first start to climb, as when you first get into most things, there is a period of fast improvement. You feel great, you are learning fast, you must be awesome. As you learn more and start to approach more difficult challenges, your ascent will seem to slow. You are getting better, you are learning, but it’s not as easy as it used to be. Once you’ve been doing this for 15 years, the problems that are hard to overcome aren’t about learning some new programming language or dealing with some new technology; they are the finer points that actually make you better day to day. Those things take time to overcome – they are hard problems that require discipline and persistence like you have never needed before.

Climbers who have been climbing for many years will tell you that it becomes very hard to progress to the next level. Each progressive level requires significant improvement & a lot of work. You have to be patient & keep at it, you have to love climbing to climb, and you will improve.

Be Helpful

The only reason I am where I am is that there were people who were willing to help me along the way. When I first set foot in the climbing gym there were people who showed me the basics. When I was clearly struggling with a route, there were people who climbed it & offered advice. When I have had problems finding that missing semicolon in a sea of code, there have been others willing to lend a second set of eyes to find it.

We need each other to overcome obstacles and we each bring a different set of skills to the table. Being helpful contributes to that, just as you have leveraged others’ helpfulness to get where you are. Give back and help out.

Be Nice

It’s easy to be arrogant. It’s easy to tell someone that your challenges are more difficult than theirs. It also serves no one but yourself. Be kind as you work your way through your challenges because relationships matter more than any ability you could ever learn.