Worked fine in DEV, OPS problem now

During the past year at work, we did a complete rewrite of our websites from scratch. Not only did we aim to build a mobile-first responsive website with high performance, we also tried to do it with continuous integration and continuous delivery in mind. All that on a proprietary platform not built with CI in mind. This was a very big challenge, which involved a culture change in a lot of people. Unfortunately, the project had a hard deadline. Things were left out. Corners were cut.

We did succeed in making one of the fastest websites among its competitors. In many aspects the rewrite was a success. We definitely delivered a better implementation than the previous one, and this one can grow. To give a hint, the old implementation had no tests whatsoever and people could upload custom code directly to production.

However, people are a bit happier with the result than they should be, mostly because they judge its success by the sleek appearance and impressive load times of the frontend. This reminds me of the following cartoon:

[Cartoon: frontend vs backend]

While things are not looking bad on the frontend, our OPS are starting to discover what goes on in the backend. To name a few:

  • no monitoring, especially on third party services
  • random error logs with poor organization
  • poor implementation for queue-like operations

(We left out a lot more on the backend, but this is what OPS notices most.)

This means a lot of manual work for OPS and a knee-jerk reaction to every incident, which in turn results in ad-hoc panic and overtime.

Some interesting points emerge here:

  • OPS is an island. Our OPS do a great job as a first line of defense. However, they have nobody available to actually fix these things. We do have development teams working on the same platform, but they are working on new business features. OPS has no way of prioritizing a big improvement or a new OPS feature.
  • Most of the things OPS is missing were actually planned and thought of in advance. But due to the “Deadline”, they were never implemented. This is partly a matter of technical debt: the business doesn’t pay its debt. It is understandable to have a deadline for various reasons. But continuing to implement new features on an effectively half-done platform will lead to problems down the road.

At the same time, we have four scrum teams focusing on that platform, creating new features requested by the business. When they estimate a feature, they of course won’t take into account OPS-needed features that are completely missing from the platform. For example, if they’re making a new frontend widget, they won’t take into account that we need automatic checks for valid HTML, since we haven’t built this at all. If they’re making a new webservice, they won’t take into account that we need live monitoring of the service’s availability on production.
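To make the webservice example concrete, here is a minimal sketch of what such an availability check could look like. It is only an illustration under stated assumptions: the names `CheckResult`, `http_status` and `check_availability`, the service name, and the URL are all hypothetical and not taken from our platform, and a real setup would feed the result into whatever alerting tool OPS uses.

```python
from dataclasses import dataclass
from typing import Callable
import urllib.error
import urllib.request


@dataclass
class CheckResult:
    name: str    # hypothetical label for the monitored service
    ok: bool     # True when the service answered with a 2xx status
    detail: str  # short human-readable explanation for the OPS log


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the code is still useful.
        return err.code


def check_availability(
    name: str,
    url: str,
    fetch_status: Callable[[str], int] = http_status,
) -> CheckResult:
    """Probe one service and report whether it looks available."""
    try:
        status = fetch_status(url)
    except OSError as err:
        # URLError and socket timeouts are OSError subclasses.
        return CheckResult(name, False, f"unreachable: {err}")
    return CheckResult(name, 200 <= status < 300, f"HTTP {status}")


# Demo with a stubbed fetcher so the sketch runs without network access.
# The service name and URL are placeholders (assumptions).
result = check_availability(
    "orders-service",
    "https://example.invalid/health",
    fetch_status=lambda url: 503,
)
print(result)
```

The fetcher is passed in as a parameter so that the check itself can be unit tested without touching production, which is also what would make it cheap to include in a feature estimate.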

Why do the developers not take these things into account?

  • They don’t necessarily have that mindset. Developers often think their job is done when the product owner approves the ticket.
  • There is no OPS member in the team during planning poker. OPS is an island.
  • Even if senior developers propose that we implement service monitoring, the product owners will be surprised (to say the least), since we never did it before. And they already have a Deadline imposed on them anyway, so this will drop out again and become technical debt that will not be paid.
  • The norm is already defined by the existing implementation.

How to go about fixing these problems is beyond my knowledge, and not my responsibility to begin with. But I find these problems interesting as well as painful. I think it is perhaps useful to see what other companies are doing. How are they organizing their teams? What talents are they hiring for? Are developers empowered to work on things they think have priority? Or are developers passively working solely on business priorities? Do developers have a passion for technology or is it just a job? Does the business understand the importance of having a strong technology culture or is it just a necessity to keep the machine working?

Some of these questions are broad and identity defining for an organization. I have ideas but not answers, in the sense that I don’t know what would definitely work.

Back to the OPS challenges: I was thinking of a dedicated team of developers that serves OPS. However, this can easily degenerate into a cleanup team, repairing half-baked features delivered by the other teams. It doesn’t solve the problem at its root. It could, though, solve the problem of repairing existing bad implementations without leaving OPS frustrated and feeling helpless.

Perhaps what is missing is the DevOps culture. I like test-driven development, in which you start with a red test and try to make it green. The software world should come up with a similar methodology, in which developers start implementing a feature by implementing the production sensors that check whether the feature is working. Just like you start with a red test in TDD, you should start with a red production OPS alert.
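As a rough sketch of that idea, assuming a hypothetical `SensorBoard` class and a made-up newsletter signup feature (neither exists on our platform), the workflow could look like this: the sensor is registered first and reports red, and it only turns green once the feature is actually live. The `production` dictionary is just a stand-in for the real system.

```python
from typing import Callable, Dict

Sensor = Callable[[], bool]


class SensorBoard:
    """Hypothetical registry of production sensors, one per feature."""

    def __init__(self) -> None:
        self._sensors: Dict[str, Sensor] = {}

    def register(self, name: str, sensor: Sensor) -> None:
        self._sensors[name] = sensor

    def status(self) -> Dict[str, str]:
        """Run every sensor; a failing or crashing sensor counts as red."""
        report = {}
        for name, sensor in self._sensors.items():
            try:
                healthy = sensor()
            except Exception:
                healthy = False
            report[name] = "green" if healthy else "red"
        return report


# Stand-in for the live system (an assumption for this sketch).
production: Dict[str, Callable[[str], str]] = {}


def newsletter_signup_works() -> bool:
    # Raises KeyError until the feature is deployed, so the sensor is red.
    handler = production["newsletter_signup"]
    return handler("probe@example.com") == "subscribed"


board = SensorBoard()

# Step 1: write the sensor before the feature exists. It starts red.
board.register("newsletter_signup", newsletter_signup_works)
before = board.status()
print(before)

# Step 2: implement and deploy the feature. The same sensor turns green.
production["newsletter_signup"] = lambda email: "subscribed"
after = board.status()
print(after)
```

The point of the design is that the alert exists before the feature does, so "done" can only mean "the production sensor is green", the same way a passing test defines "done" in TDD.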

Maybe one day?

 

3 thoughts on “Worked fine in DEV, OPS problem now”

  1. At all the companies I’ve worked for, technical debt has been a serious issue. Of course some technical debt gets cleaned up, but especially for front-enders, deadlines are more important. Overhauling the project’s JavaScript, for example, is kind of a big deal.

    In my experience, technical debt/OPS is way too far down the priority queue in the SCRUM process.

    I’m quite the perfectionist, so dealing with these kinds of issues is annoying to me.

  2. I feel your pain Nik!

    I think even in the modern world of agile and devops, these remain old problems. They say it takes more discipline, and not less, to do agile properly, and that’s true. So it ends up being about the requirements. In this case, it’s about what some people like to call “non-functional” requirements, and that label can lead to them being seen as less important. It’s up to us as technicians to articulate why airy-fairy things like “reliability” or “agility” have business value, or at least to verify with business stakeholders whether they do.

    You say that ops aren’t there during planning poker – strictly that’s correct if they aren’t committing to the delivery, but you are right that it’s a problem. It’s important at least to ensure that technical requirements get discussed and that their relative level of importance is understood by everyone. Perhaps then you can work towards getting some of these things into the “definition of done”. Then they’ll become part of the commitment and should be pokered for. This should be so even if you move towards devops.

    Of course, in real life, not everyone gets to follow a “pure” process. Pragmatism is still a virtue. I’ve even heard of teams appointing an “architecture owner” alongside the product owner. Maybe that helps to give an appropriate person enough of a voice. Sooner or later, it ends up being a matter of translating what the business needs into practice, whether that is a site feature, being able to make and deploy features quickly, or keeping it all running.
