Building software is hard. Building cloud software is even harder because things move much faster and the services demand mission-critical reliability and availability. To build software effectively in the cloud, engineering teams need observability, CI/CD, reporting, and plenty of other tooling. But the tools available to engineering teams never quite fit together in a way that provides visibility and consistency. When things go wrong, developers scramble to troubleshoot with disparate data sets and disconnected systems.
TechOps teams are in charge of keeping everything running. But poorly integrated toolsets leave teams with several interfaces and data sets to wrangle when operating critical services. Teams often try to solve this problem by creating one-off integrations between out-of-the-box tools and internally developed tooling and processes. These integrations are generally very shallow, and they create a significant maintenance burden and reliability gaps.
Custom integrations add more places to store data and a wider pool to search, resulting in a decentralized view of the data sources and no easy way for developers to collaborate. What’s needed is an open source-based control center for collaboration that integrates properly with current systems, with no more copying and pasting. But it’s important to make this centralized command center work for everyone at the organization, not just front-line developers and SREs.
Challenges at every level
Challenges in operations, monitoring, and incident response exist at every level of our organizations. TechOps teams are focused on hosting, deployment, and reliability of services. These teams have specific concerns to address before, during, and after a potential incident. How can developers get early warning of a service outage? How do we sort through large volumes of monitoring data to troubleshoot failures? How do we track status and progress during an incident? How do we document the work that was done to restore the service? How do we gather all of the relevant incident information for the retrospective and RCA documents?
Let’s say there’s a service-interrupting issue. At the developer level, the teams need detailed monitoring and log data. Having a centralized control center provides easier access to this data, improving efficiency and offering perspectives on how to solve future problems.
Engineering leads have roughly the same goals as developers on the front lines of the issue, but they are more focused on high-level, business-oriented trends. This broader perspective means that they primarily want a less granular view of outage data. These users will spend more of their time analyzing trends in outages over time, understanding the current status and next steps for an ongoing incident, and ensuring proper communication with internal and external stakeholders.
At the senior management level, executives need high-level answers to explain problems to their customers. During major service disruptions, CEOs are often in constant communication with their major stakeholders, providing status updates and explaining why services went down. Rather than granular outage data, these discussions rely on high-level but informed and actionable business insights.
Addressing the disconnect with open source
Clear data and collaborative workflows are critical at every level of an organization. But the real power lies in integration — not standalone solutions. By leveraging the flexibility of open source software, teams can create collaboration systems that reduce downtime, avoid confusion, enable speed, and increase efficiency.
When compared to internally developed one-off systems, open source solutions typically scale better, provide higher quality and reliability, and lower the overall maintenance burden for TechOps teams. Creating a streamlined Ops process with proper visibility and integrations improves developer productivity. It also boosts workplace satisfaction and helps reduce developer burnout.
One of the major problems with custom in-house tooling for TechOps is maintenance. This tooling may work great when it’s first built. But over time, requirements shift, and maintenance work for internal tooling often falls to the bottom of the priority list. Meanwhile, new tools are inserted into the tech stack, and common dependencies aren’t always updated and managed appropriately. The result? The tooling we all rely on breaks in an ugly way as soon as we have an incident or outage. This leaves teams scrambling to restore critical services without proper visibility into, or control over, their systems.
Implementing an open source solution also improves a team’s ability to maintain the software needed to solve future problems. When organizations adopt open source, they gain access to the underlying source code, backed by a community of independent contributors, with flexible, layered extensibility. This allows the team to speed up maintenance and deployment of the software so they can focus on solving issues quickly and improving systems for better operations in the future.
Flexibility is one of the top traits organizations look for in developers. But to achieve complete flexibility, organizational software needs to match these human expectations. Without open source enabling this flexibility, TechOps is a mess. On the other hand, integrating tools into a centralized view makes cross-organizational collaboration easier and addresses diverse challenges at every level of a modern organization.
Chris Overton is Vice President of Engineering at Mattermost, Inc. Previously, Chris led engineering at Elastic, where he was also responsible for the Cloud product division. Chris is an expert in building and operating public and hybrid SaaS services, distributed systems, analytics/processing of large data sets, and search.