Key Takeaways

Maintaining a sysadmin approach for cloud-based IT operations results in hidden costs that can be mitigated by adopting Site Reliability Engineering (SRE).

Leveraging an SRE-driven model will reduce costs by:

  • reducing the headcount required to provision and operate applications in the cloud,
  • improving customer experience through better reliability,
  • optimizing resource consumption,
  • increasing automation,
  • and ensuring quality continuously and effectively.

Introduction: The Cost Problem of Cloud Operations

Many organizations have succeeded, partially or fully, in moving their business-critical applications and infrastructure to the cloud. This transition has often been combined with the adoption of DevOps to empower development teams and achieve higher delivery velocity.

However, many of the same organizations are struggling to manage the costs of their new cloud- or hybrid-based operations departments and are looking for practical solutions.

In this article, we present Site Reliability Engineering (SRE) as a strategic initiative to drastically reduce the cost of cloud-based IT operations.

 

The Hidden Cost of the Sysadmin Approach in Cloud-Based IT Operations

“Sysadmins are tasked with running the service and responding to events and updates as they occur. [...] Direct costs are neither subtle nor ambiguous. Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system” (Google SRE Book, 2016).

As early as 2003, Google recognized that the sysadmin approach to running distributed and dynamic systems can become very expensive. The same holds for cloud adoption: moving to the cloud introduces an additional infrastructure silo that needs to be managed. Further, increasing the velocity of delivery creates demand for additional sysadmin work because releases become more frequent.

Cloud-based architectures often involve multiple services, microservices, and distributed systems, which inherently increase the complexity of the infrastructure. This complexity arises from the need to manage the various components, their interactions, and their dependencies. Additionally, the dynamic nature of cloud environments, where resources can be scaled up or down based on demand, leads to a higher volume of events that need to be monitored and managed. This includes events related to resource provisioning, scaling, failures, and performance metrics.

In other words, the demands on the sysadmin team will increase exponentially, because complexity, velocity, and the number of managed elements and events generated by cloud and DevOps all increase simultaneously.

 

Our Effective SRE and Operations Efficiency Approach

At Digital Architects Zurich, a cell of the Swiss Digital Network, we started considering this impact back in 2020 (see blog posts referenced below). Especially if you target both high velocity and high reliability, which is the main promise of DevOps, keeping the sysadmin approach to running and operating systems will not scale.

Therefore, one of our main contributions in the Swiss market was to democratize SRE by building and deploying Effective SRE as a practical framework for a new operating model (see, for example, our blog posts on the base Effective SRE methodology and on specific capabilities for continuous verification and observability, as well as our public talk at the Swiss Testing Day / DevOps Fusion in 2021).

The SRE then becomes a key role in the cloud operating model, taking on essential responsibilities in driving specification, design, testing, observability, and operations towards proactive and cost-efficient assurance of service levels, such as availability and performance, both throughout the pipeline and in operations.

The Effective SRE is responsible for assuring SLOs by:

  • Co-Building, Maintaining & Operating AI-driven CD/DevOps Pipeline (jointly with DevOps Teams)
  • Co-Building, Maintaining & Operating AI-driven IT Operations Management (Observability/Monitoring, AIOps, Alerting, ChatOps, …)
  • Co-Building, Maintaining & Operating the SRE cockpit & dashboards (incl. SLO-Monitoring-, CD-, & Emergency-Status Dashboards) 

In our Effective SRE definition, cost efficiency is one of the main objectives, and “Operations Efficiency” is one of the three dimensions of the methodology, alongside “SLO Engineering” and “Continuous Delivery”.

Cost Reduction Factors by an SRE-Driven Operating Model

The following is a list of concrete cost reductions that an SRE-driven operating model can deliver. Note that this list is not exhaustive, and the impact varies from environment to environment.

SLO Engineering

SLO Engineering: a systematic approach to specifying and managing SLIs and SLOs, using advanced techniques such as observability and AIOps on the one hand, and embedding reliability design and testing patterns while building and testing the system on the other.

Cost reduction impact:

  • Preventing the over-allocation of resources for unnecessary performance improvements, leading to better alignment of operational costs with business value.
  • The focus on SLOs improves the quality of alerting, which leads to better and faster incident response (lower MTTD and MTTR; see the DORA metrics, and note the introduction of reliability as a key metric).

Example: SREs might prioritize 99.5% uptime over 99.999% uptime if the business impact does not justify the additional expense.
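To make the trade-off in this example tangible, the following minimal sketch converts an availability target into the downtime it tolerates. The function name and the 365-day period are assumptions chosen for illustration. Under these assumptions, 99.5% tolerates roughly 43.8 hours of downtime per year, while 99.999% tolerates only about 5 minutes, which is why the stricter target is so much more expensive to guarantee.

```python
def allowed_downtime_hours(slo_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime (in hours) that an availability target tolerates.

    slo_percent: availability target in percent, e.g. 99.5
    period_days: length of the evaluation period (assumed: one 365-day year)
    """
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    unavailability = 1.0 - slo_percent / 100.0
    return unavailability * period_days * 24.0


if __name__ == "__main__":
    for target in (99.5, 99.999):
        hours = allowed_downtime_hours(target)
        print(f"{target}% availability tolerates {hours * 60:.1f} minutes of downtime per year")
```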

Operations Efficiency through Automation

Operations Efficiency through Automation: One of the key principles of SRE is eliminating toil by automating repetitive and manual tasks such as incident analysis, triage, infrastructure provisioning, and test analysis.

Cost reduction impact:

  • AIOps, ML-driven testing, and Infrastructure as Code (IaC) are modern techniques that reduce the need for manual intervention, thereby minimizing human error, optimizing labor costs, and increasing operational efficiency.

Example: Automated runbooks can handle common incidents, reducing the need for 24/7 operations teams, and automated root-cause analysis can drastically reduce the human effort required for incident management.
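As a rough illustration of the automated runbook idea, the sketch below maps known incident types to remediation functions and escalates everything else to a human. All names (the incident types, `restart_service`, `scale_out`, `handle_incident`) are hypothetical; a real runbook would call the APIs of your own platform and alerting tools instead of returning strings.

```python
from typing import Callable, Dict

Incident = Dict[str, str]


# Hypothetical remediation actions. They only return a description here;
# in practice they would trigger your orchestration or cloud tooling.
def restart_service(incident: Incident) -> str:
    return f"restarted {incident['service']}"


def scale_out(incident: Incident) -> str:
    return f"scaled out {incident['service']}"


# Assumed mapping from incident type to automated runbook.
RUNBOOKS: Dict[str, Callable[[Incident], str]] = {
    "service_unresponsive": restart_service,
    "high_cpu": scale_out,
}


def handle_incident(incident: Incident) -> str:
    """Run the matching runbook, or escalate to the on-call engineer."""
    action = RUNBOOKS.get(incident["type"])
    if action is None:
        return f"escalated {incident['type']} on {incident['service']} to on-call"
    return action(incident)
```

Only incidents without a matching runbook reach a person, which is the mechanism behind the reduced need for round-the-clock manual operations.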

Continuous Delivery

Continuous Delivery: SRE enables Agile and DevOps teams to implement automated quality gate verification and deployment pipelines, which facilitate faster, more efficient deployments, resulting in fewer bugs, quicker rollbacks, and faster incident recovery.

Example: The number of FTEs and meetings required to make staging decisions collectively will be drastically reduced, saving labor costs for developers and managers.
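An automated quality gate of this kind can be sketched as a simple function that compares measured service-level indicators from a test stage against agreed thresholds and returns a promote-or-block decision. The threshold values and the names `GateThresholds` and `evaluate_quality_gate` are assumptions for illustration, not part of any specific pipeline product.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GateThresholds:
    # Assumed example thresholds; real values come from your SLOs.
    max_error_rate: float = 0.01        # at most 1% failed requests
    max_p95_latency_ms: float = 500.0   # 95th percentile latency limit


def evaluate_quality_gate(
    error_rate: float,
    p95_latency_ms: float,
    thresholds: GateThresholds = GateThresholds(),
) -> Tuple[bool, List[str]]:
    """Return (passed, violations) for one build's test-stage measurements."""
    violations: List[str] = []
    if error_rate > thresholds.max_error_rate:
        violations.append(
            f"error rate {error_rate:.2%} exceeds {thresholds.max_error_rate:.2%}"
        )
    if p95_latency_ms > thresholds.max_p95_latency_ms:
        violations.append(
            f"p95 latency {p95_latency_ms:.0f} ms exceeds "
            f"{thresholds.max_p95_latency_ms:.0f} ms"
        )
    return (not violations, violations)
```

Because the decision and its reasons are produced by the pipeline itself, the staging decision no longer depends on a meeting.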

Error Budgets

Finally, it is worth noting that introducing Error Budgets based on Reliability as a cost-balancing metric will help reduce the wasteful expenditure of resources on unnecessary service levels.
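As a minimal sketch of how an error budget can be calculated, assume a request-based SLO: the budget is the share of requests that may fail, and the remaining budget is what is left after actual failures are subtracted. The function name and the event counts in the usage note are hypothetical.

```python
def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Return the unspent fraction of the error budget.

    slo_target:  target success ratio, e.g. 0.995 for 99.5%
    good_events: number of successful requests in the window
    total_events: number of all requests in the window
    A negative result means the budget is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0 or not 0 <= good_events <= total_events:
        raise ValueError("event counts are inconsistent")
    allowed_failures = (1.0 - slo_target) * total_events
    actual_failures = total_events - good_events
    return 1.0 - actual_failures / allowed_failures
```

For example, with a 99.5% target and 1,000,000 requests, 5,000 failures are allowed; if 2,500 requests failed, half of the budget remains. While budget remains, further reliability investment is not required by the SLO, and once it is exhausted, effort shifts from new features to stability.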

 

Conclusion: Adopting SRE in Your Organization

Once you understand how an SRE-driven operating model can assist IT Managers in lowering the expenses of cloud-based operations, the next logical question is: How do we implement and integrate an SRE-driven operating model for the hybrid or cloud-based environment within our organization?

The answer in one sentence: A culture-first holistic transformation is required to establish SRE as the new operating model!

Get in Touch to Learn More

Please let us know if you have comments or would like to understand how we can help you adopt Site Reliability Engineering. With our skills in culture-first transformation consulting, our expertise in upskilling organizations through proven training, and our engineering capabilities, we can support the success of your transformation journey.

Write us at: info@digital-architects-zurich.ch

 

References
