What is Stabilize? | SAFe DevOps Health Radar

Author
Romano Roth
I believe the next competitive edge isn’t AI itself; it’s the organisation around it. As Chief AI Officer at Zühlke, I work with C-level leaders to build enterprises that sense, decide, and adapt continuously. 20+ years turning this conviction into practice.

After we release a new feature to our users, we need to make sure everything runs smoothly. Stabilize is the SAFe DevOps Health Radar activity that focuses on maintaining a high level of business continuity so we can continuously deliver value to our customers. In this video, I walk through what Stabilize involves and why it is essential for a stable, resilient production environment.

Where Stabilize Fits in the Pipeline

The SAFe DevOps Health Radar starts with bright ideas from the customer or business. We extract a hypothesis, create an epic, collaborate and research to identify the real customer need, architect the minimal amount needed, and break the epic into features. We develop user stories, commit code, build deployable artifacts, test end-to-end, deploy to staging, and then deploy to production with the feature toggle off. After verifying in production and monitoring, we respond to any incidents. When the time is right, we switch the feature toggle on and release the feature to users.

Now comes Stabilize. The feature is live and we need to ensure that everything continues to work reliably.

Architecting for Operability

If you recall the Architecture step from earlier in the series, that is where we architect for operability. All of those decisions come together in Stabilize. If we have not built proper operational capabilities into our system, we will have a hard time now.

Good operability means (see the sketch after this list):

  • Comprehensive logging that captures all the data we need during stabilization
  • Telemetry to observe system behavior in real time
  • Feature toggles to switch features on and off when needed
  • Recovery mechanisms so we can respond quickly when something goes wrong
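
As a minimal sketch of what these capabilities can look like in code, here is a hypothetical service with structured logging, a timing measurement for telemetry, and an in-memory feature toggle that doubles as a recovery mechanism. All names (`FeatureToggles`, `checkout_v2`, the log fields) are invented for illustration; a real system would back the toggles with a feature-management service.

```python
# A minimal operability sketch, not a production framework.
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("shop")

class FeatureToggles:
    """In-memory toggle store; real systems back this with a config service."""
    def __init__(self, flags):
        self._flags = dict(flags)

    def is_on(self, name: str) -> bool:
        return self._flags.get(name, False)

    def set(self, name: str, on: bool) -> None:
        self._flags[name] = on  # the recovery mechanism: flip off, no redeploy

toggles = FeatureToggles({"checkout_v2": True})

def handle_checkout(order_id: str) -> str:
    start = time.monotonic()
    variant = "v2" if toggles.is_on("checkout_v2") else "v1"
    # ... the actual business logic would run here ...
    elapsed_ms = (time.monotonic() - start) * 1000
    # Structured log line: telemetry pipelines can parse and aggregate this.
    log.info(json.dumps({"event": "checkout", "order_id": order_id,
                         "variant": variant, "elapsed_ms": round(elapsed_ms, 2)}))
    return variant

handle_checkout("A-1001")          # runs the new code path
toggles.set("checkout_v2", False)  # something looks wrong: switch it off
handle_checkout("A-1002")          # immediately back on the old path
```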

Disaster Recovery

Failures will happen in production. Even the big players like Google, Amazon, and Facebook have experienced major outages. That is why a disaster recovery strategy is essential.

Your features need to be designed to support disaster recovery. In a disaster scenario, you should be able to switch off recently deployed features to isolate the problem, as sketched below. Most importantly, the disaster recovery strategy must be rehearsed often, ideally automated, and tested regularly.
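
The sketch below shows one possible kill switch for that scenario: disable every feature deployed within the last N hours to shrink the blast radius. The toggle registry and feature names are hypothetical; in practice this data would come from your feature-management service.

```python
# Hedged sketch: disable all recently deployed features during an incident.
from datetime import datetime, timedelta, timezone

# Toggle registry: name -> enabled flag plus deployment timestamp.
toggles = {
    "checkout_v2":   {"on": True, "deployed_at": datetime(2024, 5, 2, 9, 0, tzinfo=timezone.utc)},
    "new_search":    {"on": True, "deployed_at": datetime(2024, 5, 3, 14, 30, tzinfo=timezone.utc)},
    "legacy_export": {"on": True, "deployed_at": datetime(2023, 11, 1, tzinfo=timezone.utc)},
}

def kill_recent_features(hours: int, now=None) -> list[str]:
    """Switch off every feature deployed in the last `hours` hours."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    disabled = []
    for name, t in toggles.items():
        if t["on"] and t["deployed_at"] >= cutoff:
            t["on"] = False
            disabled.append(name)
    return disabled

# During an incident: isolate the blast radius first, then investigate.
print(kill_recent_features(hours=48, now=datetime(2024, 5, 4, 10, 0, tzinfo=timezone.utc)))
# -> ['new_search']  (checkout_v2 is older than 48 hours in this example)
```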

Proactive Detection and Cross-Team Collaboration

When we have a proper monitoring system in place, we can create alerts on dangerous thresholds. If a threshold is reached, the team gets notified and cross-team collaboration begins. The whole team working across the value stream analyzes the problem together, not against each other, and solves it together.
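
A minimal sketch of such threshold alerting follows. The metric names, thresholds, and the `notify()` stand-in are invented; a real setup would wire this to a monitoring system and a pager or chat integration.

```python
# Evaluate current metrics against configured limits; notify on breach.
THRESHOLDS = {
    "error_rate_pct":   5.0,   # alert if more than 5% of requests fail
    "p95_latency_ms": 400.0,   # alert if the 95th percentile exceeds 400 ms
    "cpu_util_pct":    90.0,
}

def notify(team: str, message: str) -> None:
    # Stand-in for a real pager/chat integration.
    print(f"[ALERT -> {team}] {message}")

def evaluate(metrics: dict[str, float]) -> None:
    for name, value in metrics.items():
        limit = THRESHOLDS.get(name)
        if limit is not None and value > limit:
            notify("value-stream-team",
                   f"{name} = {value} breached threshold {limit}")

# One evaluation cycle; a real monitor runs this on every scrape interval.
evaluate({"error_rate_pct": 7.2, "p95_latency_ms": 310.0, "cpu_util_pct": 64.0})
```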

After every incident, we conduct an incident post-mortem. We identify what we can do to prevent such an incident from happening again and implement measures accordingly.

Security in Production

In the Build step, we already scan for security vulnerabilities in our application code and libraries. But scanning only newly created code is not enough. Code that already runs in production also needs continuous attention.

We use Security Information and Event Management (SIEM) systems to provide real-time analysis of security alerts. This way we continuously monitor our running services for vulnerabilities, attacks, and intrusions.
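
To make that concrete, here is a hedged sketch of one SIEM-style correlation rule: flag a burst of failed logins from a single source IP inside a short window. Real SIEM products apply many such rules over aggregated logs; the event shape and thresholds here are invented.

```python
# One correlation rule: N failed logins from the same IP within a window.
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_FAILURES = 5

failures = defaultdict(deque)  # source_ip -> timestamps of failed logins

def ingest(event: dict) -> None:
    if event["type"] != "login_failed":
        return
    ip, ts = event["source_ip"], event["ts"]
    q = failures[ip]
    q.append(ts)
    while q and ts - q[0] > WINDOW_SECONDS:
        q.popleft()  # drop failures that fell out of the window
    if len(q) >= MAX_FAILURES:
        print(f"[SIEM] possible brute force from {ip}: "
              f"{len(q)} failed logins in {WINDOW_SECONDS}s")

# Simulated log stream: six failures from one IP within a minute.
for i in range(6):
    ingest({"type": "login_failed", "source_ip": "203.0.113.7", "ts": 100 + i * 8})
```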

Monitoring Non-Functional Requirements

Back in the Test End-to-End step, we defined our non-functional requirements and set up automated tests for them. In Stabilize, we continuously monitor all of these requirements: reliability, performance, maintainability, scalability, usability, and more.

Non-functional requirements are constraints on every backlog item. Everything we change in the system must comply with them. That is why continuous monitoring is critical.

SLIs, SLOs, and SLAs

Understanding service level indicators, objectives, and agreements is crucial for operating a system effectively.

Service Level Indicator (SLI): A quantitative measure of an important aspect of the service, evaluated against a specific criterion. For example: the 95th-percentile response time on a given interface must be below 400 milliseconds.

Service Level Objective (SLO): The target for how often the SLI must be met over a certain period. For example: the 95th-percentile response time must be below 400 milliseconds in 90% of cases over the next 30 days.

Service Level Agreement (SLA): The agreement with clients and users that defines consequences when an SLO is breached. For example: if the SLO is breached, penalties apply or customers are lost.
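
The sketch below works through the example from these definitions: compute the 95th-percentile response time per day (the SLI) and check that it stays below 400 ms on at least 90% of the 30 days (the SLO). The sample data is simulated.

```python
# Worked SLI/SLO example: p95 latency per day, checked against a 90% target.
import math
import random

random.seed(7)

def p95(samples: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    s = sorted(samples)
    return s[math.ceil(0.95 * len(s)) - 1]

# 30 daily measurement intervals, each with simulated response times in ms.
days = [[random.gauss(300, 50) for _ in range(200)] for _ in range(30)]

sli_per_day = [p95(day) for day in days]
compliant = sum(1 for v in sli_per_day if v < 400.0)
compliance_pct = 100.0 * compliant / len(sli_per_day)

print(f"SLI (p95 latency < 400 ms) met on {compliant}/30 days -> {compliance_pct:.1f}%")
print("SLO (>= 90% of days) satisfied:", compliance_pct >= 90.0)
```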

Site Reliability Engineering

SRE was introduced at Google around 2003. Site Reliability Engineers are highly skilled, T-shaped engineers who are strong in both development and operations. They maintain highly scalable and highly reliable systems.

The relationship between SRE and DevOps depends on your availability targets:

  • Five nines (99.999%): Only 864 milliseconds of downtime per day, 5.26 minutes per year. SREs are best suited here.
  • Four nines (99.99%): SREs add significant value at this level as well.
  • Three nines and below: DevOps practices are typically sufficient.

The higher the availability target, the more challenging the engineering becomes. SREs bring the specialized skills needed for those demanding requirements.
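
The arithmetic behind the "nines" is simple enough to verify directly; this short snippet reproduces the figures in the list above.

```python
# How much downtime a given availability target allows.
def downtime_budget(availability_pct: float) -> dict[str, float]:
    unavailable = 1.0 - availability_pct / 100.0
    return {
        "per_day_seconds": 86_400 * unavailable,
        "per_year_minutes": 365.25 * 86_400 * unavailable / 60.0,
    }

for target in (99.9, 99.99, 99.999):
    b = downtime_budget(target)
    print(f"{target}%: {b['per_day_seconds']:.3f} s/day, "
          f"{b['per_year_minutes']:.2f} min/year")
# 99.999% -> 0.864 s/day (864 ms) and about 5.26 min/year,
# matching the five-nines numbers above.
```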

The Maturity Levels

The SAFe DevOps Health Radar provides a maturity assessment for Stabilize:

  • Sit: We experience frequent unplanned outages and/or security breaches with long recovery times.
  • Crawl: We experience occasional unplanned outages but recover within our service level agreements.
  • Walk: We have very few unplanned outages. Availability, security, and disaster recovery measures are effective.
  • Run: We have no unplanned outages. We plan and rehearse failure and recovery.
  • Fly: We maximize resiliency by deliberately injecting faults into our production environment and rehearsing recovery procedures.
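
To illustrate the Fly level, here is a hedged sketch of deliberate fault injection: randomly terminate one instance of a service and verify that the service survives. The service names, the `terminate()` hook, and the health check are all hypothetical stand-ins; real setups use dedicated tooling such as Chaos Monkey.

```python
# Chaos sketch: kill a random instance, then assert the service stays healthy.
import random

random.seed()  # non-deterministic on purpose: chaos should be unpredictable

SERVICES = {
    "checkout": ["checkout-1", "checkout-2", "checkout-3"],
    "search":   ["search-1", "search-2"],
}

def terminate(instance: str) -> None:
    print(f"[chaos] terminating {instance}")  # stand-in for a real kill API

def health_check(service: str) -> bool:
    # Stand-in: in reality, probe the service's health endpoint.
    return len(SERVICES[service]) >= 1

def inject_fault() -> None:
    service = random.choice(list(SERVICES))
    victim = random.choice(SERVICES[service])
    SERVICES[service].remove(victim)
    terminate(victim)
    assert health_check(service), f"{service} did not survive losing {victim}"
    print(f"[chaos] {service} stayed healthy with "
          f"{len(SERVICES[service])} instance(s)")

inject_fault()
```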

What Stabilize Produces

The output of the Stabilize activity is a production environment that is:

  • Stable and resilient
  • Reliable and available
  • Supportable and maintainable
  • Secure against vulnerabilities and attacks

All of this while meeting the SLOs that underpin our SLAs. With Stabilize in place, we achieve the high level of business continuity needed to continuously deliver value to our customers.

Key Takeaways

  • Architect for operability early. Logging, telemetry, feature toggles, and recovery mechanisms must be built in from the start.
  • Have a disaster recovery strategy. Rehearse it often and automate where possible. Failures will happen.
  • Monitor non-functional requirements continuously. Reliability, performance, and security are not one-time concerns.
  • Understand SLIs, SLOs, and SLAs. They define the operational expectations and consequences for your system.
  • Use SRE for high availability targets. For five and four nines, Site Reliability Engineering brings essential expertise.
  • Inject faults deliberately. The highest maturity level means proactively testing your system’s resilience in production.