What is Respond? | SAFe DevOps Health Radar

Author
Romano Roth
I believe the next competitive edge isn’t AI itself, it’s the organisation around it. As Chief AI Officer at Zühlke, I work with C-level leaders to build enterprises that sense, decide, and adapt continuously. 20+ years turning this conviction into practice.

How do you proactively detect and fix production issues before they cause a business disruption? Respond is the SAFe DevOps Health Radar activity that answers exactly this question. In this video, I walk through what Respond involves and why it is essential for maintaining a stable production environment.

Where Respond Fits in the Pipeline

The SAFe DevOps Health Radar starts with the customer. Bright ideas are turned into hypothesis statements and epics. In Collaborate and Research we identify the real customer need. We create the minimal architecture needed, break epics into features, develop user stories, commit code, build deployable artifacts, test end-to-end, deploy to staging, and then deploy continuously to production. After verifying the production deployment and monitoring the environment, we arrive at Respond.

In the Respond step, we want to proactively detect and resolve production issues before they can cause any business disruption.

Why Respond Matters

Production issues happen. Even the big players like Google, Facebook, and Amazon experience outages. Production problems always affect the customer, and they tie up resources. When something breaks, your team has to create fixes and patches, redeploy, and retest. All of that reduces the flow of future value into production.

That is why proactively responding to incidents is critical.

Proactive Detection

In the Monitoring step (covered earlier in this series), we set up telemetry and logging systems. Because we have all of this telemetry data in place, we can create tolerance thresholds that alert us to dangerous conditions.

A good notification strategy is essential here. We should not alert for everything. The thresholds must be carefully evaluated so that we create notifications only for truly dangerous conditions. Otherwise, we risk alert fatigue, where the team stops paying attention because there are too many false alarms.
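One way to keep false alarms down is to alert only when a threshold is breached for several consecutive samples. The following is a minimal sketch of that idea, not a description of any particular monitoring product. The `Threshold` class, the `should_alert` function, the metric name, and all numeric values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Threshold:
    """Hypothetical tolerance threshold for one telemetry metric."""
    metric: str
    limit: float            # a sample above this value counts as a breach
    sustained_samples: int  # consecutive breaches required before alerting


def should_alert(samples: Iterable[float], threshold: Threshold) -> bool:
    """Return True only if the limit is breached for enough consecutive samples.

    Requiring a sustained breach is one simple way to ignore short spikes
    that would otherwise produce false alarms.
    """
    consecutive = 0
    for value in samples:
        if value > threshold.limit:
            consecutive += 1
            if consecutive >= threshold.sustained_samples:
                return True
        else:
            consecutive = 0  # the breach was interrupted, start counting again
    return False


# Illustrative values only: error rate in percent, sampled once per minute.
error_rate = Threshold(metric="error_rate_percent", limit=5.0, sustained_samples=3)

brief_spike = [1.0, 9.0, 1.2, 0.8, 1.1]
real_problem = [1.0, 6.5, 7.2, 8.9, 9.4]

print(should_alert(brief_spike, error_rate))   # False
print(should_alert(real_problem, error_rate))  # True
```

The single spike to 9.0 does not trigger a notification, while three consecutive breaches do. The right limit and duration depend on the system and have to be evaluated per metric.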

Disaster Recovery Rehearsal

Disaster will always strike in production. That is why we need to rehearse disaster recovery procedures on a regular basis. When a real incident occurs, the team must already know the procedures and be able to execute them quickly.

Cross-Team Collaboration

What we do not want is the blame game: the customer calls the service desk, the service desk sends the issue to the developer, the developer points to the product owner, the product owner sends it to the tester, and everyone is finger-pointing.

What we want instead is cross-team collaboration. When something happens in production, everybody works together to detect, analyze, and resolve the issue. After each incident, we conduct an incident post-mortem meeting where everyone involved identifies what can be improved so the incident either never recurs or can be resolved much faster next time.

Immutable Infrastructure

Having an immutable infrastructure in production is very helpful. In an immutable infrastructure, no one can make direct changes to production. All changes must go through the continuous delivery pipeline.

Infrastructure as Code supports this approach because all environment configuration is stored in version control as configuration files. This ensures that every change is traceable and reproducible.
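If the versioned configuration is the single source of truth, any difference between it and what is actually running points to a change that bypassed the pipeline. The sketch below illustrates such a drift check under simplifying assumptions: configurations are plain dictionaries, and the function names, keys, and values are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

_MISSING = object()  # marks a key that is absent from one of the configs


def fingerprint(config: Dict[str, Any]) -> str:
    """Stable SHA-256 hash of a configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_keys(versioned: Dict[str, Any], live: Dict[str, Any]) -> List[str]:
    """Return the top-level keys whose values differ between the two configs."""
    keys = set(versioned) | set(live)
    return sorted(
        k for k in keys
        if versioned.get(k, _MISSING) != live.get(k, _MISSING)
    )


# Illustrative data: what version control says versus what production reports.
versioned = {"replicas": 3, "image": "shop-api:1.4.2", "log_level": "info"}
live = {"replicas": 3, "image": "shop-api:1.4.2", "log_level": "debug"}

print(fingerprint(versioned) == fingerprint(live))  # False
print(drifted_keys(versioned, live))                # ['log_level']
```

Here someone changed `log_level` directly in production. In an immutable setup, the reaction would be to redeploy from version control rather than to patch the running environment.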

Everything Under Version Control

To respond effectively to incidents, everything must be under version control: all code, all infrastructure configuration, all tests, all test data, all requirements. Only then can we analyze incidents, see what changed, when, and why. And only then can we make fast rollbacks when needed.
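As a rough illustration of "what changed, when, and why", the sketch below filters a change history down to the changes made between the last known-good state and the incident. The `Change` record, the `changes_since` function, and the sample history are hypothetical. In practice this information would come from the version control system itself.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Change:
    """Hypothetical record of one versioned change (code, config, test data, ...)."""
    revision: str
    kind: str               # what changed, e.g. "code" or "infrastructure"
    committed_at: datetime  # when it changed
    reason: str             # why it changed (the commit message)


def changes_since(history: List[Change], last_good: datetime,
                  incident: datetime) -> List[Change]:
    """Return changes committed after last_good and up to the incident, oldest first."""
    suspects = [c for c in history if last_good < c.committed_at <= incident]
    return sorted(suspects, key=lambda c: c.committed_at)


# Illustrative history; revisions, dates, and messages are made up.
history = [
    Change("a1f3", "code", datetime(2024, 5, 1, 9, 0), "Add discount calculation"),
    Change("b7c9", "infrastructure", datetime(2024, 5, 2, 14, 30), "Lower connection pool size"),
    Change("c2d8", "test-data", datetime(2024, 5, 3, 8, 15), "Refresh checkout fixtures"),
]

suspects = changes_since(
    history,
    last_good=datetime(2024, 5, 1, 12, 0),
    incident=datetime(2024, 5, 2, 16, 0),
)
for change in suspects:
    print(change.revision, change.kind, change.committed_at.isoformat(), change.reason)
```

Only the infrastructure change `b7c9` falls inside the window, which narrows the analysis considerably. This works only if infrastructure and test data are versioned alongside the code.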

Rollback vs. Fix Forward

When a production issue occurs, we have two main strategies:

  • Rollback: Retrieve the previous working version from source control and redeploy it to production. This is fast and restores stability immediately.
  • Fix Forward: Analyze the problem as fast as possible, create a fix, and deliver that fix through the full continuous delivery pipeline into production.

Both strategies have their place, and the right choice depends on the severity of the issue and how quickly a fix can be developed.
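The trade-off between the two strategies can be sketched as a simple decision rule: fix forward if the fix is expected to arrive within the time the business can tolerate for that severity, otherwise roll back. The severity levels and the tolerable durations below are assumptions for illustration, not values from SAFe.

```python
from enum import Enum


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Assumed limits: how long the issue may stay in production, per severity.
TOLERABLE_MINUTES = {
    Severity.LOW: 240,
    Severity.MEDIUM: 60,
    Severity.HIGH: 15,
}


def choose_strategy(severity: Severity, estimated_fix_minutes: int) -> str:
    """Pick "fix-forward" if the fix is expected in time, otherwise "rollback"."""
    if estimated_fix_minutes <= TOLERABLE_MINUTES[severity]:
        return "fix-forward"
    return "rollback"


print(choose_strategy(Severity.HIGH, estimated_fix_minutes=45))  # rollback
print(choose_strategy(Severity.LOW, estimated_fix_minutes=45))   # fix-forward
```

A real decision also has to consider whether a rollback is safe at all, for example after a database schema change. The sketch deliberately ignores that.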

Session Recording

Systems are getting more and more complex, and users go through many different workflows. Session recording allows us to store everything a user does, so we can replay the entire session in a test or development environment to reproduce and debug issues.

When using session recording, you must pay close attention to data security, privacy, and retention policies, as you are recording a lot of user data.
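The following sketch shows the basic record-and-replay idea together with one privacy safeguard: sensitive fields are redacted before an event is stored. The `SessionRecorder` class, the list of sensitive field names, and the sample events are hypothetical. Only top-level fields are redacted, and retention is not covered.

```python
from typing import Any, Callable, Dict, List

# Assumed set of field names that must never be stored in a recording.
SENSITIVE_FIELDS = {"password", "card_number", "email"}


def redact(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the payload with sensitive top-level fields masked."""
    return {
        key: ("[REDACTED]" if key in SENSITIVE_FIELDS else value)
        for key, value in payload.items()
    }


class SessionRecorder:
    """Stores the user's actions in order so they can be replayed later."""

    def __init__(self) -> None:
        self._events: List[Dict[str, Any]] = []

    def record(self, action: str, payload: Dict[str, Any]) -> None:
        # Redaction happens before storage, so raw values are never kept.
        self._events.append({"action": action, "payload": redact(payload)})

    def replay(self, handler: Callable[[str, Dict[str, Any]], None]) -> None:
        """Feed every recorded event, in original order, to the given handler."""
        for event in self._events:
            handler(event["action"], event["payload"])


recorder = SessionRecorder()
recorder.record("login", {"user": "alice", "password": "s3cret"})
recorder.record("add_to_cart", {"product_id": 42, "quantity": 2})

# In a test environment the handler would drive the application under test.
recorder.replay(lambda action, payload: print(action, payload))
```

Replaying prints the two actions in order, with the password replaced by `[REDACTED]`. Redacting at recording time means a leaked recording exposes less, at the cost of not being able to reproduce issues that depend on the masked values.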

The Maturity Levels

The SAFe DevOps Health Radar provides a maturity assessment for Respond:

  • Sit: Customers find issues before we do. Resolving high priority issues is time-consuming and reactive. Customers have low confidence in our ability to recover from production issues.
  • Crawl: Operations owns production issues. Development involvement requires significant escalation. Teams blame each other in times of crisis.
  • Walk: Development and Operations collectively own the incident resolution process. Recovering from major incidents is reactive but a team effort.
  • Run: Our monitoring systems detect most issues before our customers do. Dev and Ops work proactively to recover from major incidents.
  • Fly: Our monitoring systems alert us to dangerous conditions based on carefully designed tolerance thresholds. Developers are responsible for supporting their own code and proactively issue fixes through the pipeline before users are affected.

What Respond Produces

The output of the Respond step is:

  • A stable production environment that ensures business continuity
  • A notification and alerting system based on carefully evaluated thresholds that alerts the team to dangerous conditions
  • The ability to release on demand, which is covered in the next video in this series

Key Takeaways

  • Production issues are inevitable. Even the biggest companies experience outages. Prepare for them.
  • Set up proactive detection. Use telemetry data and carefully designed thresholds to alert on dangerous conditions, not on everything.
  • Rehearse disaster recovery. Do not wait for a real incident to test your recovery procedures.
  • Collaborate across teams. Replace the blame game with cross-team incident resolution and post-mortem meetings.
  • Use immutable infrastructure. All changes should go through the continuous delivery pipeline, never directly in production.
  • Keep everything in version control. Code, configuration, tests, and requirements must all be versioned to enable fast analysis and rollbacks.