Once our features are deployed and verified in production, we need to keep a close eye on how they perform. Monitor is the SAFe DevOps Health Radar activity that focuses on tracking system performance, end-user behavior, incidents, and business value. In this video, I walk through what monitoring involves and why it is essential for making the right decisions about your features.
Where Monitor Fits in the Pipeline
The SAFe DevOps Health Radar starts with bright ideas from the customer or the business. We extract a hypothesis, create an epic, collaborate and research to identify the real customer need, design only as much architecture as is needed, and break the epic into features. We then develop user stories, commit code, build deployable packages, test them, deploy to staging, and finally deploy to the production environment. After verifying that everything works in production, we need to monitor what is happening.
Why We Monitor
Monitoring production means we can track features to understand system performance, identify incidents, observe end-user behavior, and measure the business value we deliver. We monitor because we want to ensure production runs as smoothly as possible. But there is more: we also want to validate the business hypothesis. Back in the hypothesis step, we defined what business value we expect. In the monitor step, we measure whether we actually deliver that value.
Full-Stack Telemetry
To monitor effectively, we use full-stack telemetry. If you recall the architect step from this series, that is where we architect for operability and decide what data needs to be logged. In the develop step, we implement that logging. In the monitor step, we collect the resulting log statements and feed them into a telemetry system so we can browse and analyze them.
It is important to log different types of data:
- Application data to track down technical issues and errors
- Business data to validate whether the business hypothesis holds true
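The split between application data and business data can be sketched with Python's standard `logging` module. This is only an illustration under stated assumptions: the `category` and `fields` keys, the helper names, and the example events are hypothetical and not part of SAFe or of any particular telemetry product.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # "category" and "fields" are hypothetical keys passed via `extra`.
            "category": getattr(record, "category", "application"),
            "fields": getattr(record, "fields", {}),
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str, handler: logging.Handler) -> logging.Logger:
    """Create a logger that writes JSON lines to the given handler."""
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


def log_application_error(logger: logging.Logger, message: str, **fields) -> None:
    """Application data: used to track down technical issues and errors."""
    logger.error(message, extra={"category": "application", "fields": fields})


def log_business_event(logger: logging.Logger, message: str, **fields) -> None:
    """Business data: used to check whether the business hypothesis holds."""
    logger.info(message, extra={"category": "business", "fields": fields})
```

A telemetry system could then filter on `category` to separate technical troubleshooting from business analysis, for example after calling `log_business_event(logger, "order completed", order_value=42.5)`.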
Dashboards and Visual Displays
Raw telemetry data is hard to read. That is why we use dashboards to visualize everything. These visual displays make it easy to interpret the information at a glance.
In these dashboards we can show:
- DevOps metrics such as last deployment, last outage, and average lead time
- End-user behavior to understand how features are being used
- Business value trends to track whether we deliver what we promised
It is important that the entire organization has access to these dashboards, not just the development team. Visibility across the organization enables better decisions.
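As a rough sketch of where the DevOps metrics on such a dashboard could come from, the following computes last deployment, time since the last outage, and average lead time. The `Deployment` record and the definition of lead time as "commit to production deployment" are assumptions made for illustration; your organization may define these metrics differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record: when a change was committed and when it went live."""
    committed_at: datetime
    deployed_at: datetime


def last_deployment(deployments: Sequence[Deployment]) -> datetime:
    """Timestamp of the most recent production deployment."""
    if not deployments:
        raise ValueError("no deployments recorded")
    return max(d.deployed_at for d in deployments)


def average_lead_time(deployments: Sequence[Deployment]) -> timedelta:
    """Mean time from commit to production deployment."""
    if not deployments:
        raise ValueError("no deployments recorded")
    total = sum((d.deployed_at - d.committed_at for d in deployments), timedelta())
    return total / len(deployments)


def time_since_last_outage(outages: Sequence[datetime], now: datetime) -> timedelta:
    """Elapsed time between the most recent outage and `now`."""
    if not outages:
        raise ValueError("no outages recorded")
    return now - max(outages)
```

The dashboard itself only has to render these values; keeping the calculations in small functions makes it easier to agree on a single definition across the organization.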
Federated Monitoring
Monitoring a single application in isolation is not enough. Applications have dependencies on other applications and on the underlying infrastructure. We need to consolidate all telemetry data into a federated monitoring platform that provides a holistic view.
Only with federated monitoring can we track down performance problems and issues across applications and across the infrastructure. This consolidated view is critical for understanding the full picture.
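To make the idea of a consolidated view concrete, here is a minimal sketch that merges event streams from several sources into one timeline and then looks at everything that happened around a given moment. The `Event` type, the source names, and the assumption that each stream is already sorted by timestamp are illustrative choices, not features of a specific monitoring platform.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Event:
    """Hypothetical telemetry event from one application or infrastructure layer."""
    timestamp: float  # seconds since some shared reference point
    source: str
    message: str


def federate(*streams: Iterable[Event]) -> Iterator[Event]:
    """Merge per-source streams (each already sorted by time) into one timeline."""
    return heapq.merge(*streams, key=lambda event: event.timestamp)


def events_around(timeline: Iterable[Event], timestamp: float, window: float) -> List[Event]:
    """Return all events, from any source, within `window` seconds of `timestamp`."""
    return [e for e in timeline if abs(e.timestamp - timestamp) <= window]
```

With a single timeline, a slow response in one application can be placed next to, say, a database or infrastructure event that occurred a moment earlier, which is hard to see when each system is monitored in isolation.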
AIOps: Artificial Intelligence for IT Operations
When we monitor multiple applications and infrastructure, the volume of data points, events, and alerts becomes overwhelming. We face a big data problem.
AIOps helps by:
- Aggregating data from all sources
- Correlating events across the entire application landscape
- Analyzing patterns to surface meaningful insights
- Identifying probable root causes so we can fix issues faster
- Detecting anomalies before they become incidents
AIOps tools visualize all dependencies in your application landscape and let you trace issues through the entire chain.
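Real AIOps tools rely on far more sophisticated models, but the basic idea behind anomaly detection can be sketched with a simple statistical stand-in: compare each new data point with a trailing window and flag it when it deviates strongly. The window size, the threshold, and the z-score approach itself are assumptions chosen for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def detect_anomalies(values: Sequence[float], window: int = 5, threshold: float = 3.0) -> List[int]:
    """Return indexes whose value deviates strongly from the preceding window.

    A point is flagged when its z-score against the previous `window` values
    exceeds `threshold`. The first `window` points are never flagged because
    there is no full baseline for them yet.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = pstdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change at all as an anomaly.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies
```

For a series of response times such as `[100, 102, 98, 101, 99, 100, 450, 101]`, only the spike at index 6 is flagged. Catching such a deviation early is what allows us to react before it turns into an incident.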
What Monitoring Produces
When we monitor effectively, we gain several capabilities:
- Feature tracking: We can see if features are used and how end users interact with them
- System performance: We observe how the system performs in production, including API response times and resource consumption
- Incident prevention and analysis: Monitoring helps us prevent incidents and analyze them when they occur
- Business value measurement: We can measure whether the hypothesis from the beginning of the pipeline holds true, enabling the business to decide whether to invest more in a feature, keep it, or remove it
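Two of the capabilities above, system performance and feature tracking, can be illustrated with a short sketch. It computes a response-time percentile using the nearest-rank method and the share of users who used a given feature. The function names and the `(user_id, feature)` event shape are hypothetical.

```python
import math
from typing import Iterable, Sequence, Tuple


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile, e.g. pct=95 for the p95 API response time."""
    if not values:
        raise ValueError("no values recorded")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def adoption_rate(events: Iterable[Tuple[str, str]], feature: str) -> float:
    """Fraction of all observed users who used `feature` at least once.

    `events` is an iterable of (user_id, feature_name) pairs.
    """
    all_users = set()
    feature_users = set()
    for user_id, feature_name in events:
        all_users.add(user_id)
        if feature_name == feature:
            feature_users.add(user_id)
    if not all_users:
        raise ValueError("no usage events recorded")
    return len(feature_users) / len(all_users)
```

A percentile is often more telling than an average here, because a few very slow requests can hide behind a healthy-looking mean.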
The Maturity Levels
The SAFe DevOps Health Radar provides a maturity assessment for Monitor:
- Sit: No feature-level production monitoring exists. Only infrastructure monitoring is in place.
- Crawl: Features only log faults and exceptions. Analyzing events involves manually correlating logs from multiple systems.
- Walk: Features log faults, user activity, and other events. Data is analyzed manually to investigate incidents and measure business value of features.
- Run: Full-stack monitoring is in place. Events can be correlated throughout the architecture. Data is presented through system-specific dashboards.
- Fly: Federated monitoring platform provides one-stop access to full-stack insights. Data is used to gauge system performance and business value.
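One common way to move from manually correlating logs to correlating events throughout the architecture is to attach a shared identifier to every log record that belongs to the same request. The sketch below assumes each record is a dictionary with hypothetical `correlation_id`, `timestamp`, and `system` keys; it is one possible technique, not something the Health Radar prescribes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def correlate(records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group log records by correlation ID, each group ordered by timestamp."""
    groups = defaultdict(list)
    for record in records:
        groups[record["correlation_id"]].append(record)
    return {
        correlation_id: sorted(items, key=lambda r: r["timestamp"])
        for correlation_id, items in groups.items()
    }
```

Each resulting group shows the path of a single request across systems, which replaces the manual log matching described at the Crawl level.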
Closing the Feedback Loop
The most important aspect of monitoring is that it closes the feedback loop. Back in the hypothesis step, we defined the business value we wanted to create. By tracking this business value in production, we enable the business to make the right decisions: Should we invest more into this feature or less? Should we enable this feature for all users or disable it entirely?
Monitoring is the crucial piece that allows us to build the right thing right.
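The feedback loop can be sketched as a comparison between the value we expected in the hypothesis step and the value we measure in production. Everything in this example is an assumption made for illustration: the `Hypothesis` fields, the `keep_fraction` threshold, and the mapping to a decision. In practice the business makes this call, and the data only informs it.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    INVEST_MORE = "invest more"
    KEEP = "keep"
    REMOVE = "remove"


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical record of the business value expected from a feature."""
    metric: str
    baseline: float  # value before the feature was released
    target: float    # value the hypothesis expects to reach


def evaluate(hypothesis: Hypothesis, measured: float, keep_fraction: float = 0.5) -> Decision:
    """Suggest a decision based on how much of the expected gain was achieved.

    `keep_fraction` is an illustrative threshold, not a SAFe rule.
    """
    expected_gain = hypothesis.target - hypothesis.baseline
    if expected_gain <= 0:
        raise ValueError("target must be greater than baseline")
    achieved = (measured - hypothesis.baseline) / expected_gain
    if achieved >= 1.0:
        return Decision.INVEST_MORE
    if achieved >= keep_fraction:
        return Decision.KEEP
    return Decision.REMOVE
```

For a hypothesis that expects a conversion rate to rise from 2.0 to 3.0 percent, a measured 3.2 percent would suggest investing more, while a measured 2.1 percent would suggest reconsidering the feature.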
Key Takeaways
- Use full-stack telemetry. Log both application data and business data to get a complete picture.
- Visualize with dashboards. Make monitoring data accessible and easy to interpret for the entire organization.
- Implement federated monitoring. A single application view is not enough. You need to see across all applications and infrastructure.
- Leverage AIOps. When data volume exceeds human capacity, use AI to aggregate, correlate, and detect anomalies.
- Measure business value. Monitoring is not just about uptime. It is about validating whether the business hypothesis is true.
- Close the feedback loop. Use monitoring insights to make informed decisions about which features to keep, improve, or remove.
