I’m hardly the first to notice there’s overlap... https://medium.com/devopslinks/how-to-monitor-the-sre-golden-signals-1391cadc7524 is a good starting point to read from. I haven’t seen these compressed into a single metric set yet, probably because I haven’t looked hard enough. Or because “DURSLEy” is too dumb for real pros.
- Duration: How long are things taking to complete?
- Utilization: How many resources are used?
- Rate: How many things are happening now?
- Saturation: How many resources are left?
- Latency: How long do things wait to start?
- Errors: Are there known problems?
- Yes: We’re done
These are popular metrics to monitor because they can be easily built up from existing sensors. They describe how a service is functioning, in data that is fairly easy to turn into information.
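As a rough illustration of how the six measurements fall out of ordinary sensor data, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Request` record, the `dursle` function, and the idea of counting "resources" as workers are hypothetical, and the list of requests must be non-empty.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    """One observed unit of work (hypothetical record, timestamps in seconds)."""
    queued_at: float    # when the request arrived
    started_at: float   # when work on it began
    finished_at: float  # when work on it ended
    ok: bool            # False if the request failed


def dursle(requests, window_s, busy_workers, total_workers):
    """Summarize one observation window using the definitions in the list above.

    Assumes `requests` is non-empty and that "resources" means workers.
    """
    return {
        # Duration: how long things take to complete once started
        "duration_s": mean(r.finished_at - r.started_at for r in requests),
        # Utilization: share of resources in use
        "utilization": busy_workers / total_workers,
        # Rate: how many things happened per second in this window
        "rate_per_s": len(requests) / window_s,
        # Saturation: how many resources are left
        "workers_free": total_workers - busy_workers,
        # Latency: how long things waited before starting
        "latency_s": mean(r.started_at - r.queued_at for r in requests),
        # Errors: count of known failures
        "errors": sum(1 for r in requests if not r.ok),
    }


sample = [
    Request(queued_at=0.0, started_at=1.0, finished_at=3.0, ok=True),
    Request(queued_at=1.0, started_at=2.0, finished_at=5.0, ok=False),
]
metrics = dursle(sample, window_s=10.0, busy_workers=3, total_workers=4)
print(metrics)
```

Nothing here knows whether a "request" matters to the business; it only describes how the machinery is behaving.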
In an ideal world, those metrics are measuring “things” and “resources” that are directly applicable to the business need. Sales made. Units produced.
In a less ideal world, machine-readable metrics are often used as a proxy for value, because they are easier to measure. CPU load consumed. Amount of traffic routed.
In the best of all possible worlds, the report writer is working directly with business objectives. CAPS is a metric set that uses business-level input to provide success indicators of a service, producing knowledge and wisdom from data and information.
- Capacity: How much can we do for customers now?
- Availability: Can customers use the system now?
- Performance: Are customers getting a good experience now?
- Scalability: How many more customers could we handle now?
These metrics present the highest value to the organization, particularly when they can be tied to insight about root cause and remediation. That is not easy to do, but it is far more valuable than yet another CPU metric.
Report writers can build meaningful KPIs and SLOs from CAPS metrics. KPIs and SLOs built from DURSLEy metrics are also useful, but they have to be used as abstractions of the organization’s actual mission.
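To make the CAPS-to-SLO step concrete, here is a small Python sketch for an imaginary ordering service. The inputs (synthetic order probes, checkout counts, an orders-per-hour ceiling) and the names `caps` and `meets_slo` are all assumptions invented for this example, not a standard API, and the targets are arbitrary.

```python
def caps(orders_per_hour_now, max_orders_per_hour,
         probes_ok, probes_total,
         checkouts_within_target, checkouts_total):
    """Compute CAPS indicators from business-level counts (hypothetical inputs)."""
    return {
        # Capacity: how much can we do for customers now
        "capacity": max_orders_per_hour,
        # Availability: share of synthetic order attempts that succeeded
        "availability": probes_ok / probes_total,
        # Performance: share of checkouts that met the experience target
        "performance": checkouts_within_target / checkouts_total,
        # Scalability: how many more orders per hour we could take on
        "scalability": max_orders_per_hour - orders_per_hour_now,
    }


def meets_slo(value, target):
    """An SLO here is just 'the indicator is at or above its target'."""
    return value >= target


indicators = caps(
    orders_per_hour_now=350, max_orders_per_hour=500,
    probes_ok=99, probes_total=100,
    checkouts_within_target=180, checkouts_total=200,
)
print(indicators)
print("availability SLO met:", meets_slo(indicators["availability"], 0.995))
print("performance SLO met:", meets_slo(indicators["performance"], 0.90))
```

The arithmetic is trivial on purpose. The hard part is getting inputs that count customers and orders rather than CPU cycles.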
Examples: the number of tents deployed to a disaster area is a CAPS metric, but any measure of resources consumed by deploying those tents is a DURSLEy metric. Synthetic transactions showing ordering is possible: CAPS. Load metrics showing all components are idle: DURSLEy.