Sunday, March 4, 2018

Metrics in Splunk, and Observability

Also posted as a Twitter thread
  • I've got some thoughts about Splunk and metrics for observability...
  • The event-first Splunk can now store metrics efficiently. That has potential: 1 dashboard, a single glass of pain.
  • I'm excited to see annotations and mcatalog; I'm hoping it allows resolution of a nasty problem with multi-source metric comparison.
  • Metrics are quantitative. "Your volume has N bytes free". Good? Bad? Quantitative metrics are almost entirely useless for decisions.
  • (I actually think they are useless. A triggered metric like "DISK FULL, 10 periods" is an event, not a metric. Splitting hairs.)
  • Decisions from metrics need qualitative context. "Allocate more space now or later?" "How much more?" "What about budget & schedule?"
  • Quantitative data, qualitative context, quantitative decision. If the context is only in humans, then humans need training to use it.
  • "Fellow human, I teach you tool's contextual framework. It emits X metric, Y units, Z interval. Normal = A-S today. If X>N, runbook!"
  • Encode that into a KPI? Hasn't improved anything. Still breaks when a change makes the old normal wrong. Human has to know context to fix.
  • Compare many KPIs? Not even feasible without qualitative metrics. "Q: Need more storage?" Looks at 4-tier hybrid hierarchy, "A: ???"
  • In Metrics Store's catalog, seems that unit size is unknown, but there’s periodicity & granularity? If the source gathered them?
  • Why don't sources just send context? Tools should compute useful values & compare metrics qualitatively. "Tier 3 is 95% full."
  • Contextual decisions could be automated. "Usage will exceed capacity during your vacation, I think we should buy more space now."
  • Data system problems could be seen. "Dashboard expects 15 metrics/period, now getting 3 from 1/6 of probes, & 1 OutOfCheese Error."
  • Answer to "Why don't you just" questions is "Why should I?" Splunk can answer that. Where's CIM for Metrics? Real attributes and KPIs?
  • Determining importance of a metric needs context. "Disk full" is pitifully primitive. A service provider or vendor knows better KPIs.
  • Sure would be nice to have vendor-specific tools for detailed analysis and role-specific tools with Splunk-aware metrics.
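
A rough sketch of what "sources send context" could look like, in Python. Nothing here is a Splunk API: VolumeSample, describe, days_until_full, recommend and data_health are hypothetical names, the forecast assumes plain linear growth between two samples, and the numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VolumeSample:
    """One reading from a hypothetical source that ships context with the number."""
    name: str
    capacity_bytes: int  # context: total size, so "N bytes free" can be judged
    free_bytes: int      # the raw quantitative metric
    day: float           # sample time in days (illustrative unit)


def percent_full(sample: VolumeSample) -> float:
    """Turn 'N bytes free' into a value a human can compare across volumes."""
    used = sample.capacity_bytes - sample.free_bytes
    return 100.0 * used / sample.capacity_bytes


def describe(sample: VolumeSample) -> str:
    """Qualitative statement in the style of 'Tier 3 is 95% full.'"""
    return f"{sample.name} is {percent_full(sample):.0f}% full."


def days_until_full(older: VolumeSample, newer: VolumeSample) -> Optional[float]:
    """Linear extrapolation (an assumption); None if usage is flat or shrinking."""
    elapsed = newer.day - older.day
    consumed = older.free_bytes - newer.free_bytes
    if elapsed <= 0 or consumed <= 0:
        return None
    return newer.free_bytes / (consumed / elapsed)


def recommend(days_left: Optional[float], horizon_days: float) -> str:
    """Buy now if the volume is forecast to fill within the horizon (e.g. a vacation)."""
    if days_left is not None and days_left <= horizon_days:
        return "buy more space now"
    return "buy more space later"


def data_health(expected: int, received: int) -> str:
    """Flag a data-system problem: fewer metrics arriving than the dashboard expects."""
    if received >= expected:
        return "ok"
    return f"Dashboard expects {expected} metrics/period, now getting {received}."


if __name__ == "__main__":
    older = VolumeSample("Tier 3", capacity_bytes=1000, free_bytes=100, day=0.0)
    newer = VolumeSample("Tier 3", capacity_bytes=1000, free_bytes=50, day=10.0)
    print(describe(newer))
    print(recommend(days_until_full(older, newer), horizon_days=14))
    print(data_health(expected=15, received=3))
```

The point of the sketch is that capacity, interval and expected counts travel with the data, so the comparison does not depend on a trained human remembering what normal looks like.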