Hi Jamie,
Thank you for your reply. We have a local cluster that gathers absolutely all telemetry (we do not filter out anything!), and we then alert on only a few important things; this took a while to get right. I like your point of view from the SRE perspective, which we tend to ignore sometimes.
Just FYI, the community seems to recognize three different types of "important" metrics, and that matches what you are saying:
1. The USE (Utilization / Saturation / Errors): great for low-level metrics and often used in the context of performance engineering and root cause analysis = SDLC school
2. The RED (Rate / Errors / Duration): focus on the number of requests served, the number of failed requests, and how long requests take; useful for many application-level cases = SRE school
3. USE+RED combined: latency, traffic, errors, and saturation (the "four golden signals" described in Google's SRE book)
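
To make the RED part concrete, here is a minimal sketch of how the three numbers could be derived from raw request records over a time window. The `Request` class and the `red_summary` function are made-up names for illustration only, not part of any monitoring library, and in practice these values would normally be computed by the metrics backend rather than by hand.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the request took, in seconds
    failed: bool       # True if the request ended in an error


def red_summary(requests, window_s):
    """Return (rate, error_rate, mean_duration) for one observation window.

    rate          -- requests served per second
    error_rate    -- failed requests per second
    mean_duration -- average request duration in seconds
    """
    total = len(requests)
    if total == 0:
        # Nothing was served in this window.
        return 0.0, 0.0, 0.0
    failed = sum(1 for r in requests if r.failed)
    mean_duration = sum(r.duration_s for r in requests) / total
    return total / window_s, failed / window_s, mean_duration


# Hypothetical sample: four requests observed in a 2-second window.
sample = [
    Request(0.25, False),
    Request(0.5, True),
    Request(0.25, False),
    Request(1.0, False),
]
rate, error_rate, mean_duration = red_summary(sample, window_s=2.0)
```

A mean is used for Duration only to keep the sketch short; percentiles are usually more informative for latency.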
Cheers