Scaling metrics and metrics governance at enterprise level

Greenhorn

Posts: 12

posted 1 year ago


Welcome, Jamie!

One of the challenges I see when adopting observability and metrics is how to align them across services in enterprise companies with 100+ services and products, with teams in different geographies using different programming languages.

What is your perspective on the best way to align around common metrics and conventions in large companies? What are the best 3 techniques/mechanisms to ensure proper governance and alignment?

Thank you,
Lucian

Jamie Riedesel

Author

Posts: 26

posted 1 year ago


There are several issues at play when revising metrics and observability practices in a large organization. Frankly, I spent several hundred pages talking about the process of doing that. However, the main issues are:

Gaining consensus on the goals of the metrics or observability system. Without consensus, there is no alignment, and without alignment, there can only be local improvements.

The capabilities of your observability systems are the top limiting factor for how generalized a solution you can agree to. If you can change your capabilities (build new or buy new), that changes the conversation in radical ways.

Culture of the company. If the company runs old-school VM-based infrastructure, taking observability practices from entirely containerized/Kubernetes-based environments will take some translation.

Finally, for large companies, making changes at this level is definitely not in the scope of the average individual contributor. You need to be Principal, Staff, Architect or equivalent to start the conversation, and you will need management backing for this project to be a success. Only in the most engineering-driven companies can a super-senior engineer force this kind of change on that many teams. This is hard work, and can absolutely take a year or more (especially if you're planning to build/buy a new system).

The first project phase is to gain consensus that observability is actually a problem. Without that consensus, nothing will actually change. If you are a lower-level engineer who is feeling the pain, pitch the problems to more senior engineers, and work with them to gain management buy-in.

After you have that consensus, get management buy-in that this problem needs a solution. A project like this will touch every single in-use codebase in the company, so this may need to come from quite high up the org chart. This is where the consensus building in the previous phase pays off; hopefully you've gotten a large enough critical mass of stakeholders to convince the budgetary powers-that-be to make room for this kind of project.

Next, build the team to specify the problem space and workshop solutions. This will be a cross-disciplinary team, especially in a large company with that many codebases. If there are different programming languages in use, this gets extra fun. Languages that are heavily used in container spaces, like Go, have significantly different support than older languages like PHP, which constrains your solutions. This is also where you figure out the limitations you're working under. If those limitations are too tight for your goals (perhaps the metrics system you already have is severely cardinality-constrained, and that's where most of the problems are being felt), then maybe a build/buy of a new system needs to get planned.
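To make the cardinality point concrete, here is a rough sketch of the kind of back-of-the-envelope check that team might run. It assumes a label-based metrics system where every distinct combination of label values becomes its own time series; the label names, counts, and budget are all made up for illustration, and the product is only an upper bound since not every combination will actually occur.

```python
from math import prod

def estimated_series(distinct_values_per_label):
    """Upper bound on time series for one metric.

    Takes a dict of label name -> number of distinct values seen for that
    label. Assumes (pessimistically) that every combination shows up.
    """
    return prod(distinct_values_per_label.values())

def over_budget(distinct_values_per_label, budget):
    """True if the worst-case series count exceeds the agreed budget."""
    return estimated_series(distinct_values_per_label) > budget

# Hypothetical labels for a request-latency metric.
labels = {"region": 5, "status_code": 4, "endpoint": 50}
print(estimated_series(labels))        # 5 * 4 * 50 = 1000

# Adding a per-customer label multiplies everything that came before it.
labels_with_customer = dict(labels, customer_id=10_000)
print(estimated_series(labels_with_customer))   # 10000000
print(over_budget(labels_with_customer, budget=1_000_000))
```

If the worst-case number is far past what the current system can hold, that is a sign the limitation is in the platform rather than in how teams name things.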

Once you have the problem space, determine build/buy decisions. This is where big company planning comes into play. You'll need programmer capacity to build all of the library changes needed to support your new system, operator time to potentially build new telemetry systems, and possibly approved budget to buy new software/SaaS-subscriptions.

After you know where you're going, it's down to the brass tacks of developing a metrics schema that works for everyone, agreeing on retention/aggregation periods, and deciding how to handle exceptions. Every org is different, and this is high politics at its worst. Every team in the company is already doing all of these things and is being asked to change to adopt the centralized standards. Teams that were already pretty close to the new standard won't have much trouble. Teams that were well off the new standard will have more work ahead of them and may stall/stonewall to avoid doing all the work.
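As an illustration of what an agreed schema can look like once it is written down, here is a minimal sketch of a check teams could run against their own metric definitions. Everything in it is an assumption for the sake of the example: the naming pattern, the required labels, the retention cap, and the `exempt` flag for approved exceptions are placeholders for whatever your organization actually agrees on.

```python
import re
from dataclasses import dataclass

# Hypothetical convention: <team>.<service>.<measurement>_<unit>, lowercase.
NAME_PATTERN = re.compile(
    r"^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*_(seconds|bytes|total|ratio)$"
)
REQUIRED_LABELS = {"region", "environment"}   # assumed org-wide labels
MAX_RETENTION_DAYS = 395                      # assumed retention cap

@dataclass
class MetricDefinition:
    name: str
    labels: set
    retention_days: int
    exempt: bool = False   # an approved exception to the standard

def validate(metric):
    """Return a list of violations; an empty list means compliant."""
    if metric.exempt:
        return []
    problems = []
    if not NAME_PATTERN.match(metric.name):
        problems.append(f"name '{metric.name}' does not match the naming convention")
    missing = REQUIRED_LABELS - metric.labels
    if missing:
        problems.append("missing required labels: " + ", ".join(sorted(missing)))
    if metric.retention_days > MAX_RETENTION_DAYS:
        problems.append(f"retention of {metric.retention_days} days exceeds {MAX_RETENTION_DAYS}")
    return problems

good = MetricDefinition("payments.checkout.request_duration_seconds",
                        {"region", "environment", "endpoint"}, 90)
legacy = MetricDefinition("CheckoutLatency", {"region"}, 30)
for m in (good, legacy):
    print(m.name, validate(m))
```

A check like this does not settle the politics, but it gives teams that are far from the standard a concrete list of what needs to change, and the exemption flag gives exceptions an explicit, visible home.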

Finally, keep at it. There are always stragglers. Projects of this scope tend to run long enough that, by the time the project is done, the state of telemetry has changed a lot again, and you might need to go back to the beginning. So, plan for ongoing maintenance from the get-go.

"With great responsibility, comes great paperwork."
- From the SOC2 Testament, Book of Traceability.

Lucian Revnic

Greenhorn

Posts: 12

posted 1 year ago


Many thanks, Jamie, for the detailed perspective and constructive suggestions.
