Alex Podobnik
·
Mar 31, 2026
Observability is not monitoring. Monitoring tells you when something is wrong. Observability lets you ask arbitrary questions about your systems, including ones you didn't think to ask before the incident, and actually get answers. One gives you a dashboard that goes red. The other gives you a system you actually understand.
Most teams have dashboards. Very few have observability. This playbook documents the approach NextLink Labs takes with every infrastructure engagement: a deliberate, maturity-based progression through Grafana Cloud's full product suite, building real observability rather than a wall of panels nobody reads during an outage.
The guide is structured around five levels. Most teams want to skip straight to Level 3 or 4. The ones that do end up with traces that can't be correlated, alerts firing on the wrong signals, and SLOs built on unreliable data. The levels exist in this order for a reason. Work through them in sequence; each section covers the core Grafana Cloud product(s) involved and the concrete outcome you gain by completing that level.
The diagram below shows how all Grafana Cloud components connect: signal sources on the left feed into Grafana Alloy, which routes telemetry into the managed storage and analysis backends in Grafana Cloud. Dashboards, alerting, and the AI layer sit on top of that data.
Most teams think of Level 1 as the easy part. It's not. The decisions you make here about what to collect, how to label it, and how to structure your dashboards will follow you through every subsequent level. Get them right and everything compounds nicely. Get them wrong and you'll be fixing them under pressure at the worst possible time.
Alloy is the collector. Every signal in this stack (metrics, logs, traces, and profiles) flows through it. It replaced the Grafana Agent, and the move to a component-based pipeline model was a significant improvement. At Level 1 you're only using it for metrics, but how you configure it now determines how much work Levels 2 and 3 will take. By Level 3, the same Alloy instance picks up logs and traces too: one pipeline, three signals. Getting the foundation right is what makes that expansion painless.
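A minimal Level 1 Alloy pipeline might look like the sketch below. The endpoint URL, credentials, target address, and label values are placeholders, not real values; adapt them to your stack.

```alloy
// Scrape application metrics on a fixed interval. Labels set here
// (service, env) become the schema every later level depends on.
prometheus.scrape "app" {
  targets = [
    { "__address__" = "app:8080", "service" = "checkout", "env" = "prod" },
  ]
  scrape_interval = "15s"
  forward_to      = [prometheus.remote_write.grafana_cloud.receiver]
}

// Ship everything to the managed Mimir backend in Grafana Cloud.
prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = "https://prometheus-prod-XX.grafana.net/api/prom/push"

    basic_auth {
      username = sys.env("GCLOUD_PROM_USER")
      password = sys.env("GCLOUD_PROM_TOKEN")
    }
  }
}
```

The component references (`prometheus.remote_write.grafana_cloud.receiver`) are what make the pipeline model composable: at Level 2 and 3 you add log and trace components that forward through the same structure.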
Mimir is the managed Prometheus backend inside Grafana Cloud. If you already use Prometheus, your config, recording rules, and PromQL queries carry over without changes. The practical difference is that you stop worrying about federation, retention, and scaling and start actually using the data instead.
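Because Mimir speaks the Prometheus rule format natively, an existing recording rule file uploads unchanged. A hypothetical example, with illustrative metric and label names:

```yaml
groups:
  - name: checkout-aggregations
    interval: 1m
    rules:
      # Pre-compute the per-service request rate so dashboards stay fast
      # even as raw series counts grow.
      - record: service:http_requests:rate5m
        expr: sum by (service, env) (rate(http_requests_total[5m]))
```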
The most common Level 1 mistake we see is label cardinality. A team adds a dynamic label like user_id or request_id to their HTTP metrics, and within 24 hours their series count has exploded: tens of thousands of unused, high-cardinality series driving up costs with zero operational value. Mimir will ingest it. Your bill will reflect it. Those values belong in logs and traces, not metric labels. The other habits worth building early, consistent service names, environment labels, and metric naming conventions, compound through every subsequent level.
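If a high-cardinality label has already leaked into your metrics, one stopgap is to drop it in Alloy before it reaches Mimir. A sketch, with hypothetical label names:

```alloy
// Strip high-cardinality labels at the collector. This caps the damage,
// but the real fix is removing them at the instrumentation source.
prometheus.relabel "drop_high_cardinality" {
  forward_to = [prometheus.remote_write.grafana_cloud.receiver]

  rule {
    action = "labeldrop"
    regex  = "user_id|request_id|session_id"
  }
}
```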
Dashboards are where most teams start and where most teams stop. Level 1 is about building them correctly so they're still useful six months later when the person who built them has moved on.
Level 1 is done when your team can answer "is the system healthy right now" without opening five different tools. You've got a scalable metrics backend, dashboards that reflect reality, and the label and naming discipline that every subsequent level depends on. Every team that's skipped this foundation has eventually come back to fix it, usually under pressure, with worse data than they'd have started with.
Knowing that your error rate spiked is useful. Knowing why it spiked is what actually ends the incident. Level 2 closes that gap. When your error rate dashboard spikes, you should be able to pivot to the relevant log stream in one click, not by opening a separate tool and starting a manual search. We've sat in incident bridges where engineers are tabbing between Grafana, Kibana, and an SSH session trying to manually piece together what happened. Level 2 ends that.
Loki indexes only labels, not log content, applying to logs the same idea Prometheus applied to metrics. That single decision makes it dramatically cheaper to run at scale than Elasticsearch, and it means your log labels and metric labels can be identical, which is what makes one-click correlation work. One thing worth saying plainly: Loki isn't a replacement for a SIEM or a compliance log archive. Teams that try to run it that way end up with both a cost problem and a query performance problem. Use it for operational logs, the ones you need during an incident, and put compliance data somewhere else.
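In practice that means queries select a stream by labels first and only then filter content. A hypothetical LogQL query, reusing the same service and environment labels as the metrics:

```logql
{service="checkout", env="prod"} |= "timeout" | json | level = "error"
```

The label matcher in braces is the cheap, indexed part; the line filter and parsed-field filter that follow scan only the streams it selected.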
A few configuration choices matter a lot here: retention periods, the label schema, and ingestion limits all deserve deliberate decisions before log volume grows.
For the Alloy side of this: add a file log or Kubernetes pod log component alongside your existing metrics scrape components, apply the same label schema, and push to Loki. The critical thing is that your log labels match your metric labels exactly, same service names, same environment values, same region identifiers. Even a slight divergence breaks the one-click pivot from a metric spike to the relevant log stream. Parse and transform logs at the Alloy layer, not in your application code, so you can change your log schema without a service deployment.
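As a sketch of that Alloy setup, with placeholder paths, label values, and endpoint URL:

```alloy
// Discover and tail the application log files, attaching the SAME
// service/env labels the metrics pipeline uses.
local.file_match "app_logs" {
  path_targets = [
    { "__path__" = "/var/log/app/*.log", "service" = "checkout", "env" = "prod" },
  ]
}

loki.source.file "app" {
  targets    = local.file_match.app_logs.targets
  forward_to = [loki.process.app.receiver]
}

// Parse JSON at the collector so a log schema change never requires
// a service deployment.
loki.process "app" {
  stage.json {
    expressions = { level = "level" }
  }
  stage.labels {
    values = { level = "" }
  }
  forward_to = [loki.write.grafana_cloud.receiver]
}

loki.write "grafana_cloud" {
  endpoint {
    url = "https://logs-prod-XX.grafana.net/loki/api/v1/push"
  }
}
```

Promoting only a bounded field like `level` to a label keeps Loki's index small; everything else stays in the log line where it's cheap.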
What changes at Level 2 isn't the technology, it's how your team works. Engineers stop treating logs as a separate investigation step and start working with metrics and logs as one surface. Incidents that used to take 30 minutes to diagnose get resolved in under 5. That's where Level 2 pays for itself.
Traces are what most teams instrument last, which is the wrong order. By the time your architecture is mature, adding tracing is significantly harder than building it in from the start. What traces give you is the journey of a single request through your entire system: every service it touched, every database query it made, every external call it waited on. Combined with what you have from Levels 1 and 2, that's the ability to follow a request from browser click to database row and understand where time was actually spent.
Tempo does one thing differently from every other tracing backend: it doesn't index traces. Most backends charge based on indexed spans, which forces sampling decisions that mean you miss the traces you actually needed. Tempo stores everything and you find traces through correlation from metrics and logs. That shift is what makes long retention economically viable.
What this looks like in practice: a metric dashboard shows elevated latency. You click an exemplar and land directly on a representative trace. The trace shows 80% of the request time sitting in a database span. You click that span and pivot to the Loki logs for that service at that exact timestamp. There's a slow query warning right there. Root cause found in under three minutes, without switching tools or writing a single manual query.
Traces tell you which service is slow. Pyroscope tells you which function inside that service is causing it. It runs continuously in production, which sounds scary but the overhead is low enough that most teams forget it's there. The flame graphs it produces during an incident are usually worth more than anything else in your observability stack.
Level 3 changes how your engineers approach debugging. Instead of forming a hypothesis and hunting for evidence, you start from the evidence and work backwards. The question stops being "where should I look" and becomes "what does the data show." It's hard to appreciate until you've experienced it.
Levels 1 through 3 give you the visibility to understand your systems. Level 4 is where you formalize what good looks like and build the operational machinery to defend it. Service Level Objectives define your reliability targets. Alerting fires when those targets are at risk. OnCall ensures the right person gets notified. This is also the level where most teams discover they've been doing alerting wrong: hundreds of alerts, no hierarchy, no escalation paths, and engineers who've learned to ignore the noise.
Grafana's alerting engine is Prometheus-compatible and deeply integrated with the full Grafana Cloud stack. Alert rules can query any data source, including Mimir, Loki, and Tempo, and fire via Alertmanager routing to any notification channel.
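A Prometheus-compatible alert rule over Mimir data might look like the following. The service name and thresholds are illustrative:

```yaml
groups:
  - name: checkout-availability
    rules:
      - alert: HighErrorRate
        # Fire when more than 5% of requests fail over a 5-minute window,
        # sustained for 5 minutes.
        expr: |
          sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{service="checkout"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Checkout error rate above 5%"
```

Note that the rule alerts on a symptom (error rate) rather than a cause (a specific host or pod), which is what keeps alert volume proportional to user impact.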
OnCall is where alerts become someone's problem to solve. It manages schedules, escalation chains, and integrations with Slack, PagerDuty, and SMS. The part most teams skip is building the escalation logic before they need it. A schedule with only one person on it isn't an escalation policy.
Most teams have never written down what "good" means for their services. SLOs force that conversation. Grafana Cloud's built-in SLO feature lets you define objectives, track error budgets, and alert on burn rate. The error budget is the most useful part: it gives product and engineering a shared number to argue about instead of arguing about feelings. A 99.9% availability SLO over 30 days gives you about 43 minutes of allowable downtime per month; if you've already burned 40 of those minutes, every deployment is a real conversation.
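The arithmetic behind that error budget is worth making concrete. A minimal sketch:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowable downtime, in minutes, for an availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_spent(downtime_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the error budget already consumed."""
    return downtime_minutes / error_budget_minutes(slo, window_days)

# A 99.9% SLO over a 30-day window allows 43.2 minutes of downtime...
print(round(error_budget_minutes(0.999), 1))  # 43.2
# ...so 40 minutes of downtime has consumed ~93% of the budget.
print(round(budget_spent(40, 0.999), 2))      # 0.93
```

Each extra nine shrinks the budget by a factor of ten: 99.99% over the same window leaves only about 4.3 minutes.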
Level 4 is where observability stops being an engineering concern and becomes an organizational one. The clearest sign a team has actually internalized it is when reliability stops being debated in incident retrospectives and starts getting planned for in sprint planning.
The first four levels are reactive: you observe what's happening and respond. Level 5 is where that flips. You validate system behavior before users encounter it, you have visibility into the browser and frontend experience, and you catch performance regressions in CI/CD before they reach production. Teams at Level 5 fix problems their users never see. Most engineering teams never get here, not because it's technically hard, but because Level 4 never gets finished properly and there's always something more urgent. That's the argument for building this way from the start.
Synthetic Monitoring runs scheduled probes against your endpoints from Grafana's global probe network, HTTP checks, DNS resolution, TCP connectivity, ping, and scripted browser tests, all running continuously from the perspective of a user in any geographic region.
k6 is Grafana's load testing tool, and the integration with Grafana Cloud is what makes it worth using here. Your load test results show up as metrics in Mimir alongside your production data, when you're running a test, you're watching the same dashboards your team uses during incidents, not a separate tool.
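A minimal k6 script, with an illustrative URL and threshold, that fails the run outright when p95 latency regresses:

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 50,          // 50 concurrent virtual users
  duration: '2m',
  thresholds: {
    // Fail the test run if 95th-percentile latency exceeds 500 ms.
    http_req_duration: ['p(95)<500'],
  },
};

export default function () {
  const res = http.get('https://checkout.example.com/api/health');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```

Because a failed threshold gives the run a non-zero exit code, the same script can gate a CI/CD pipeline.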
Most backend observability tells you what your servers did. Faro tells you what your users experienced. It's a browser SDK that captures errors, performance data, and traces from the frontend and feeds them into the same Grafana Cloud stack as everything else. The most useful thing it does is connect a frontend user action to the backend trace it triggered.
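Wiring Faro up is a single initialization call in the frontend bundle, assuming the `@grafana/faro-web-sdk` package. The collector URL and app metadata below are placeholders:

```javascript
import { initializeFaro } from '@grafana/faro-web-sdk';

initializeFaro({
  // Grafana Cloud collector endpoint for this stack (placeholder).
  url: 'https://faro-collector-prod-XX.grafana.net/collect/abc123',
  app: {
    name: 'checkout-frontend',
    version: '1.0.0',
    environment: 'prod',
  },
});
```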
At Level 5, your team stops reacting and starts preventing. Synthetic monitoring catches endpoint failures before real users hit them. k6 catches performance regressions before they reach production. Faro surfaces browser-side issues that backend metrics can't see. Pyroscope shows you continuously where your services are actually spending time. Between them, you're finding and fixing problems before your users see them. That's the goal.
The table below summarizes the five levels covered in this playbook. Most organizations arrive at NextLink Labs at Level 1 or 2. The goal of an engagement is to reach Level 4 where reliability is formalized and operationally defended, and to have a clear path to Level 5.
What the table doesn't show is compounding. Level 3 traces are only useful if your Level 1 metric labels are consistent enough to correlate them. Level 4 SLOs are only meaningful if your Level 2 log data is structured enough to explain burn rate events. Skipping levels doesn't save time. You're taking on debt you'll pay later, usually with worse data and more pressure.
One question we get regularly is why this playbook uses Grafana Cloud rather than a self-hosted stack. When you run your own Grafana, Mimir, Loki, and Tempo, a real portion of your engineering time goes into operating those systems rather than using them. We've seen teams spend weeks on Mimir compaction issues, Loki retention configuration, and Tempo backend scaling. Grafana Cloud removes that entirely. Pre-built integrations mean you're collecting real signals within minutes of signing up, not days. That speed matters when you're trying to move through these levels while also shipping products.
NextLink Labs is a platform engineering and DevOps consultancy. We work with engineering teams who are modernizing infrastructure or trying to build real observability rather than just dashboards. The practices in this playbook come from real client engagements, not theory.
If you want an honest assessment of where your stack sits on this maturity model, or help moving through the levels faster than you could on your own, reach out at nextlinklabs.com.