Introduction
Observability is not monitoring. Monitoring tells you when something is wrong. Observability lets you ask arbitrary questions about your systems, including ones you didn't think to ask before the incident, and actually get answers. One gives you a dashboard that goes red. The other gives you a system you actually understand.
Most teams have dashboards. Very few have observability. This playbook documents the approach NextLink Labs takes with every infrastructure engagement: a deliberate, maturity-based progression through Grafana Cloud's full product suite, building real observability rather than a wall of panels nobody reads during an outage.
The guide is structured around five levels. Most teams want to skip straight to Level 3 or 4. The ones that do end up with traces that can't be correlated, alerts firing on the wrong signals, and SLOs built on unreliable data. The levels exist in this order for a reason. Work through them in sequence; each section covers the core Grafana Cloud product(s) involved and the concrete outcome you gain by completing that level.
Reference Architecture
The diagram below shows how all Grafana Cloud components connect: signal sources on the left feed into Grafana Alloy, which routes telemetry into the managed storage and analysis backends in Grafana Cloud. Dashboards, alerting, and the AI layer sit on top of that data.
Level 1: Visibility
Most teams think of Level 1 as the easy part. It's not. The decisions you make here about what to collect, how to label it, and how to structure your dashboards will follow you through every subsequent level. Get them right and everything compounds nicely. Get them wrong and you'll be fixing them under pressure at the worst possible time.
Grafana Alloy
Alloy is the collector. Every signal in this stack (metrics, logs, traces, and profiles) flows through it. It replaced the Grafana Agent, and the move to a component-based pipeline model was a significant improvement. At Level 1 you're only using it for metrics, but how you configure it now determines how much work Levels 2 and 3 will take. By Level 3, the same Alloy instance picks up logs and traces too: one pipeline, three signals. Getting the foundation right is what makes that expansion painless.
Deploy Alloy close to the source: sidecar in containers, DaemonSet in Kubernetes, system service on bare metal. Nobody scrapes everything remotely and calls that a collection strategy.
Version your Alloy config in Git and deploy it through CI/CD. We've had clients where staging and production were collecting completely different signals because someone edited a config file directly on a server. That's a process problem, not a monitoring problem.
The label schema you define here flows into every metric, log line, and trace span across your entire stack. Get it right before you have many services. Retrofitting it later is one of those jobs that always takes three times longer than anyone expects.
Build reusable collection components rather than one big monolithic config. A standard Kubernetes pod scrape block you write once should work across every cluster. If you're copying and pasting config between environments, you'll eventually have a drift problem.
Test config changes in staging before rolling to production. A wrong label or misconfigured scrape interval can explode your series count overnight, and finding that out at 2am is avoidable.
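A minimal sketch of this discipline in Alloy's configuration language, assuming a Kubernetes deployment. The remote-write endpoint, credentials, and label values are placeholders; adapt them to your environment.

```alloy
// Hypothetical sketch: one reusable scrape pipeline with a shared label schema.
// The endpoint URL, environment variables, and label values are placeholders.
discovery.kubernetes "pods" {
  role = "pod"
}

prometheus.scrape "pods" {
  targets         = discovery.kubernetes.pods.targets
  scrape_interval = "15s"
  forward_to      = [prometheus.remote_write.grafana_cloud.receiver]
}

prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = "https://prometheus-prod-example.grafana.net/api/prom/push"
    basic_auth {
      username = env("GRAFANA_CLOUD_METRICS_ID")
      password = env("GRAFANA_CLOUD_API_KEY")
    }
  }
  // The label schema defined once here follows every signal through
  // Levels 2 and 3 — this is the part that's painful to retrofit.
  external_labels = {
    environment = "production",
    region      = "us-east-1",
    cluster     = "primary",
  }
}
```

Because the scrape block references the discovery component rather than hardcoded targets, the same file works across clusters; only the external labels change per environment.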
Grafana Mimir
Mimir is the managed Prometheus backend inside Grafana Cloud. If you already use Prometheus, your config, recording rules, and PromQL queries carry over without changes. The practical difference is that you stop worrying about federation, retention, and scaling and start actually using the data instead.
The most common Level 1 mistake we see is label cardinality. A team adds a dynamic label like user_id or request_id to their HTTP metrics, and within 24 hours their series count has exploded: tens of thousands of unused, high-cardinality series driving up costs with zero operational value. Mimir will ingest it. Your bill will reflect it. Those values belong in logs and traces, not metric labels. A few other things worth getting right early:
Turn on Adaptive Metrics immediately. It automatically drops the time series nobody is querying. We've seen series counts cut by over 60% on the first run with zero impact on any dashboard. The smaller footprint makes everything faster, and it costs less.
Agree on a naming convention before you have ten services, not two hundred. We use namespace_subsystem_name_unit across all engagements, for example http_server_request_duration_seconds. It seems pedantic until you're trying to build a cross-service dashboard at midnight and every service has named things differently.
If your dashboards are computing complex expressions against raw metrics on every load, they'll slow down as your data grows. Pre-aggregate with recording rules at ingest time and your dashboards stay fast regardless of data volume.
Not everything needs to be scraped at the same interval. Infrastructure metrics every 15 seconds makes sense. Business KPIs at that frequency are just burning write capacity. Match the interval to how often you actually need to act on the signal.
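A hypothetical recording-rule file shows two of these conventions at once: the namespace_subsystem_name_unit naming scheme and pre-aggregation, so dashboards read one pre-computed series instead of re-bucketing raw histograms on every load. Metric, service, and label names are placeholders.

```yaml
# Hypothetical Prometheus-compatible recording rules for Mimir.
groups:
  - name: http-server-aggregations
    rules:
      # Per-service p95 latency, pre-computed at ingest time.
      - record: service:http_server_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (service, le) (
              rate(http_server_request_duration_seconds_bucket[5m])
            )
          )
      # Per-service error ratio, useful later for burn-rate alerting.
      - record: service:http_server_requests:error_ratio_5m
        expr: |
          sum by (service) (rate(http_server_requests_total{status_code=~"5.."}[5m]))
            /
          sum by (service) (rate(http_server_requests_total[5m]))
```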
Grafana Dashboards
Dashboards are where most teams start and where most teams stop. Level 1 is about building them correctly so they're still useful six months later when the person who built them has moved on.
Use RED for services (Rate, Errors, Duration) and USE for infrastructure (Utilization, Saturation, Errors). Pick one and stick to it per dashboard. Mixing them makes things harder to read when it matters most, which is during an incident.
Three tiers: fleet overview, service overview, and instance drilldown. A flat structure where everything lives on one dashboard means every incident starts with the same question: where do I even begin? Build all three from the start.
Never hardcode environment, region, or cluster names. Use template variables. A dashboard that only works in production is one you'll clone for staging, then for every new region, and then spend the next year keeping in sync.
Store dashboard JSON in Git. If your dashboards only exist in the UI, they'll drift, they'll break, and when something goes wrong you'll have no way to understand what changed. The Grafana Terraform provider makes this straightforward.
Eight to twelve panels per dashboard. More than that and you've built a report, not a dashboard. Every panel should answer a specific operational question; if you can't articulate the question, the panel shouldn't be there.
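A sketch of the Git-backed approach using the Grafana Terraform provider. The folder title, resource names, and JSON file path are placeholders; the point is that the dashboard's source of truth lives in the repository, not the UI.

```hcl
# Hypothetical dashboard-as-code setup with the grafana/grafana provider.
terraform {
  required_providers {
    grafana = {
      source = "grafana/grafana"
    }
  }
}

resource "grafana_folder" "services" {
  title = "Service Overviews"
}

resource "grafana_dashboard" "checkout" {
  folder = grafana_folder.services.id
  # The JSON (exported from Grafana, with template variables intact)
  # is reviewed in pull requests like any other code.
  config_json = file("${path.module}/dashboards/checkout-overview.json")
}
```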
Level 1 is done when your team can answer "is the system healthy right now" without opening five different tools. You've got a scalable metrics backend, dashboards that reflect reality, and the label and naming discipline that every subsequent level depends on. Every team that's skipped this foundation has eventually come back to fix it, usually under pressure, with worse data than they'd have started with.
Level 2: Context
Knowing that your error rate spiked is useful. Knowing why it spiked is what actually ends the incident. Level 2 closes that gap. When your error rate dashboard spikes, you should be able to pivot to the relevant log stream in one click, not by opening a separate tool and starting a manual search. We've sat in incident bridges where engineers are tabbing between Grafana, Kibana, and an SSH session trying to manually piece together what happened. Level 2 ends that.
Grafana Loki
Loki indexes only labels, not log content: the same idea Prometheus applied to metrics. That single decision makes it dramatically cheaper to run at scale than Elasticsearch, and it means your log labels and metric labels can be identical, which is what makes one-click correlation work. One thing worth saying plainly: Loki isn't a replacement for a SIEM or a compliance log archive. Teams that try to run it that way end up with both a cost problem and a query performance problem. Use it for operational logs, the ones you need during an incident, and put compliance data somewhere else.
A few configuration choices matter a lot here:
Keep labels small and static: environment, service, region, pod. Everything dynamic (request IDs, user IDs, error messages) belongs inside the log line itself. Dynamic label values cause the same cardinality explosion you saw with Mimir.
Get your services logging structured JSON before you set up Loki. You can parse freeform logs with LogQL but it's slow and brittle. A structured log with a status_code field you can filter on directly is worth far more than a freeform string you have to regex through at 3am.
Debug logs from a busy service can cost more than all your other signals combined. Adaptive Logs drops them at the Alloy layer before they hit storage, with no application code changes required, and you lose nothing operationally useful.
Always filter by label before you pattern-match in LogQL. Loki loads all streams matching your labels first, then filters the content. A query with no label filter scans everything, and on a busy system that's both slow and expensive.
Set up derived fields to extract trace IDs from your log lines and turn them into clickable links to Tempo. This one configuration step is what makes three-signal correlation actually usable rather than just theoretically possible.
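Two hypothetical LogQL queries (the service and field names are placeholders) show the pattern: the first selects streams by label before parsing and filtering content, and the second turns the same selection into an error-rate series you can graph or alert on.

```logql
{environment="production", service="checkout"} | json | status_code >= 500

sum by (service) (
  count_over_time(
    {environment="production", service="checkout"}
      | json | status_code >= 500 [5m]
  )
)
```

Both queries narrow the stream set with labels before any content work happens, so Loki never scans logs from unrelated services.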
For the Alloy side of this: add a file log or Kubernetes pod log component alongside your existing metrics scrape components, apply the same label schema, and push to Loki. The critical thing is that your log labels match your metric labels exactly: same service names, same environment values, same region identifiers. Even a slight divergence breaks the one-click pivot from a metric spike to the relevant log stream. Parse and transform logs at the Alloy layer, not in your application code, so you can change your log schema without a service deployment.
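A hedged sketch of that Alloy pipeline for Kubernetes pod logs. The endpoint URL, JSON field names, and label values are placeholders; the important part is that parsing and labeling happen here, not in application code.

```alloy
// Hypothetical log pipeline reusing the same discovery and label schema
// as the metrics pipeline. All names and values are placeholders.
discovery.kubernetes "pods" {
  role = "pod"
}

loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.process.enrich.receiver]
}

loki.process "enrich" {
  // Parse structured JSON at the collector, so the log schema can
  // change without a service deployment.
  stage.json {
    expressions = { level = "level", trace_id = "trace_id" }
  }
  // These values must match the metric labels exactly, or the
  // one-click pivot from metrics to logs breaks.
  stage.static_labels {
    values = { environment = "production", region = "us-east-1" }
  }
  forward_to = [loki.write.grafana_cloud.receiver]
}

loki.write "grafana_cloud" {
  endpoint {
    url = "https://logs-prod-example.grafana.net/loki/api/v1/push"
  }
}
```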
What changes at Level 2 isn't the technology; it's how your team works. Engineers stop treating logs as a separate investigation step and start working with metrics and logs as one surface. Incidents that used to take 30 minutes to diagnose get resolved in under 5. That's where Level 2 pays for itself.
Level 3: Depth
Traces are what most teams instrument last, which is the wrong order. By the time your architecture is mature, adding tracing is significantly harder than building it in from the start. What traces give you is the journey of a single request through your entire system: every service it touched, every database query it made, every external call it waited on. Combined with what you have from Levels 1 and 2, that's the ability to follow a request from browser click to database row and understand where time was actually spent.
Grafana Tempo
Tempo does one thing differently from every other tracing backend: it doesn't index traces. Most backends charge based on indexed spans, which forces sampling decisions that mean you miss the traces you actually needed. Tempo stores everything and you find traces through correlation from metrics and logs. That shift is what makes long retention economically viable.
Use OpenTelemetry. Tempo is OTel-native and the SDK is vendor-neutral, so you're not locked in to Grafana's implementation. Instrument with OTel today and you can route to any compliant backend tomorrow without re-instrumenting your services.
Start at service boundaries. Inbound requests, outbound calls, database queries, queue interactions: these all need to be spans. Function-level tracing can come later, get the architecture-level picture right first.
Exemplars are underused. An exemplar is a trace ID embedded directly in a metric data point. When latency spikes in Mimir, clicking the exemplar lands you on a real trace for that spike with no manual searching. Set them up early.
Learn TraceQL. Most people use Tempo only for individual trace lookup and miss the most powerful part of it. Finding every trace where a specific downstream service took over 500ms is a one-line TraceQL query; that's not possible with logs.
Keep 100% of error and slow traces. Sample fast, successful ones at a lower rate. The traces you care about most are the ones where something went wrong, and you can never predict which requests those will be in advance. Tail-based sampling lets you make that decision after the fact.
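To make the TraceQL point concrete, here are two hypothetical queries; the service names are placeholders. The first finds spans anywhere in the system that were both slow and failed. The second uses the descendant operator to find traces that entered through the gateway and spent over 500ms in a downstream payments span.

```traceql
{ span.http.status_code >= 500 && duration > 2s }

{ resource.service.name = "api-gateway" } >> { resource.service.name = "payments" && duration > 500ms }
```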
What this looks like in practice: a metric dashboard shows elevated latency. You click an exemplar and land directly on a representative trace. The trace shows 80% of the request time sitting in a database span. You click that span and pivot to the Loki logs for that service at that exact timestamp. There's a slow query warning right there. Root cause found in under three minutes, without switching tools or writing a single manual query.
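The keep-errors-and-slow-traces policy described above can be sketched with Alloy's tail-sampling processor. This is a hedged example: the thresholds, sampling percentage, and forward target are placeholders to tune for your traffic.

```alloy
// Hypothetical tail-based sampling: keep all error and slow traces,
// sample the rest at 10%. Values and the exporter name are placeholders.
otelcol.processor.tail_sampling "default" {
  // Wait for the full trace before deciding whether to keep it.
  decision_wait = "10s"

  policy {
    name = "keep-errors"
    type = "status_code"
    status_code {
      status_codes = ["ERROR"]
    }
  }

  policy {
    name = "keep-slow"
    type = "latency"
    latency {
      threshold_ms = 500
    }
  }

  policy {
    name = "sample-the-rest"
    type = "probabilistic"
    probabilistic {
      sampling_percentage = 10
    }
  }

  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}
```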
Grafana Pyroscope
Traces tell you which service is slow. Pyroscope tells you which function inside that service is causing it. It runs continuously in production, which sounds scary but the overhead is low enough that most teams forget it's there. The flame graphs it produces during an incident are usually worth more than anything else in your observability stack.
Production profiling isn't optional. The performance problems that show up in production almost never reproduce in development because the data volumes, concurrency patterns, and configuration are all different. Profile where the problem actually lives.
When a trace span shows a service taking 800ms and you don't know why, click through to Pyroscope. It'll show you the flame graph for that service during that exact time window. In our experience, the slow function is usually obvious on first look.
Before scaling up a service, profile it. Half the time what looks like a resource constraint is a hot loop or an inefficient query that a bigger instance won't fix. Profile first, resize after.
Start with CPU and heap allocation. They account for most performance issues. Goroutine, mutex, and I/O profiles are useful but they're the second layer, not the first.
Level 3 changes how your engineers approach debugging. Instead of forming a hypothesis and hunting for evidence, you start from the evidence and work backwards. The question stops being "where should I look" and becomes "what does the data show." It's hard to appreciate until you've experienced it.
Level 4: Reliability
Levels 1 through 3 give you the visibility to understand your systems. Level 4 is where you formalize what good looks like and build the operational machinery to defend it. Service Level Objectives define your reliability targets. Alerting fires when those targets are at risk. OnCall ensures the right person gets notified. This is also the level where most teams discover they've been doing alerting wrong: hundreds of alerts, no hierarchy, no escalation paths, and engineers who've learned to ignore the noise.
Grafana Alerting
Grafana's alerting engine is Prometheus-compatible and deeply integrated with the full Grafana Cloud stack. Alert rules can query any data source, including Mimir, Loki, and Tempo, and fire via Alertmanager routing to any notification channel.
Alert on symptoms, not causes. Page on "checkout latency exceeds 2 seconds for 5 minutes" rather than "CPU exceeds 80%." Symptom-based alerts are actionable. Cause-based alerts just generate noise.
Three tiers: informational gets logged, warning gets a ticket, critical pages someone. The reason most alert systems stop working after six months is that every alert is critical. When everything is urgent, nothing is.
Alert against recording rules, not raw metric expressions. A complex multi-series condition that takes two seconds to evaluate normally will time out during a high-cardinality incident, which is precisely when you need it. Pre-compute it.
Set up inhibition rules before your first major outage. One dead host without them generates one alert per affected service, and your on-call engineer spends the first ten minutes acknowledging noise instead of fixing the problem.
Every alert needs a duration window. A condition that holds for five minutes is an incident. A condition that holds for five seconds is noise. False positive alerts aren't harmless, they train your team to be slower to respond.
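Put together, a symptom-based rule in the Prometheus-compatible rule format might look like the sketch below. The recording rule name, labels, and runbook URL are hypothetical placeholders.

```yaml
# Hypothetical alert rule: pages on a symptom, evaluates a pre-computed
# recording rule, and requires the condition to hold for five minutes.
groups:
  - name: checkout-symptoms
    rules:
      - alert: CheckoutLatencyHigh
        expr: service:http_server_request_duration_seconds:p95_5m{service="checkout"} > 2
        for: 5m          # five seconds of noise never pages anyone
        labels:
          severity: critical
        annotations:
          summary: "Checkout p95 latency above 2s for 5 minutes"
          runbook_url: https://wiki.example.com/runbooks/checkout-latency
```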
Grafana OnCall
OnCall is where alerts become someone's problem to solve. It manages schedules, escalation chains, and integrations with Slack, PagerDuty, and SMS. The part most teams skip is building the escalation logic before they need it. A schedule with only one person on it isn't an escalation policy.
On-call schedules need to exist before you go to production, not the week after your first 3am incident. An untested schedule is a plan that will fail at the worst possible moment.
Route alerts by service ownership, not just severity. The person who built the checkout service should receive checkout alerts, not whoever is on a global SRE rotation with no context for the system.
Your escalation chain should assume the primary on-call won't respond. People miss pages. Phones die. Five minutes of no acknowledgement should escalate automatically; that's not a reflection on your team, it's just how incidents work.
Connect OnCall to Slack. Grafana Assistant works directly inside Slack channels, so your team can pull metrics, logs, and traces without leaving the incident channel. Reducing context switching during an incident is worth more than it sounds.
Grafana SLOs
Most teams have never written down what "good" means for their services. SLOs force that conversation. Grafana Cloud's built-in SLO feature lets you define objectives, track error budgets, and alert on burn rate. The error budget is the most useful part: it gives product and engineering a shared number to argue about instead of arguing about feelings. A 99.9% availability SLO over 30 days gives you 43.2 minutes of allowable downtime per month; if you've already burned 40 of those minutes, every deployment is a real conversation.
Start with three SLOs per critical service: availability, latency, and correctness. These three cover most of what users actually experience.
Use burn rate alerts rather than raw error rate alerts. A service that's slightly degraded for a long time will blow through your error budget without ever triggering a threshold-based alert. Burn rate alerts catch slow deterioration early.
Share SLO dashboards with product and leadership, not just the engineering team. An SLO that only engineers can see is just a metric. One the whole organization tracks becomes a shared accountability tool.
When a service burns through its error budget, stop shipping features to it until reliability is restored. Without the deployment gate, the budget is just a number.
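To illustrate the burn-rate mechanics: for a 99.9% SLO, a burn rate of 14.4x consumes 2% of a 30-day error budget in a single hour, which is the classic fast-burn page. Grafana Cloud's SLO feature generates rules like this for you; the sketch below, with placeholder metric and service names, only shows what's happening underneath.

```yaml
# Hypothetical fast-burn alert for a 99.9% availability SLO.
# Both a long (1h) and short (5m) window must exceed 14.4x the error
# budget rate, so brief blips don't page anyone.
- alert: CheckoutErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_server_requests_total{service="checkout", status_code=~"5.."}[1h]))
        / sum(rate(http_server_requests_total{service="checkout"}[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_server_requests_total{service="checkout", status_code=~"5.."}[5m]))
        / sum(rate(http_server_requests_total{service="checkout"}[5m]))
    ) > (14.4 * 0.001)
  labels:
    severity: critical
```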
Level 4 is where observability stops being an engineering concern and becomes an organizational one. The clearest sign a team has actually internalized it is when reliability stops being debated in incident retrospectives and starts getting planned for in sprint planning.
Level 5: Proactive
The first four levels are reactive: you observe what's happening and respond. Level 5 is where that flips. You validate system behavior before users encounter it, you have visibility into the browser and frontend experience, and you catch performance regressions in CI/CD before they reach production. Teams at Level 5 fix problems their users never see. Most engineering teams never get here, not because it's technically hard, but because Level 4 never gets finished properly and there's always something more urgent. That's the argument for building this way from the start.
Grafana Synthetic Monitoring
Synthetic Monitoring runs scheduled probes against your endpoints from Grafana's global probe network: HTTP checks, DNS resolution, TCP connectivity, ping, and scripted browser tests, all running continuously from the perspective of a user in any geographic region.
A synthetic check that fires before a single real user is affected is the best alert you can have. Route synthetic failures through OnCall at the same priority as production incidents. They are production incidents, just ones you caught first.
Run probes from where your users actually are. A green check from us-east-1 means nothing if your users are in Western Europe and experiencing a regional routing issue.
An HTTP 200 check tells you the server responded. A scripted browser check that completes your login flow, adds an item to a cart, and verifies the total tells you your product works. Instrument the journeys that matter to your users.
Your availability SLO should use synthetic probe data, not just backend health checks. A service can pass every internal health check while being completely unreachable from the outside.
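Scripted browser checks are written as k6 browser scripts. The sketch below is hypothetical: the URL, selectors, and credentials are placeholders, and the point is that it verifies a user journey rather than a status code.

```javascript
// Hypothetical k6 browser check for a login-to-cart journey.
// All URLs, selectors, and credentials are placeholders.
import { browser } from 'k6/browser';
import { check } from 'k6';

export const options = {
  scenarios: {
    ui: {
      executor: 'shared-iterations',
      options: { browser: { type: 'chromium' } },
    },
  },
};

export default async function () {
  const page = await browser.newPage();
  try {
    await page.goto('https://shop.example.com/login');
    await page.locator('#email').type('probe@example.com');
    await page.locator('#password').type('not-a-real-password');
    await Promise.all([
      page.waitForNavigation(),
      page.locator('button[type="submit"]').click(),
    ]);
    // The check verifies the journey succeeded, not just that a page loaded.
    const cartVisible = await page.locator('#cart').isVisible();
    check(cartVisible, { 'cart is reachable after login': (v) => v === true });
  } finally {
    await page.close();
  }
}
```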
Grafana k6
k6 is Grafana's load testing tool, and the integration with Grafana Cloud is what makes it worth using here. Your load test results show up as metrics in Mimir alongside your production data; when you're running a test, you're watching the same dashboards your team uses during incidents, not a separate tool.
Running k6 only before a release means you're catching performance regressions weeks after they were introduced. Run it in CI instead, so you catch them in the same pull request. The earlier the catch, the cheaper the fix.
A k6 test with no thresholds is just a chart. Set p95 latency limits, error rate ceilings, and throughput floors as hard pass/fail conditions. If the test can't fail, it's not testing anything.
Watch your Mimir dashboards while k6 is running. Test results show user-visible behavior. Dashboards show what the system is doing internally. Together they give you symptoms and causes at the same time.
Run a k6 test before and after any significant infrastructure change. Resizing a service, adjusting a connection pool, changing resource limits: these all affect behavior under load in ways that are hard to predict.
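A minimal k6 test wired for CI might look like this sketch. The endpoint, load shape, and threshold values are placeholders to adapt; what matters is that the thresholds are hard pass/fail conditions, so the pipeline fails when performance regresses.

```javascript
// Hypothetical CI load test; endpoint and limits are placeholders.
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  vus: 50,
  duration: '5m',
  // k6 exits non-zero if any threshold fails, which fails the CI job.
  thresholds: {
    http_req_duration: ['p(95)<500'],  // p95 latency ceiling, ms
    http_req_failed: ['rate<0.01'],    // error rate ceiling: 1%
    http_reqs: ['rate>100'],           // throughput floor: 100 req/s
  },
};

export default function () {
  http.get('https://api.example.com/checkout/health');
  sleep(1);
}
```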
Grafana Faro
Most backend observability tells you what your servers did. Faro tells you what your users experienced. It's a browser SDK that captures errors, performance data, and traces from the frontend and feeds them into the same Grafana Cloud stack as everything else. The most useful thing it does is connect a frontend user action to the backend trace it triggered.
Wire Faro into your backend OTel tracing so the trace ID generated in the browser propagates all the way through to the database. A trace that starts at the user click and ends at the query result is the most complete picture you can have of what your application actually does.
Backend latency SLOs don't capture the user experience. Your API can respond in 200ms and your interface can still feel slow because of rendering, layout shifts, or JavaScript blocking. Core Web Vitals measure what users actually experience. Track them.
Users don't report JavaScript errors. They just leave. Faro captures frontend exceptions automatically; a JS error affecting 2% of sessions will never show up in your support queue, but it shows up immediately in Faro.
Session replay is useful but not unlimited. Decide upfront which journeys are worth capturing, document it, and make sure your privacy policy reflects what you collect. Capturing everything by default is a GDPR conversation waiting to happen.
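A hypothetical Faro initialization, assuming the @grafana/faro-web-sdk and @grafana/faro-web-tracing packages. The collector URL and app metadata are placeholders; the tracing instrumentation is what propagates the browser-generated trace ID into your backend OTel traces.

```typescript
// Hypothetical Faro setup; URL and app metadata are placeholders.
import { initializeFaro, getWebInstrumentations } from '@grafana/faro-web-sdk';
import { TracingInstrumentation } from '@grafana/faro-web-tracing';

initializeFaro({
  url: 'https://faro-collector-example.grafana.net/collect/app-key-placeholder',
  app: { name: 'storefront', version: '1.4.2' },
  instrumentations: [
    // Errors, Core Web Vitals, console capture, and session tracking.
    ...getWebInstrumentations(),
    // Propagates W3C trace context on outgoing requests, so a trace
    // that starts at the user click continues into the backend.
    new TracingInstrumentation(),
  ],
});
```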
At Level 5, your team stops reacting and starts preventing. Synthetic monitoring catches endpoint failures before real users hit them. k6 catches performance regressions before they reach production. Faro surfaces browser-side issues that backend metrics can't see. Pyroscope shows you continuously where your services are actually spending time. Between them, you're finding and fixing problems before your users see them. That's the goal.
Maturity Summary
The table below summarizes the five levels covered in this playbook. Most organizations arrive at NextLink Labs at Level 1 or 2. The goal of an engagement is to reach Level 4 where reliability is formalized and operationally defended, and to have a clear path to Level 5.
Level 1 — Visibility: Metrics + basic dashboards. Know your numbers. Key products: Mimir, Grafana Dashboards, Alloy
Level 2 — Context: Logs correlated to metrics. Understand why, not just what. Key products: Loki, Alloy, Adaptive Logs
Level 3 — Depth: Distributed tracing wired to metrics and logs. Follow the request. Key products: Tempo, TraceQL, OTel
Level 4 — Reliability: Alerting with escalation, SLOs, and error budget tracking. Key products: Alerting, OnCall (IRM), SLOs
Level 5 — Proactive: Proactive testing, profiling, and frontend visibility before users notice. Key products: Synthetic Monitoring, k6, Pyroscope, Faro
What the table doesn't show is compounding. Level 3 traces are only useful if your Level 1 metric labels are consistent enough to correlate them. Level 4 SLOs are only meaningful if your Level 2 log data is structured enough to explain burn rate events. Skipping levels doesn't save time. You're taking on debt you'll pay later, usually with worse data and more pressure.
One question we get regularly is why this playbook uses Grafana Cloud rather than a self-hosted stack. When you run your own Grafana, Mimir, Loki, and Tempo, a real portion of your engineering time goes into operating those systems rather than using them. We've seen teams spend weeks on Mimir compaction issues, Loki retention configuration, and Tempo backend scaling. Grafana Cloud removes that entirely. Pre-built integrations mean you're collecting real signals within minutes of signing up, not days. That speed matters when you're trying to move through these levels while also shipping products.
About NextLink Labs
NextLink Labs is a platform engineering and DevOps consultancy. We work with engineering teams who are modernizing infrastructure or trying to build real observability rather than just dashboards. The practices in this playbook come from real client engagements, not theory.
If you want an honest assessment of where your stack sits on this maturity model, or help moving through the levels faster than you could on your own, reach out at nextlinklabs.com.