Insights | NextLink Labs

Automatically Integrate Grafana with AWS Fargate: A Practical Pattern for 2026

Written by Aru Shanmugam | Jun 3, 2026 1:55:00 PM

All code in this post is taken directly from the public fargate-with-alloy-demo repository. Clone it and run terraform apply to get a working deployment. All modules are bundled under modules/ — no external dependencies required.

Why This Is Harder Than It Looks

AWS Fargate is one of the cleanest ways to run containers. You hand AWS a task definition, and the platform schedules, networks, and bills you per task. No EC2 fleet to patch. No node groups to right-size. No kubectl drain at 2 a.m.

That same abstraction is what makes Fargate observability frustrating. You can't SSH into the host. You can't run a node-level agent. You can't attach an eBPF probe, because Fargate's microVM environment doesn't expose the kernel hooks eBPF tools rely on. CloudWatch gives you the basics, but the moment you want correlated metrics, logs, and traces in one pane of glass, the question becomes: where does the agent actually live, and how do I deploy it without bolting on snowflake configuration to every task?

The good news is that the answer has gotten significantly cleaner. Grafana Agent has been replaced by Grafana Alloy, Alloy has gained a native receiver for the ECS task metadata endpoint, and the sidecar pattern has matured to the point where adding Grafana observability to a new Fargate service is a single block of Terraform. That last point is the whole point of this post.

What Changed in the Last Year

 
Grafana Agent is deprecated in favor of Grafana Alloy. Alloy is Grafana Labs' OpenTelemetry-compatible collector with native components for metrics, logs, traces, and profiles. If you're starting fresh in 2026, stand up Alloy, not Agent.
 
Alloy has a native ECS receiver. The otelcol.receiver.awsecscontainermetrics component reads the ECS Task Metadata Endpoint V4 directly and emits CPU, memory, network, and disk metrics for every container in the task. No exporter sidecar needed. This component is still experimental — you need --stability.level=experimental to enable it.
 
Amazon ECS Managed Instances launched on September 30, 2025. This gives you a Fargate-like managed lifecycle on EC2 instances of your choosing, which is useful if you've previously rejected Fargate on GPU or cost grounds. The same sidecar pattern applies there, although the module as written targets Fargate.
 
Grafana Beyla does not work on Fargate. Beyla is Grafana's eBPF-based auto-instrumentation tool. Fargate's lack of eBPF kernel hooks means it doesn't apply. If you need code-free auto-instrumentation on Fargate, use OpenTelemetry SDK auto-instrumentation libraries.

The Three Integration Paths

Path 1: Alloy sidecar to Grafana Cloud (or self-hosted Grafana). The application task definition adds an Alloy container alongside your app. Alloy pulls ECS task metrics, accepts OTLP from your instrumented code, and ships everything to a Grafana Cloud stack. Best for: teams that want a single observability backend across AWS and non-AWS workloads, or teams already committed to Grafana Cloud. This is the path this post implements.

 

Path 2: AWS Distro for OpenTelemetry (ADOT) to Amazon Managed Prometheus + Amazon Managed Grafana. ADOT is the AWS-distributed OpenTelemetry collector. Run it as a sidecar with a config pulled from SSM, and have it forward to AMP and AMG. Best for: teams that want everything inside AWS for compliance, billing consolidation, or IAM-native auth.

 

Path 3: CloudWatch as the floor, Grafana on top. Let AWS collect logs and metrics via Container Insights and the awslogs driver, then point Grafana at the CloudWatch data source. No sidecar required. Best for: small environments and MVPs. The downside is you're locked into CloudWatch's query language, granularity, and retention pricing.

 

The Automatic Pattern: A Two-Layer Architecture

The repo separates concerns into two layers that are deployed together with a single terraform apply but have independent lifecycles:

 
Infrastructure layer — VPC, security groups, ECS cluster. Set up once. Shared by all services in the deployment.
 
Service layer — one module block per Fargate service. Each block deploys an app container plus an Alloy sidecar, along with all supporting resources: SSM config, Secrets Manager credentials, IAM roles, CloudWatch log groups, task definition, and ECS service.

Adding a new observed service means copying one block and running terraform apply. The repository layout reflects this:

fargate-with-alloy-demo/
├── main.tf                ← infrastructure layer + service module calls
├── variables.tf
├── outputs.tf
├── version.tf
├── terraform.tfvars       ← your Grafana Cloud credentials go here
└── modules/
    ├── fargate-with-alloy/    ← THE reusable module (one block per service)
    │   ├── main.tf            ← SSM, Secrets Manager, IAM, log groups, task def, service
    │   ├── variables.tf
    │   ├── outputs.tf
    │   ├── version.tf
    │   └── config/
    │       └── alloy.alloy    ← bundled default Alloy config
    ├── vpc/                   ← VPC with public/private subnets, IGW, NAT Gateway
    ├── security-group/        ← security groups with ingress/egress rules
    └── ecs/                   ← ECS cluster primitive (used by the infrastructure layer)

 

All modules are local. There are no external module registry references or remote git sources — clone the repo and terraform init will work immediately.

Step 0: Clone the Repo and Set Your Credentials

git clone https://gitlab.com/nextlink/devops/fargate-with-alloy-demo
cd fargate-with-alloy-demo

 

Open terraform.tfvars and fill in your Grafana Cloud OTLP credentials. Find them in Grafana Cloud → Connections → OpenTelemetry → Other. The Instance ID is the numeric Stack ID shown on your account page.

terraform.tfvars 

aws_region = "us-east-1"
name = "fargate-alloy-demo"
environment = "dev"

# Paste your Grafana Cloud values here
grafana_otlp_endpoint = "https://otlp-gateway-prod-us-east-0.grafana.net/otlp"
grafana_instance_id = "<YOUR_INSTANCE_ID>"
grafana_api_token = "<YOUR_API_TOKEN>"

 

Add it to .gitignore before committing anything:

echo "terraform.tfvars" >> .gitignore
echo ".terraform/" >> .gitignore
echo "*.tfstate*" >> .gitignore
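
The root variables.tf is not reproduced in this post. As a minimal sketch, assuming the variable names match the terraform.tfvars keys above, the three Grafana Cloud inputs could be declared as follows. The descriptions and the validation rule are illustrative assumptions, not the repository's actual file.

```hcl
# Sketch of root-level declarations for the Grafana Cloud inputs.
# Names mirror the terraform.tfvars keys; descriptions and validation are assumptions.
variable "grafana_otlp_endpoint" {
  description = "Grafana Cloud OTLP gateway URL"
  type        = string

  validation {
    # Reject plain-HTTP endpoints so credentials are never sent unencrypted.
    condition     = startswith(var.grafana_otlp_endpoint, "https://")
    error_message = "grafana_otlp_endpoint must be an https:// URL."
  }
}

variable "grafana_instance_id" {
  description = "Grafana Cloud instance ID (Basic Auth username)"
  type        = string
  sensitive   = true
}

variable "grafana_api_token" {
  description = "Grafana Cloud API token (Basic Auth password)"
  type        = string
  sensitive   = true # redacted in plan/apply output, but still written to state
}
```

Marking a variable sensitive only hides it from CLI output; the value still lands in the state file, which is one more reason the .gitignore entries above matter. The token can also be supplied through the TF_VAR_grafana_api_token environment variable instead of terraform.tfvars.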

 

Step 1: Provider and Version Constraints

The root version.tf pins Terraform and the AWS provider:

version.tf 

terraform {
  required_version = ">= 1.10.0"
  required_providers {
    aws = {
      source = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

 

Step 2: Networking Layer

ECS tasks run in private subnets. A NAT Gateway gives each Alloy sidecar outbound access to the Grafana Cloud OTLP endpoint. The security group is intentionally outbound-only: intra-task communication between the app and Alloy happens over localhost and never crosses the security group boundary.

main.tf — networking 

# ── VPC ───────────────────────────────────────────────────────────────────────
module "vpc" {
  source = "./modules/vpc"
  name = var.name
  vpc_cidr = var.vpc_cidr
  az_count = var.az_count
  use_case = "ecs"
  enable_igw = true
  enable_nat_gateway = true
  single_nat_gateway = var.single_nat_gateway
  tags = local.tags
}

# ── Security Groups ───────────────────────────────────────────────────────────
# Tasks only need outbound access: Alloy → Grafana Cloud (443) and ECR pulls (443).
# Intra-task communication (app → Alloy via localhost) does not traverse the SG.
module "security_group" {
  source = "./modules/security-group"
  vpc_id = module.vpc.vpc_id
  security_groups = {
    ecs_tasks = {
      name = "${var.name}-ecs-tasks"
      description = "ECS Fargate task SG for ${var.name}: outbound-only (Alloy to Grafana Cloud)"
      ingress_rules = []
      egress_rules = [
        {
          from_port = 443
          to_port = 443
          protocol = "tcp"
          cidr_blocks = ["0.0.0.0/0"]
          description = "OTLP HTTPS to Grafana Cloud and ECR image pulls"
        },
        {
          from_port = 80
          to_port = 80
          protocol = "tcp"
          cidr_blocks = ["0.0.0.0/0"]
          description = "HTTP for ECR and ECS metadata endpoints"
        },
      ]
    }
  }
  tags = local.tags
}

 

Step 3: ECS Cluster

The cluster is a direct resource, not a module call, because the cluster is shared infrastructure — it lives once and hosts all services. Container Insights is enabled at the cluster level so it works as a CloudWatch fallback even when Alloy is handling the primary signal path.

main.tf — ECS cluster 

# ── ECS Cluster ───────────────────────────────────────────────────────────────
resource "aws_ecs_cluster" "this" {
  name = var.name
  setting {
    name = "containerInsights"
    value = "enabled"
  }
  tags = merge(local.tags, { Name = var.name })
}

resource "aws_ecs_cluster_capacity_providers" "this" {
  cluster_name = aws_ecs_cluster.this.name
  capacity_providers = ["FARGATE", "FARGATE_SPOT"]
  default_capacity_provider_strategy {
    capacity_provider = var.default_capacity_provider
    weight = 1
    base = 0
  }
}

 

Step 4: The Alloy Config

The config is bundled inside the module at modules/fargate-with-alloy/config/alloy.alloy. The module reads it with file("${path.module}/config/alloy.alloy") and stores it as an SSM SecureString parameter, so you can update the config without rebuilding the container image: run terraform apply -target=module.demo_app.aws_ssm_parameter.alloy_config (substituting your own module name), then force a new ECS deployment so running tasks pick up the change.

If you need a custom pipeline (Prometheus scrape jobs, log filtering, a different export destination), set the alloy_config module variable to override the bundled default.
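
The module's handling of this override is not shown in the post. One plausible shape, assuming the local is named alloy_config_content as referenced by the SSM resource in Step 6, is a conditional that prefers the caller's string and falls back to the bundled file. The "payments" service, its image tag, and the alloy/payments.alloy path are hypothetical.

```hcl
# modules/fargate-with-alloy/main.tf (sketch, not the repository's exact code)
locals {
  # Prefer the caller's override; otherwise use the config bundled with the module.
  alloy_config_content = (
    var.alloy_config != null
    ? var.alloy_config
    : file("${path.module}/config/alloy.alloy")
  )
}

# Root main.tf (sketch): one service with its own Alloy pipeline.
# "alloy/payments.alloy" is a hypothetical file you would create yourself.
module "payments" {
  source             = "./modules/fargate-with-alloy"
  name               = "${var.name}-payments"
  app_image          = "123456789012.dkr.ecr.us-east-1.amazonaws.com/payments:v0.1.0"
  app_port           = 8080
  cluster_name       = aws_ecs_cluster.this.name
  subnet_ids         = module.vpc.private_subnet_ids
  security_group_ids = [module.security_group.security_group_ids["ecs_tasks"]]

  grafana_otlp_endpoint = var.grafana_otlp_endpoint
  grafana_instance_id   = var.grafana_instance_id
  grafana_api_token     = var.grafana_api_token

  # Replaces the bundled default for this service only.
  alloy_config = file("${path.root}/alloy/payments.alloy")
}
```

Standard-tier Parameter Store values are capped at 4 KB, so a long custom pipeline may need the advanced tier or a trimmed config.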

modules/fargate-with-alloy/config/alloy.alloy 

// ── Alloy configuration for Fargate sidecar pattern ──────────────────────────
//
// Receives OTLP signals from the app container on localhost (containers in the
// same Fargate task share a network namespace, so no DNS resolution needed).
// Scrapes ECS task/container metrics from the metadata endpoint V4.
// Forwards everything to Grafana Cloud via OTLP HTTP with Basic Auth.
//
// Environment variables injected from AWS Secrets Manager at task start:
// GRAFANA_OTLP_ENDPOINT — Grafana Cloud OTLP gateway URL
// GRAFANA_INSTANCE_ID — Grafana Cloud instance ID (Basic Auth username)
// GRAFANA_API_TOKEN — Grafana Cloud API token (Basic Auth password)

// ── Receive OTLP signals from the application container ──────────────────────
otelcol.receiver.otlp "app" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  http {
    endpoint = "0.0.0.0:4318"
  }
  output {
    metrics = [otelcol.processor.batch.default.input]
    traces = [otelcol.processor.batch.default.input]
    logs = [otelcol.processor.batch.default.input]
  }
}

// ── Scrape ECS task and container metrics from the metadata endpoint ──────────
// Uses ECS_CONTAINER_METADATA_URI_V4 injected automatically by Fargate.
// Requires --stability.level=experimental on the Alloy command line.
otelcol.receiver.awsecscontainermetrics "ecs" {
  output {
    metrics = [otelcol.processor.batch.default.input]
  }
}

// ── Batch signals before exporting (reduces API call volume) ─────────────────
otelcol.processor.batch "default" {
  timeout = "5s"
  output {
    metrics = [otelcol.exporter.otlphttp.grafana_cloud.input]
    traces = [otelcol.exporter.otlphttp.grafana_cloud.input]
    logs = [otelcol.exporter.otlphttp.grafana_cloud.input]
  }
}

// ── Export to Grafana Cloud via OTLP HTTP ────────────────────────────────────
otelcol.exporter.otlphttp "grafana_cloud" {
  client {
    endpoint = sys.env("GRAFANA_OTLP_ENDPOINT")
    auth = otelcol.auth.basic.grafana_cloud.handler
  }
}

otelcol.auth.basic "grafana_cloud" {
  username = sys.env("GRAFANA_INSTANCE_ID")
  password = sys.env("GRAFANA_API_TOKEN")
}

 

Three things to note:

 
The awsecscontainermetrics receiver needs no credentials. It hits ECS_CONTAINER_METADATA_URI_V4, which Fargate injects into every container's environment automatically.
 
The OTLP endpoint, instance ID, and API token come from environment variables injected at task start from Secrets Manager. Nothing sensitive is baked into the config file.
 
--stability.level=experimental is required on the Alloy command line because awsecscontainermetrics is still experimental. The module handles this in the container command.

Step 5: Deploy Your First Service — One Module Block

This is the entire service deployment. The module takes the cluster, subnets, and security groups you just created and wires up everything else automatically.

main.tf — service module call 

# ── Services ──────────────────────────────────────────────────────────────────
# Each service is one module block. Adding a new observed Fargate service means
# copying this block, changing the name and app_image, and running terraform apply.
module "demo_app" {
  source = "./modules/fargate-with-alloy"
  name = "${var.name}-${var.service_name}"
  app_image = var.app_image
  app_port = var.app_port
  task_cpu = var.task_cpu
  task_memory = var.task_memory
  cluster_name = aws_ecs_cluster.this.name
  subnet_ids = module.vpc.private_subnet_ids
  security_group_ids = [module.security_group.security_group_ids["ecs_tasks"]]
  grafana_otlp_endpoint = var.grafana_otlp_endpoint
  grafana_instance_id = var.grafana_instance_id
  grafana_api_token = var.grafana_api_token
  alloy_image = var.alloy_image
  alloy_essential = var.alloy_essential
  capacity_provider = var.default_capacity_provider
  log_retention_days = var.log_retention_days
  enable_execute_command = true
  tags = local.tags
}

 

The name variable is the only truly critical input to get right — it prefixes every resource the module creates: IAM role names, SSM parameter path, Secrets Manager secret name, CloudWatch log group names, ECS task family, and ECS service name. It must be unique within your AWS account.

To deploy:

terraform init
terraform plan -out=tfplan
terraform apply tfplan

 

Total apply time is roughly five minutes, dominated by NAT Gateway provisioning. After apply, verify that Alloy started correctly:

# Alloy startup — look for "Running component" lines
aws logs tail /ecs/fargate-alloy-demo/alloy --follow --region us-east-1

# App container
aws logs tail /ecs/fargate-alloy-demo/demo-app --follow --region us-east-1
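
The log group paths above depend on how the module names its resources. A hedged alternative is to read the names from the module outputs listed in Step 7. The root-level output names below are hypothetical additions to outputs.tf.

```hcl
# Root outputs.tf (sketch): re-export the module's log group names
# so scripts do not need to hard-code the paths.
output "demo_app_log_group" {
  description = "CloudWatch log group for the app container"
  value       = module.demo_app.app_log_group
}

output "demo_app_alloy_log_group" {
  description = "CloudWatch log group for the Alloy sidecar"
  value       = module.demo_app.alloy_log_group
}
```

With these in place, aws logs tail "$(terraform output -raw demo_app_alloy_log_group)" --follow tails the sidecar regardless of the naming scheme.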

 

In Grafana Cloud → Explore → Metrics, query:

container_cpu_usage_vcpu_vCPU

 

Once metrics are flowing, the container metrics appear in Grafana Cloud. The four panels described below come from a working deployment: CPU usage in vCPUs, CPU utilization, memory utilization in megabytes, and network receive rate, each broken down by container (alloy in green, demo-app in yellow):

container_cpu_usage_vcpu_vCPU — the Alloy sidecar (green) settles to a steady baseline after an initial spike; the demo-app (yellow) shows minimal CPU consumption throughout.

container_cpu_utilized_None — Alloy's CPU utilization climbs briefly around 19:35 during a metrics flush cycle, then returns to its ~0.06 baseline. The demo-app holds near zero.

container_memory_utilized_Megabytes — Alloy holds a flat ~68 MB footprint; the demo-app sits at ~25 MB. Both lines are stable, which is the expected profile for a steady-state sidecar deployment.

container_network_rate_rx_Bytes_per_Second — the demo-app's inbound network traffic (yellow) shows the expected bursty pattern of a running service. The Alloy sidecar's receive rate (green) is indistinguishably low — it's primarily an outbound component.

Step 6: What the Module Creates Under the Hood

When you call module "demo_app", the fargate-with-alloy module creates the following resources automatically. You never write this boilerplate again for any subsequent service.

SSM + Secrets Manager 

The Alloy config is stored as an SSM SecureString and injected through ECS's secrets field, which resolves Parameter Store and Secrets Manager references into environment variables before the container starts. The module standardizes on SecureString so the config is encrypted at rest and handled the same way as the credentials.

The three Grafana Cloud credentials are stored as a single JSON secret in Secrets Manager so ECS can extract individual keys using the <secret_arn>:<json-key>:: syntax.

modules/fargate-with-alloy/main.tf — SSM + Secrets Manager 

resource "aws_ssm_parameter" "alloy_config" {
  name = "/${var.name}/alloy/config"
  description = "Grafana Alloy config for the ${var.name} Fargate service"
  type = "SecureString"
  value = local.alloy_config_content
  tags = local.tags
}

resource "aws_secretsmanager_secret" "grafana" {
  name = "${var.name}/grafana-cloud"
  description = "Grafana Cloud OTLP credentials for ${var.name} Alloy sidecar"
  recovery_window_in_days = 7
  tags = local.tags
}

resource "aws_secretsmanager_secret_version" "grafana" {
  secret_id = aws_secretsmanager_secret.grafana.id
  secret_string = jsonencode({
    endpoint = var.grafana_otlp_endpoint
    instance_id = var.grafana_instance_id
    api_token = var.grafana_api_token
  })
}

 

IAM Roles 

Using two separate roles is not optional: it is how ECS separates infrastructure permissions from application permissions.

The execution role is used by the ECS agent before the task starts to pull images from ECR and inject secrets into the task environment. The task role is used by running containers at runtime and is scoped to only what the containers actually need: CloudWatch Logs writes and ECS Exec channel permissions.

The module defines both roles as direct aws_iam_role resources rather than going through the IAM module. This is intentional: the inline policies reference computed ARNs (the SSM parameter and the Secrets Manager secret), which are unknown at plan time. A for_each-based IAM module cannot evaluate its filter condition on unknown values and will fail during terraform plan.

modules/fargate-with-alloy/main.tf — IAM roles 

# Execution role — used by the ECS agent before the task starts
resource "aws_iam_role" "execution" {
  name = "${var.name}-execution"
  assume_role_policy = data.aws_iam_policy_document.ecs_task_assume.json
  tags = merge(local.tags, { Name = "${var.name}-execution" })
}

resource "aws_iam_role_policy_attachment" "execution_managed" {
  role = aws_iam_role.execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

resource "aws_iam_role_policy" "execution_inline" {
  name = "${var.name}-execution-inline"
  role = aws_iam_role.execution.name
  policy = data.aws_iam_policy_document.execution_inline.json
}

# Task role — used by running containers at runtime
resource "aws_iam_role" "task" {
  name = "${var.name}-task"
  assume_role_policy = data.aws_iam_policy_document.ecs_task_assume.json
  tags = merge(local.tags, { Name = "${var.name}-task" })
}

resource "aws_iam_role_policy" "task_inline" {
  name = "${var.name}-task-inline"
  role = aws_iam_role.task.name
  policy = data.aws_iam_policy_document.task_inline.json
}
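
The three policy documents referenced above are not shown in the post. The following is a sketch consistent with how the roles are described, not the repository's exact code: the statement IDs and action lists are assumptions, and the resource references reuse names that appear elsewhere in the module.

```hcl
# Trust policy shared by both roles: only ECS tasks may assume them.
data "aws_iam_policy_document" "ecs_task_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Execution role: read exactly the config parameter and credential secret.
data "aws_iam_policy_document" "execution_inline" {
  statement {
    sid       = "ReadAlloyConfig"
    effect    = "Allow"
    actions   = ["ssm:GetParameters"]
    resources = [aws_ssm_parameter.alloy_config.arn]
  }

  statement {
    sid       = "ReadGrafanaSecret"
    effect    = "Allow"
    actions   = ["secretsmanager:GetSecretValue"]
    resources = [aws_secretsmanager_secret.grafana.arn]
  }
}

# Task role: ECS Exec channels plus log writes to this service's groups.
data "aws_iam_policy_document" "task_inline" {
  statement {
    sid    = "EcsExecChannels"
    effect = "Allow"
    actions = [
      "ssmmessages:CreateControlChannel",
      "ssmmessages:CreateDataChannel",
      "ssmmessages:OpenControlChannel",
      "ssmmessages:OpenDataChannel",
    ]
    resources = ["*"]
  }

  statement {
    sid     = "WriteServiceLogs"
    effect  = "Allow"
    actions = ["logs:CreateLogStream", "logs:PutLogEvents"]
    resources = [
      "${aws_cloudwatch_log_group.app.arn}:*",
      "${aws_cloudwatch_log_group.alloy.arn}:*",
    ]
  }
}
```

If the SecureString parameter were encrypted with a customer-managed KMS key rather than the default aws/ssm key, the execution role would also need kms:Decrypt on that key.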

 

Container Definitions and Task 

The app and Alloy containers are built as locals, then passed to aws_ecs_task_definition. Key wiring points:

 
The app container receives OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 automatically. All containers in a Fargate task share one network namespace, so localhost resolves to the Alloy sidecar without any DNS.
 
The app_environment and app_secrets module variables let you inject additional env vars or SSM/Secrets Manager references into the app container without modifying the module itself.
 
The Alloy container uses a shell wrapper to write the SSM-injected config content to disk before starting Alloy. exec replaces the shell so Alloy receives SIGTERM directly.
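
As an illustration of the app_environment and app_secrets inputs, here is a sketch of a module call that uses both. The element shapes (name/value and name/valueFrom) are inferred from how the module concatenates and passes them in the locals below; the variable values, the image, and the parameter ARN are hypothetical.

```hcl
module "checkout" {
  source             = "./modules/fargate-with-alloy"
  name               = "${var.name}-checkout"
  app_image          = "123456789012.dkr.ecr.us-east-1.amazonaws.com/checkout:v1.2.3"
  app_port           = 8080
  cluster_name       = aws_ecs_cluster.this.name
  subnet_ids         = module.vpc.private_subnet_ids
  security_group_ids = [module.security_group.security_group_ids["ecs_tasks"]]

  grafana_otlp_endpoint = var.grafana_otlp_endpoint
  grafana_instance_id   = var.grafana_instance_id
  grafana_api_token     = var.grafana_api_token

  # Plain environment variables, appended after the module's OTEL_* defaults.
  app_environment = [
    { name = "LOG_LEVEL", value = "info" },
    { name = "OTEL_RESOURCE_ATTRIBUTES", value = "deployment.environment=dev" },
  ]

  # References resolved by ECS before the app container starts.
  app_secrets = [
    {
      name      = "DATABASE_URL"
      valueFrom = "arn:aws:ssm:us-east-1:123456789012:parameter/checkout/database-url"
    },
  ]
}
```

The execution role must be able to read anything listed in app_secrets. Whether the module extends its inline policy for extra references is not shown here, so check that before relying on it.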

modules/fargate-with-alloy/main.tf — containers, task definition, service 

locals {
  app_container = {
    name = var.name
    image = var.app_image
    essential = true
    portMappings = [{ containerPort = var.app_port, protocol = "tcp", name = "http" }]
    environment = concat(
      [
        { name = "OTEL_EXPORTER_OTLP_ENDPOINT", value = "http://localhost:4318" },
        { name = "OTEL_SERVICE_NAME", value = var.name },
      ],
      var.app_environment
    )
    secrets = var.app_secrets
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group" = aws_cloudwatch_log_group.app.name
        "awslogs-region" = local.region
        "awslogs-stream-prefix" = "app"
      }
    }
  }
  alloy_container = {
    name = "alloy"
    image = var.alloy_image
    essential = var.alloy_essential
    secrets = [
      { name = "ALLOY_CONFIG_CONTENT", valueFrom = aws_ssm_parameter.alloy_config.arn },
      { name = "GRAFANA_OTLP_ENDPOINT", valueFrom = "${aws_secretsmanager_secret.grafana.arn}:endpoint::" },
      { name = "GRAFANA_INSTANCE_ID", valueFrom = "${aws_secretsmanager_secret.grafana.arn}:instance_id::" },
      { name = "GRAFANA_API_TOKEN", valueFrom = "${aws_secretsmanager_secret.grafana.arn}:api_token::" },
    ]
    entryPoint = ["/bin/sh", "-c"]
    command = [
      "mkdir -p /etc/alloy && printenv ALLOY_CONFIG_CONTENT > /etc/alloy/config.alloy && exec /bin/alloy run --server.http.listen-addr=0.0.0.0:12345 --stability.level=experimental /etc/alloy/config.alloy"
    ]
    portMappings = [{ containerPort = 12345, protocol = "tcp", name = "alloy-ui" }]
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group" = aws_cloudwatch_log_group.alloy.name
        "awslogs-region" = local.region
        "awslogs-stream-prefix" = "alloy"
      }
    }
  }
}

resource "aws_ecs_task_definition" "this" {
  family = var.name
  requires_compatibilities = ["FARGATE"]
  network_mode = "awsvpc"
  cpu = var.task_cpu
  memory = var.task_memory
  execution_role_arn = aws_iam_role.execution.arn
  task_role_arn = aws_iam_role.task.arn
  container_definitions = jsonencode([
    local.app_container,
    local.alloy_container,
  ])
  tags = merge(local.tags, { Name = var.name })
  depends_on = [
    aws_cloudwatch_log_group.app,
    aws_cloudwatch_log_group.alloy,
    aws_secretsmanager_secret_version.grafana,
  ]
}

resource "aws_ecs_service" "this" {
  name = var.name
  cluster = var.cluster_name
  task_definition = aws_ecs_task_definition.this.arn
  desired_count = var.desired_count
  enable_execute_command = var.enable_execute_command
  capacity_provider_strategy {
    capacity_provider = var.capacity_provider
    weight = 1
    base = 0
  }
  network_configuration {
    subnets = var.subnet_ids
    security_groups = var.security_group_ids
    assign_public_ip = false
  }
  lifecycle {
    ignore_changes = [desired_count]
  }
  tags = merge(local.tags, { Name = var.name })
}
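
The container definitions reference two log groups that the post does not show. A minimal sketch, assuming a "/ecs/<name>/..." naming scheme that may differ from the repository's:

```hcl
# Sketch of the log groups used by the awslogs driver in both containers.
# The name paths are assumptions; the retention input comes from the module variables.
resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/${var.name}/app"
  retention_in_days = var.log_retention_days
  tags              = local.tags
}

resource "aws_cloudwatch_log_group" "alloy" {
  name              = "/ecs/${var.name}/alloy"
  retention_in_days = var.log_retention_days
  tags              = local.tags
}
```

Creating the groups in Terraform, rather than letting the log driver create them, is what lets the retention setting apply and ensures terraform destroy cleans them up.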

 

Step 7: Variables and Outputs

The module surfaces every tuning knob you will realistically need, with sensible defaults so the minimum call only requires name, app_image, app_port, cluster_name, subnet_ids, security_group_ids, and the three Grafana Cloud credentials.

modules/fargate-with-alloy/variables.tf (condensed: semicolons stand in for line breaks here, since real HCL takes one argument per line)

variable "name" { type = string }
variable "app_image" { type = string }
variable "app_port" { type = number }
variable "task_cpu" { type = number; default = 1024 }
variable "task_memory" { type = number; default = 2048 }
variable "cluster_name" { type = string }
variable "subnet_ids" { type = list(string) }
variable "security_group_ids" { type = list(string) }
variable "grafana_otlp_endpoint" { type = string; sensitive = true }
variable "grafana_instance_id" { type = string; sensitive = true }
variable "grafana_api_token" { type = string; sensitive = true }
variable "alloy_image" { type = string; default = "grafana/alloy:v1.16.1" }
variable "alloy_essential" { type = bool; default = true }
variable "alloy_config" { type = string; default = null }
variable "desired_count" { type = number; default = 1 }
variable "capacity_provider" { type = string; default = "FARGATE" }
variable "enable_execute_command" { type = bool; default = true }
variable "log_retention_days" { type = number; default = 14 }
variable "tags" { type = map(string); default = {} }
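
The condensed listing packs each variable onto one line. In an actual .tf file each argument sits on its own line, as in this sketch. The descriptions are assumptions, and so are the types shown for app_environment and app_secrets, which are mentioned in Step 6 but not listed above.

```hcl
variable "task_cpu" {
  description = "Task-level CPU units (1024 = 1 vCPU)"
  type        = number
  default     = 1024
}

variable "alloy_config" {
  description = "Optional Alloy config content; null selects the bundled default"
  type        = string
  default     = null
}

# Assumed shape: matches the name/value maps the module concatenates
# with its own OTEL_* environment entries.
variable "app_environment" {
  description = "Extra environment variables for the app container"
  type        = list(object({ name = string, value = string }))
  default     = []
}

# Assumed shape: matches the ECS container definition "secrets" entries.
variable "app_secrets" {
  description = "Extra SSM or Secrets Manager references for the app container"
  type        = list(object({ name = string, valueFrom = string }))
  default     = []
}
```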

 

modules/fargate-with-alloy/outputs.tf 

output "service_name" { value = aws_ecs_service.this.name }
output "service_id" { value = aws_ecs_service.this.id }
output "task_definition_arn" { value = aws_ecs_task_definition.this.arn }
output "execution_role_arn" { value = aws_iam_role.execution.arn }
output "task_role_arn" { value = aws_iam_role.task.arn }
output "alloy_config_ssm_arn" { value = aws_ssm_parameter.alloy_config.arn }
output "grafana_secret_arn" { value = aws_secretsmanager_secret.grafana.arn }
output "app_log_group" { value = aws_cloudwatch_log_group.app.name }
output "alloy_log_group" { value = aws_cloudwatch_log_group.alloy.name }

 

Step 8: Adding a Second Service

This is the payoff. No new IAM resources to write. No SSM parameters to hand-craft. No task definition JSON to maintain. One module block, terraform apply, done.

main.tf — adding a second service 

# Adding a second service is this simple:
module "checkout" {
  source = "./modules/fargate-with-alloy"
  name = "${var.name}-checkout"
  app_image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/checkout:v1.2.3"
  app_port = 8080
  task_cpu = 1024
  task_memory = 2048
  cluster_name = aws_ecs_cluster.this.name
  subnet_ids = module.vpc.private_subnet_ids
  security_group_ids = [module.security_group.security_group_ids["ecs_tasks"]]
  grafana_otlp_endpoint = var.grafana_otlp_endpoint
  grafana_instance_id = var.grafana_instance_id
  grafana_api_token = var.grafana_api_token
}

 

Each service gets its own isolated set of IAM roles, SSM parameter, Secrets Manager secret, CloudWatch log groups, task definition, and ECS service. Resources are named with the name prefix so nothing collides. Same Grafana Cloud credentials, same cluster, same VPC — every service streams to the same Grafana Cloud stack with its own service label set by OTEL_SERVICE_NAME.

To update the Alloy config for a service without rebuilding its image:

# Edit modules/fargate-with-alloy/config/alloy.alloy (or set alloy_config override)
terraform apply -target=module.demo_app.aws_ssm_parameter.alloy_config

# Force a new task deployment to pick up the new config
aws ecs update-service \
  --cluster fargate-alloy-demo \
  --service fargate-alloy-demo-demo-app \
  --force-new-deployment \
  --region us-east-1
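
Because the task definition references the parameter by ARN, a config-only change does not produce a new task definition revision, which is why the manual redeploy is needed. As a sketch of one way to fold that step into terraform apply, the service could hash the config into the triggers argument of aws_ecs_service. This is a hypothetical change to the module, not something the repository is stated to do.

```hcl
resource "aws_ecs_service" "this" {
  name                   = var.name
  cluster                = var.cluster_name
  task_definition        = aws_ecs_task_definition.this.arn
  desired_count          = var.desired_count
  enable_execute_command = var.enable_execute_command

  # Redeploy whenever the rendered Alloy config changes.
  # triggers only takes effect when force_new_deployment is true.
  force_new_deployment = true
  triggers = {
    alloy_config_sha = sha256(local.alloy_config_content)
  }

  capacity_provider_strategy {
    capacity_provider = var.capacity_provider
    weight            = 1
    base              = 0
  }

  network_configuration {
    subnets          = var.subnet_ids
    security_groups  = var.security_group_ids
    assign_public_ip = false
  }

  lifecycle {
    ignore_changes = [desired_count]
  }

  tags = merge(local.tags, { Name = var.name })
}
```

With this in place a plain terraform apply would roll the service after a config edit. A -target limited to the SSM parameter would still skip the service, so the targeted workflow above keeps its manual step.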

 

Scaling Considerations: One Sidecar Per Task

The sidecar pattern scales horizontally by design — every task replica gets its own Alloy container, and each ships its signals independently. At low replica counts, that is entirely fine. At a high replica count (e.g., 30 or more), it is worth thinking through the implications.

What you get for free. Blast radius is contained. If one task's Alloy instance backs up, crashes, or hits a network partition, it only affects that task's telemetry. There is no shared collector that becomes a single point of failure for the whole service, and you do not need to size or scale a separate infrastructure component.

 

What you are trading away. With a high number of replicas (e.g., 50), you have 50 Alloy processes, each maintaining its own batch buffer, its own OTLP connection to the Grafana Cloud gateway, and its own scrape loop against the ECS Task Metadata endpoint. Each replica also emits its own set of task and container series, so active series grow linearly with replica count. At Grafana Cloud's ingest pricing, that cardinality is usually a bigger cost driver than the raw connection count.

 

The alternative: a centralized collector service. If replica counts are large enough to make the aggregate OTLP connection overhead meaningful, or if you want a single choke point for back-pressure, retry logic, or sampling rules, the standard alternative is to run Alloy as a separate ECS service behind a Service Connect or internal load balancer endpoint, and have all app tasks point OTEL_EXPORTER_OTLP_ENDPOINT at that collector rather than localhost.
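
A rough sketch of what the collector side could look like with Service Connect follows. It is an illustration under stated assumptions, not part of the repository: aws_ecs_task_definition.collector is a hypothetical Alloy-only task definition whose container declares a port mapping named "otlp-http" on 4318, and the namespace and discovery names are made up.

```hcl
# Cloud Map HTTP namespace used by Service Connect (name is an assumption).
resource "aws_service_discovery_http_namespace" "observability" {
  name        = "${var.name}-observability"
  description = "Service Connect namespace for the shared Alloy collector"
}

resource "aws_ecs_service" "collector" {
  name            = "${var.name}-collector"
  cluster         = aws_ecs_cluster.this.name
  task_definition = aws_ecs_task_definition.collector.arn # hypothetical
  desired_count   = 2

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1
  }

  network_configuration {
    subnets          = module.vpc.private_subnet_ids
    security_groups  = [module.security_group.security_group_ids["ecs_tasks"]]
    assign_public_ip = false
  }

  service_connect_configuration {
    enabled   = true
    namespace = aws_service_discovery_http_namespace.observability.arn

    service {
      # Must match the port mapping name in the collector's container definition.
      port_name      = "otlp-http"
      discovery_name = "alloy-collector"

      client_alias {
        port     = 4318
        dns_name = "alloy-collector"
      }
    }
  }
}
```

App services would join the same namespace as Service Connect clients and set OTEL_EXPORTER_OTLP_ENDPOINT to http://alloy-collector:4318, which means changing the module's hard-coded localhost value. The outbound-only security group from Step 2 would also need an ingress rule for port 4318.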

 

A practical threshold. For most teams, the sidecar model is the right default up to several dozen replicas per service. A centralized collector starts paying for itself when the per-replica series and connection overhead becomes visible in your Grafana Cloud bill, or when your platform team wants a single place to apply sampling or routing rules across all services.

 

Gotchas Worth Knowing Before You Ship

 
awsecscontainermetrics is experimental. Pin your Alloy version and pass --stability.level=experimental. The module handles this, but review Alloy release notes before upgrading the image.
 
OTLP target is localhost, not the sidecar's container name. All containers in a Fargate task share one network namespace. The module injects OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 automatically.
 
SSM config injection uses SecureString. The module always stores the Alloy config as a SecureString parameter and injects it through the secrets field. If you swap in your own parameter, keep the same type so the value stays encrypted at rest.
 
alloy_essential = true is the recommended default. When both containers are essential, an Alloy crash stops the whole task and the ECS service replaces it, which is usually preferable to silently dropping metrics. Set alloy_essential = false only if your app's uptime SLA is tighter than your observability SLA.
 
High Alloy storage reads in Fargate. Some teams have reported unexpected disk read patterns (grafana/alloy#4684). Pin your Alloy version and check release notes before upgrading.
 
Container Insights is enabled at the cluster level. This gives you a CloudWatch-native fallback if your Grafana Cloud stack has an outage.
 
Computed ARNs and IAM modules. The module defines IAM roles as direct aws_iam_role resources. The inline policies reference the SSM parameter and Secrets Manager ARNs, which are unknown at plan time, so a for_each-based IAM wrapper will fail during terraform plan.
 
The name prefix must be unique per AWS account. Duplicate names cause apply failures on IAM role creation and Secrets Manager secret creation.

Tearing Down

terraform destroy

 

The module sets a 7-day recovery window on the Secrets Manager secret, so terraform destroy only schedules it for deletion. To re-deploy with the same name immediately, force-delete it:

aws secretsmanager delete-secret --secret-id <arn> --force-delete-without-recovery
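
For throwaway environments, the same effect can be had from Terraform by making the recovery window configurable, since the AWS provider accepts 0 to delete without recovery. This is a sketch of a possible module change; the secret_recovery_window_days variable is hypothetical.

```hcl
variable "secret_recovery_window_days" {
  description = "Days Secrets Manager waits before deleting the secret (0 = delete immediately)"
  type        = number
  default     = 7

  validation {
    condition = (
      var.secret_recovery_window_days == 0 ||
      (var.secret_recovery_window_days >= 7 && var.secret_recovery_window_days <= 30)
    )
    error_message = "Use 0 for immediate deletion, or a value from 7 to 30."
  }
}

resource "aws_secretsmanager_secret" "grafana" {
  name                    = "${var.name}/grafana-cloud"
  description             = "Grafana Cloud OTLP credentials for ${var.name} Alloy sidecar"
  recovery_window_in_days = var.secret_recovery_window_days
  tags                    = local.tags
}
```

The new value has to be applied before terraform destroy runs, because the window is read from the existing resource at deletion time. Keeping the default at 7 preserves the safety net for long-lived environments.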

 

Where This Is Heading

 
Alloy's ECS receiver moving out of experimental. Once stable, the --stability.level=experimental flag drops and the configuration surface shrinks.
 
Amazon ECS Managed Instances giving teams a third capacity option between Fargate and self-managed EC2. The sidecar pattern carries over, though the module as shown pins requires_compatibilities and the capacity provider to Fargate, so both would need adjusting.
 
OpenTelemetry's eBPF instrumentation project (the upstream destination for Grafana Beyla after its CNCF donation). If Fargate ever exposes eBPF hooks, code-free auto-instrumentation becomes viable on serverless containers.

The bigger arc: observability on Fargate has gone from "build your own task definition boilerplate for every service" to "copy one module block and run terraform apply."

Wrapping Up

Clone the fargate-with-alloy-demo repository, fill in your Grafana Cloud credentials in terraform.tfvars, and run terraform apply. You will have task-level CPU, memory, network, and disk metrics flowing into Grafana Cloud within about five minutes. From there it is incremental: add Prometheus scrape jobs via the alloy_config override, wire your app's OTLP SDK to localhost:4318, and add log shipping.

At NextLink Labs, we have helped clients automate this pattern across dozens of services. If you are staring at a hand-rolled CloudWatch setup and wondering if there is a better way, there is — and the on-ramp is shorter than you think.