2026 - Draft case study

Infrastructure observability and reporting platform

A source-controlled observability and reporting platform for a multi-chain web estate, using Prometheus, Grafana, Loki, Tempo, Alloy, synthetic checks, and reviewed infrastructure-as-code.

Business benefits

Turns operations into one readable view
Cuts diagnosis time across APIs and sites
Keeps monitoring off public traffic paths
Escalates only customer-facing risk

Technical highlights

Prometheus jobs with tuned cadences
Grafana joins metrics, logs, and traces
Content-aware RPC and GraphQL probes
R2-backed Loki and Tempo retention

Capabilities

Metrics

Prometheus, Alertmanager, Blackbox Exporter, tuned scrape jobs

Reporting

Grafana dashboards, Explore links, derived fields

Logs and traces

Grafana Alloy, Loki, Tempo, OpenTelemetry, R2 retention

Access

Tailscale, UFW allowlists, sideband metrics plane

Context

Simple Things helped turn FLAYER infrastructure reporting from a series of disparate and siloed data points into one operational view. The work spans version-controlled monitoring configuration, Prometheus scrape design, Grafana dashboards, logs, traces, synthetic checks, and alert routing.

The important product outcome is simple operational oversight. Operators can see whether public sites are reachable, API fleets are healthy, RPC providers are returning real chain data, GraphQL indexes are fresh, and Redis is available, and whether any route is slowing down or any region is behaving differently from the rest.

Operating model

The reporting plane is the heart of the project. Prometheus, Grafana, Loki, Tempo, Alloy, Alertmanager, and synthetic checks sit beside the production services as a sideband observability system. The goal is not to add another dashboard. The goal is one place where an operator can answer whether the product is healthy, where it is slow, and what evidence explains the symptom.

The whole plane is treated as version-controlled configuration rather than console clicks. Scrape jobs, alert rules, dashboards, log collection, and trace collection all live in git. The Loki and Tempo stack deploys through a GitHub Actions workflow that bundles the configuration, ships it to the observability host, and brings the containers up with readiness checks. Prometheus and alert configuration is reviewed in git and pushed to the monitoring host with an in-place reload. That keeps the monitoring plane adjustable and repeatable without turning any server into a hand-edited source of truth.
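The readiness step in that workflow can be pictured as a bounded retry loop. The sketch below is an assumption about its shape, not the project's workflow code: `wait_until_ready` and `http_ready` are hypothetical names, and the attempt count and delay are placeholders.

```python
import time
import urllib.request
from typing import Callable


def http_ready(url: str, timeout_s: float = 2.0) -> bool:
    """Return True when the URL answers with HTTP 200 (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError and socket timeouts are all OSError subclasses
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 10,          # placeholder value, not taken from the project
    delay_s: float = 3.0,        # placeholder value, not taken from the project
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as probe() succeeds, False after `attempts` failures."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay_s)
    return False
```

Assuming Loki and Tempo listen on their default ports and expose their `/ready` endpoints, a deploy step might call `wait_until_ready(lambda: http_ready("http://127.0.0.1:3100/ready"))` for Loki, repeat the call against port 3200 for Tempo, and fail the workflow when either returns False.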

This is especially useful for a multi-chain product estate. A user-facing issue can come from the application, proxy path, index freshness, RPC response quality, Redis availability, host capacity, queue behaviour, or public website reachability. The reporting platform is shaped so those layers can be compared quickly instead of investigated one console at a time.

Prometheus scrape design

Prometheus is treated as a dedicated control point rather than an afterthought inside each service. The configuration groups jobs by purpose: application metrics, Kamal or proxy metrics, node exporters, Redis checks, RPC checks, public website probes, GraphQL freshness checks, and SSL expiry checks.

Scrape cadence is tuned by signal. Fast-moving service and external-response probes run more frequently, and RPC node checks run on a short cadence because chain access is product-critical. Standard app and host exporters run at a normal operational interval, public website blackbox probes run slightly slower, and database and other slow-dependency checks run less often to keep noise down.
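As an illustration of cadence-by-signal, the sketch below models a few jobs and renders them into the `job_name`, `scrape_interval`, and `static_configs` fields that Prometheus scrape configs use. The job names, targets, and every interval are hypothetical; the case study does not publish the real values, and blackbox probe jobs would also need `metrics_path`, `params`, and relabelling, which are left out here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ScrapeJob:
    name: str                 # job purpose, used as the Prometheus job_name
    interval: str             # Prometheus duration string such as "15s" or "1m"
    targets: Tuple[str, ...]  # host:port pairs (hypothetical)


# Hypothetical cadences that only mirror the ordering described in the text.
JOBS = [
    ScrapeJob("external_response_probes", "10s", ("blackbox.internal:9115",)),
    ScrapeJob("rpc_checks", "15s", ("rpc-probe.internal:9115",)),
    ScrapeJob("app_metrics", "30s", ("api-1.internal:9100",)),
    ScrapeJob("public_websites", "45s", ("blackbox.internal:9115",)),
    ScrapeJob("database_checks", "60s", ("db-exporter.internal:9187",)),
]


def interval_seconds(duration: str) -> int:
    """Convert a simple duration such as "15s" or "1m" into seconds."""
    unit = {"s": 1, "m": 60}[duration[-1]]
    return int(duration[:-1]) * unit


def to_scrape_configs(jobs: List[ScrapeJob]) -> List[Dict]:
    """Render jobs into dictionaries shaped like Prometheus scrape_configs entries."""
    return [
        {
            "job_name": job.name,
            "scrape_interval": job.interval,
            "static_configs": [{"targets": list(job.targets)}],
        }
        for job in jobs
    ]
```

Keeping the cadence next to the job purpose makes a change such as slowing down a noisy dependency check a one-line, reviewable diff.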

The health checks are intentionally content-aware. RPC checks perform real JSON-RPC calls rather than only testing TCP reachability. GraphQL checks validate response bodies and index freshness instead of only accepting a 200 response. That matters because the user-facing failure mode is often stale data or a bad upstream response, not a dead server.
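A content-aware RPC check can be sketched as a validator over the response body rather than the socket. The example assumes an EVM-style chain and the standard `eth_blockNumber` JSON-RPC method, which the case study does not name; `rpc_block_number` is a hypothetical helper, not the project's probe code.

```python
def rpc_block_number(body: dict) -> int:
    """Return the block height from an eth_blockNumber response.

    Raises ValueError when the body is not a usable JSON-RPC 2.0 result,
    even if the HTTP request itself succeeded.
    """
    if body.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 response")
    if "error" in body:
        raise ValueError(f"upstream returned an error: {body['error']}")
    result = body.get("result")
    if not isinstance(result, str) or not result.startswith("0x"):
        raise ValueError("result is not a hex quantity")
    height = int(result, 16)  # int() accepts the 0x prefix when base is 16
    if height <= 0:
        raise ValueError("block height must be positive")
    return height
```

A provider that answers with HTTP 200 and an error object, or with an empty result, fails this check even though a TCP or status-code probe would pass.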

Prometheus configuration is managed as an artifact. Changes are reviewed, pushed to the monitoring host, and reloaded in place without restarting Prometheus.
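The case study does not say which reload mechanism is used. Prometheus supports two: sending SIGHUP to the process, or an HTTP POST to `/-/reload` when it is started with `--web.enable-lifecycle`. A minimal sketch of the HTTP variant, assuming that flag is enabled and using a hypothetical local address in the test:

```python
import urllib.request


def build_reload_request(base_url: str) -> urllib.request.Request:
    """Build (but do not send) the POST that asks Prometheus to reload its config.

    Assumes Prometheus was started with --web.enable-lifecycle; without that
    flag the endpoint is disabled.
    """
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url: str, timeout_s: float = 5.0) -> bool:
    """Send the reload request and report whether Prometheus accepted it."""
    request = build_reload_request(base_url)
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # covers connection errors and non-2xx HTTPError responses
        return False
```

Validating the file with `promtool check config` before the push is a sensible companion step, since a failed reload leaves the previous configuration running.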

Visual reporting

Grafana is the operator surface. The dashboards are arranged around glance rows first: public origins, proxy targets, app targets, Redis, request rate, 4xx percentage, 5xx percentage, p95 latency, and host capacity. That lets someone decide quickly whether they are looking at a site problem, an API problem, a dependency problem, or a capacity problem.

Deeper rows turn that glance into diagnosis. Panels break traffic down by route, status, instance, and region. Slow routes are ranked. Dependency latency and dependency errors are separated from application latency. Redis internals and host resources sit beside request metrics so the dashboard can show whether symptoms line up across layers.
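To make the p95 and slow-route panels concrete, the sketch below estimates a quantile from cumulative histogram buckets by linear interpolation, in the same spirit as the PromQL `histogram_quantile` function, and then ranks routes by the result. The route names and bucket counts are invented for illustration, and the real panels presumably compute this in PromQL rather than Python.

```python
import math
from typing import Dict, List, Tuple

Bucket = Tuple[float, float]  # (upper bound in seconds, cumulative count)


def quantile_from_buckets(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from non-empty, sorted cumulative buckets.

    The last bucket is expected to be the +Inf bucket holding the total count.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # fall back to the highest finite bound
            span = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = bound, count
    return prev_bound


def rank_slowest(routes: Dict[str, List[Bucket]], q: float = 0.95) -> List[Tuple[str, float]]:
    """Return (route, estimated quantile) pairs, slowest first."""
    scored = [(route, quantile_from_buckets(q, b)) for route, b in routes.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Because the estimate interpolates inside a bucket, its precision depends on bucket boundaries, which is one reason latency buckets are worth choosing around the thresholds that matter to alerts.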

Service-specific dashboards go further where needed. For the Web2 API, the reporting surface includes RPC operation health, dependency latency, dependency errors, queue behaviour, Node runtime signals, and request geography from Cloudflare country context. That makes the dashboard useful for product and operations, not only infrastructure.

The dashboard language is deliberately operator-friendly. It favours names, rows, and links that map to decisions: are origins up, are app targets up, which routes are slow, where are 5xxs coming from, are dependencies slow, and what changed around the same time.

Logs, traces, and evidence

Grafana Alloy collects structured Docker container logs and labels them by service, environment, region, instance, and container before shipping them to Loki. The retention model uses R2-backed long-term storage with a compacted index and bounded retention, giving enough time for incident review without turning logs into an unmanaged archive.
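The labelling step is easier to picture with the JSON body of Loki's push API (`POST /loki/api/v1/push`), where each stream carries a label set and a list of timestamp and line pairs. In this project Alloy does the shipping, so the sketch only shows the shape of one labelled stream; the helper name and the label values in the test are hypothetical.

```python
from typing import Dict

# The label keys named in the text; the values come from the host and container.
REQUIRED_LABELS = ("service", "environment", "region", "instance", "container")


def loki_push_payload(line: str, labels: Dict[str, str], ts_ns: int) -> dict:
    """Build a Loki push body holding one log line in one labelled stream."""
    missing = [key for key in REQUIRED_LABELS if key not in labels]
    if missing:
        raise ValueError(f"missing labels: {', '.join(missing)}")
    return {
        "streams": [
            {
                "stream": dict(labels),
                # Loki expects the timestamp as a string of Unix nanoseconds.
                "values": [[str(ts_ns), line]],
            }
        ]
    }
```

Keeping the label set small and bounded, as here, matters for Loki because each distinct label combination becomes its own stream.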

Application traces flow over OTLP/HTTP into host-level Alloy, then into Tempo with R2-backed retention. Slow request and slow dependency thresholds are emitted as first-class signals, so traces, logs, and metrics describe the same operational event from different angles.

The Grafana experience connects those angles. Derived fields turn trace IDs in logs into links to traces. Trace views link back to the matching logs. Traces can also link to relevant metrics over the same time window. The result is a practical investigation path: start from a dashboard symptom, jump to logs or traces, and come back to the metric context without manually rebuilding filters.
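A derived field is essentially a regular expression applied to each log line, with the captured value used to build a link. The sketch below shows that extraction step on its own. It assumes structured JSON logs with a `trace_id` field (a hypothetical field name) holding a W3C-style trace ID of 32 lowercase hex characters.

```python
import re
from typing import Optional

# Hypothetical log field name; the 32-hex-character shape follows W3C Trace Context.
TRACE_ID_RE = re.compile(r'"trace_id"\s*:\s*"([0-9a-f]{32})"')


def trace_id_from_log(line: str) -> Optional[str]:
    """Return the trace ID captured from a log line, or None if there is none."""
    match = TRACE_ID_RE.search(line)
    return match.group(1) if match else None
```

In Grafana the same pattern would sit in the Loki data source's derived field configuration, with the captured value passed to the Tempo data source as the query.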

This is the difference between collecting telemetry and using telemetry. The system is designed so an operator can move from a high-level graph to the relevant evidence and back again while preserving service, region, instance, and time context.

Synthetic checks

The front-of-house checks cover public websites separately from backend health endpoints. Blackbox probes answer whether users can reach the public surfaces. Service health endpoints answer whether the app believes it is ready. RPC and GraphQL probes answer whether critical data paths are returning useful content.

GraphQL response checks are more than uptime probes. They issue a real query and validate body-level freshness so a page does not look healthy while serving stale indexed data. RPC probes perform real chain calls so provider or regional problems become visible before a product team has to infer them from user reports.
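A body-level freshness check can be sketched as a function over the parsed GraphQL response. The response shape here is an assumption: it follows the `_meta { block { timestamp } }` convention used by subgraph-style indexers, which the case study does not confirm, and the five-minute lag limit is a placeholder.

```python
def index_lag_seconds(body: dict, now_s: float) -> float:
    """Return how far the indexed head is behind `now_s`, in seconds.

    Raises ValueError when the body has GraphQL errors or lacks the
    (assumed) data._meta.block.timestamp field.
    """
    if body.get("errors"):
        raise ValueError("GraphQL response contains errors")
    try:
        indexed_at = int(body["data"]["_meta"]["block"]["timestamp"])
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError("response has no usable block timestamp") from exc
    return now_s - indexed_at


def is_fresh(body: dict, now_s: float, max_lag_s: float = 300.0) -> bool:
    """True only when the body is valid and the index is within max_lag_s."""
    try:
        return index_lag_seconds(body, now_s) <= max_lag_s
    except ValueError:
        return False
```

An endpoint that returns HTTP 200 with an old timestamp, a null `data` field, or an `errors` array is reported as unhealthy, which is the failure mode a plain uptime probe would miss.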

SSL expiry is tracked as its own operational signal with warning and urgent windows. That sort of check is simple, but it prevents a class of avoidable production incidents that can otherwise sit outside application monitoring.
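The two windows amount to a small classification over the time left on a certificate. In the sketch below the 21-day and 7-day thresholds are hypothetical, since the case study does not state the real ones. In Prometheus the input would typically come from the Blackbox Exporter's `probe_ssl_earliest_cert_expiry` metric, which reports the expiry as a Unix timestamp.

```python
from datetime import datetime, timedelta, timezone


def expiry_severity(
    not_after: datetime,
    now: datetime,
    warning_days: int = 21,  # hypothetical warning window
    urgent_days: int = 7,    # hypothetical urgent window
) -> str:
    """Classify a certificate as "ok", "warning", or "urgent" by time to expiry."""
    remaining = not_after - now
    if remaining <= timedelta(days=urgent_days):
        return "urgent"  # includes certificates that have already expired
    if remaining <= timedelta(days=warning_days):
        return "warning"
    return "ok"
```

Two windows give the warning level time to be handled in normal working hours, leaving the urgent level for cases where renewal has actually stalled.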

Alerting and escalation

Alertmanager separates warning, critical, and page-level routes. The rules focus on symptoms that matter: sustained 5xx rate, high p95 latency, service quorum loss, host capacity risk, RPC failure rate, all-RPC failure, queue backlog, stalled work, and SSL expiry windows.
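The three routes can be pictured as first-match routing on a `severity` label. This is a simplified model of an Alertmanager route tree, not the project's configuration: the receiver names are hypothetical, and real routes also carry grouping and repeat settings that are omitted here.

```python
from typing import Dict, List, Tuple

Route = Tuple[Dict[str, str], str]  # (label matchers, receiver name)

# Hypothetical receivers; real names and destinations are not published.
ROUTES: List[Route] = [
    ({"severity": "page"}, "oncall-page"),
    ({"severity": "critical"}, "ops-critical"),
    ({"severity": "warning"}, "ops-warning"),
]


def pick_receiver(
    labels: Dict[str, str],
    routes: List[Route] = ROUTES,
    default: str = "ops-default",  # hypothetical fallback receiver
) -> str:
    """Return the receiver of the first route whose matchers all match."""
    for matchers, receiver in routes:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return default
```

Under this model a rule decides how loud an alert is simply by the `severity` label it attaches, so the routing tree itself rarely needs to change.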

The escalation model is designed to reduce noise. A single transient probe should not wake someone up. A sustained 5xx rate, loss of quorum, all providers failing, or an expiry window that is close enough to create customer impact should. Receiver URLs and phone-routing details are deliberately left out of this case study.
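The difference between a transient blip and a sustained symptom can be sketched as a hold-duration check, similar in spirit to the `for:` clause on a Prometheus alerting rule. The function name, threshold, and durations below are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, observed value)


def fires(samples: List[Sample], threshold: float, for_s: float) -> bool:
    """True when the value has stayed above threshold for at least for_s seconds.

    `samples` must be in ascending time order. Any sample at or below the
    threshold resets the breach, so a single spike never fires on its own.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_s
```

With a 5% threshold held for five minutes, one bad scrape between two healthy ones stays silent, while seven consecutive bad minutes escalates.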

Alerts also line up with the dashboard structure. If a rule fires for p95, 5xx rate, RPC failure, queue backlog, or capacity, there is a matching reporting view that helps explain the alert rather than forcing an operator to start from a blank Explore tab.

Access and deployment discipline

The observability plane is deliberately separate from public traffic. Metrics access is allowlisted at the firewall, human access travels through private access paths, and public services do not need to expose their internal metrics endpoints directly.

That separation is important. It lets Prometheus reach the signals it needs while keeping metrics endpoints away from normal customer routes. It also means operational tooling can evolve without increasing the public attack surface of every application.
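The allowlist idea reduces to a membership test on the source address. The sketch below is a model of that decision rather than the firewall rules themselves: the Prometheus host address is hypothetical, and 100.64.0.0/10 is included on the assumption that private access uses Tailscale's default IPv4 range.

```python
import ipaddress
from typing import Iterable, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# Hypothetical allowlist: one monitoring host plus the tailnet address range.
ALLOWED_SOURCES = (
    ipaddress.ip_network("10.0.0.5/32"),    # hypothetical Prometheus host
    ipaddress.ip_network("100.64.0.0/10"),  # range Tailscale assigns IPv4 addresses from
)


def may_reach_metrics(source_ip: str, allowed: Iterable[Network] = ALLOWED_SOURCES) -> bool:
    """Return True when the source address falls inside an allowlisted network."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in allowed)
```

Everything outside the list is denied by default, so exposing a new exporter never requires opening it to public traffic.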

The deploy model favours reviewed configuration and controlled reloads. Dashboards, rules, scrape configs, log collection, and trace collection are treated as operational assets rather than clicks in a console. That is what makes the reporting system maintainable as the platform changes.
