8th June 2026

Building one pane of glass for a multi-chain web platform

How we gave a small team complete operational visibility — metrics, logs, traces, and synthetic checks — across sixteen API hosts, eight RPC endpoints, and three public websites, all flowing into a single Grafana they could actually read at 3am.

When we picked this up, the platform was already busy. A memecoin launch API running sixteen application hosts across two regions. A second API on its own pair of boxes. A third single-host service. Three public websites. Our own Ethereum, Base, and Arbitrum RPC nodes feeding a fleet of Graph indexers. Plenty of moving parts — and almost no way to see them.

The brief was the kind we like: make it calm. When something breaks, the on-call engineer should know within a minute, know which of the moving parts is at fault, and be able to drill from "users are getting errors" down to the exact failed request — its trace, its logs, its status code — without guessing. No heroics, no tribal knowledge, no SSHing into sixteen boxes to grep.

This is how we built it, and the decisions we'd make the same way again.

The mental model: three pillars, one box, sideband everywhere

The whole design rests on a single idea: observability is sideband. Production traffic never depends on it. Every host ships its metrics, logs, and traces out to a separate plane, and if that plane disappears, the site keeps serving — you just go temporarily blind. That distinction shaped everything downstream, including how aggressively we were willing to centralise.

The three pillars are the standard ones, deliberately kept boring:

Metrics — Prometheus scraping every host and service on a tight interval.
Logs — structured application logs collected by Grafana Alloy, shipped to Loki.
Traces — OpenTelemetry spans over OTLP into Tempo.

And then one Grafana stitches them together so a human can move between all three in a couple of clicks. The cleverness isn't in any single pillar — it's in the links between them. More on that later.

Prometheus: tight intervals, honest health checks

Prometheus runs on a dedicated host and scrapes roughly two dozen jobs on a 15-second global interval, with evaluation of alert rules on the same cadence. We tuned the interval per job rather than accepting one number for everything:

5 seconds for the things that fail fastest and matter most — the Graph response check and a couple of external-data probes.
12 seconds for RPC nodes (erigon, reth, op-node, op-reth, nitro), scraped on their /debug/metrics/prometheus endpoints.
15 seconds for application metrics, Kamal proxies, and node exporters.
30 seconds for website blackbox probes.
60 seconds for Postgres and the slower RPC synthetic checks.

The jobs themselves cover the full stack: node-exporter across every box for host capacity; per-service application jobs (flaunch-v2-app-prod, slabcash-app-prod, flaunch-web2-api-prod and its worker, and so on) on a dedicated :9464 metrics port; the Kamal proxy fronting each service; Redis, nginx, HAProxy, Postgres, and the Graph nodes.

Two details we care about:

Health checks tell the truth. It's easy to mark a backend "up" because a TCP port answered. We didn't. The RPC layer's checks issue a real eth_blockNumber JSON-RPC probe and only count a node healthy if it returns a proper result. A node that's accepting connections but has fallen behind the chain head is not up, and we wanted our monitoring to say so.
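
As a rough illustration of that check, here is a minimal Python sketch of the validation step. It is not the production probe: the function name `rpc_probe_healthy` and the rule that a height of zero counts as unhealthy are assumptions made for the example.

```python
import json

# Standard JSON-RPC request body for the probe described above.
PROBE_PAYLOAD = {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}


def rpc_probe_healthy(body: str) -> bool:
    """Return True only if `body` is a JSON-RPC reply carrying a usable block number."""
    try:
        reply = json.loads(body)
    except (ValueError, TypeError):
        return False  # not JSON at all, e.g. a proxy error page
    if not isinstance(reply, dict) or "error" in reply:
        return False  # the node answered, but with a JSON-RPC error
    result = reply.get("result")
    if not isinstance(result, str) or not result.startswith("0x"):
        return False  # eth_blockNumber returns a hex-encoded quantity
    try:
        return int(result, 16) > 0  # assumption: height 0 means "not synced"
    except ValueError:
        return False
```

This sketch only confirms that the reply is well-formed. Catching a node that answers but lags the chain head would also need the returned height compared against a reference, which is not shown here.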

Scrape access is locked down. Every application host UFW-allowlists exactly two addresses on its metrics port: the Prometheus host's public IP, and its Tailscale address. Metrics are operational data; they don't belong on the open internet. Everything human — SSH, config deploys — happens over Tailscale only.

A note on the operational discipline, because it's the kind of thing that bites you: Prometheus on this estate is reloaded with killall -HUP prometheus, not a service restart, and config changes are pushed by scp to the host — never a git pull on the server. Small rules, written down, that keep a shared critical box predictable.

Logs: structured, labelled, and stored where storage is cheap

Logs are collected by Grafana Alloy running as a native service on each application host. Alloy discovers the running Docker containers over the local socket, filters them to the ones we care about with a per-host container regex, and ships their stdout to Loki. Crucially, it attaches a consistent label set on the way out:

service   = flaunch-v2-api | slabcash-api | flaunch-web2-api
env       = prod
region    = eu | us
instance  = <hostname>
container = <docker container name>

That label discipline is what makes the logs queryable rather than just stored. From Grafana's Explore tab, an engineer narrows to a service and environment and then filters structured fields:

logql

{service="flaunch-v2-api", env="prod"} |= "error" | json | status >= 500

That returns the 5xx server errors for one service, parsed out of the structured log line. Or pivot the other way and search every service at once for a single request ID a user reported:

logql

{env="prod"} |= "abc123-request-id-here"
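
The same queries can be run outside Grafana through Loki's HTTP API. The sketch below builds a label selector and a `query_range` URL in Python. The base URL `http://loki.example:3100` is a placeholder and the helper names are hypothetical; only the `/loki/api/v1/query_range` path and its `query`, `start`, `end`, and `limit` parameters come from Loki itself.

```python
from urllib.parse import urlencode


def selector(labels: dict) -> str:
    """Render a label dict as a LogQL stream selector."""
    return "{" + ", ".join(f'{key}="{value}"' for key, value in labels.items()) + "}"


def loki_query_url(base_url: str, logql: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a query_range URL; start and end are Unix timestamps in nanoseconds."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Example: search every prod service for one request ID (placeholder host and times).
query = selector({"env": "prod"}) + ' |= "abc123-request-id-here"'
url = loki_query_url("http://loki.example:3100", query, 1_000_000_000, 2_000_000_000)
```

Because every stream carries the same label names, one helper like this covers all three services.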

Storing and rotating logs in Cloudflare R2

Here's a decision we're particularly happy with. Loki's long-term storage backend is Cloudflare R2.

Loki speaks the S3 API, and R2 implements it, with one enormous difference that matters for a log store: R2 has no egress fees. Logs are write-heavy and, mercifully, read-rarely; you pour gigabytes in and only pull a sliver back out when you're debugging. On a conventional cloud object store, the occasional wide log query can produce a surprising egress bill. On R2, egress is free. We pay for what we store, plus R2's per-request operation charges, and nothing for the bytes a query pulls back.

The mechanics:

Loki uses the TSDB schema (v13) with a 24-hour index period.
Chunks and index live in an R2 bucket (flayer-loki-prod), addressed through the S3-compatible endpoint with path-style addressing and TLS on.
The host keeps only a small local working set — index cache, compactor scratch — under /loki. The durable data is all in R2.

Rotation is automatic and enforced by the compactor. We set a retention period — 14 days (336h) for logs — and Loki's compactor deletes expired chunks against the R2-backed delete store on its own schedule. No cron jobs trimming disk, no logrotate configs drifting per host, no 2am "disk full" page because a log volume filled up. Retention is one config value, applied centrally, honoured globally.

Traces: from "something's slow" to the exact span

Application services are instrumented with OpenTelemetry, emitting spans over OTLP/HTTP. The path is deliberately short: each app container exports to the Alloy instance on its own host (reached over the Kamal Docker bridge), Alloy batches and forwards to Tempo, and Tempo writes blocks to — again — Cloudflare R2 (flayer-tempo-prod), with a 7-day retention.

Traces are the most expensive pillar to keep and the one you read least often, which makes the no-egress R2 backing even more attractive here than for logs.

The OpenTelemetry config carries two thresholds worth calling out, because they turn raw spans into signal:

SLOW_REQUEST_MS    = 1000
SLOW_DEPENDENCY_MS = 250

A request over a second, or a downstream dependency call (RPC, Redis, an external API) over 250ms, gets flagged as slow — so "slow dependencies" becomes a thing you can list and alert on, not something you have to eyeball across thousands of normal spans.
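
To make the rule concrete, here is a small Python sketch of the classification. The span shape (a dict with `kind`, `name`, and `duration_ms`) is an assumption for the example, not the real OpenTelemetry data model, and the real services apply these thresholds inside their own instrumentation.

```python
SLOW_REQUEST_MS = 1000
SLOW_DEPENDENCY_MS = 250


def is_slow(kind: str, duration_ms: float) -> bool:
    """Flag a span as slow. `kind` is "request" or "dependency" (hypothetical values)."""
    threshold = SLOW_REQUEST_MS if kind == "request" else SLOW_DEPENDENCY_MS
    return duration_ms > threshold


def slow_dependencies(spans: list) -> list:
    """Return the names of dependency spans that crossed the slow threshold."""
    return [
        span["name"]
        for span in spans
        if span["kind"] == "dependency" and is_slow(span["kind"], span["duration_ms"])
    ]
```

A 300ms Redis call is flagged while a 300ms request is not, which is what lets the slow-dependency list stay short.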

The links between the pillars — where it actually pays off

Anyone can stand up three databases. The value is in the joins, and Grafana's derived fields are how we made them.

Logs → traces. Every structured log line carries a trace_id. We configured the Loki datasource with a derived field that recognises that ID ("trace_id":"([0-9a-f]+)") and renders it as a button linking straight into Tempo. An engineer reading a 5xx log line clicks the trace ID and is looking at the full waterfall for that exact request — every span, every service it touched, every DB and RPC call, and precisely where the time or the error went.
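
The matching step is easy to try locally. This Python sketch applies the same regular expression to a sample log line; the sample line and the function name `extract_trace_id` are made up for illustration.

```python
import re

# Same pattern as the derived field: a JSON key/value pair with a hex trace ID.
TRACE_ID_RE = re.compile(r'"trace_id":"([0-9a-f]+)"')


def extract_trace_id(log_line: str):
    """Return the trace ID found in a structured log line, or None if absent."""
    match = TRACE_ID_RE.search(log_line)
    return match.group(1) if match else None


sample = '{"level":"error","status":502,"trace_id":"4bf92f3577b34da6a3ce929d0e0e4736"}'
trace_id = extract_trace_id(sample)
```

One thing the pattern assumes is compact JSON: a serializer that writes `"trace_id": "..."` with a space after the colon would not match, and the link would silently disappear.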

Traces → logs. The reverse works too. Tempo's trace-to-logs is wired back to Loki (with a 15-minute window either side of the span) so from any span you can jump to the surrounding log context.
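
The window itself is simple arithmetic. A minimal Python sketch, assuming the span's start and end are available as timezone-aware datetimes (the function name is hypothetical):

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=15)


def log_window(span_start: datetime, span_end: datetime):
    """Return the (from, to) range of logs to show around a span."""
    return span_start - WINDOW, span_end + WINDOW


start = datetime(2026, 6, 8, 3, 0, tzinfo=timezone.utc)
end = start + timedelta(seconds=2)
window_from, window_to = log_window(start, end)
```

A generous window costs little here, because the label selector already narrows the search to one service.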

Traces → metrics. And trace-to-metrics links spans back to the relevant Prometheus series.

The result is the workflow that the whole project was really about. "Users are getting errors" becomes: open the dashboard → see the 5xx rate is up → scroll to the embedded Recent 5xx Requests log panel → click a line → click the trace ID → land in the trace waterfall → see the failing span. A minute, maybe two, and no SSH.

Dashboards: emergencies at a glance, deep dives one click away

We built the dashboards in two registers — the glance and the dig — because those are two different jobs.

The glance. Every service dashboard opens with a status row designed to be readable across a room: Origins Up, Proxy Targets Up, App Targets Up, Redis Up, request rate, 4xx %, 5xx %, P95 latency. Green is fine, anything else is a question. Below that, host capacity (CPU, RAM, disk, load) per instance, traffic broken down by route / status code / instance / region, the Top 5 Slowest Routes table, and Redis internals (used vs maxmemory, connected clients, commands/sec). The Web2 API, which leans hard on the blockchain, adds panels we don't need elsewhere: RPC operations by outcome, dependency latency and error rates, the memecoin job queue (waiting vs active vs throughput by status), and Node.js runtime health — event-loop lag, process memory, open file descriptors. It even plots request geography from the Cloudflare cf_country header that rides through in the logs.

The dig. Every dashboard has an "Explore API logs" link at the top that opens Explore pre-filtered to that service and time range, plus embedded panels — Recent 5xx, Recent Slow Dependencies, Recent Unmatched Requests — that turn the dashboard itself into the first step of an investigation. (Unmatched requests are a quietly useful signal: a surge of traffic hitting no known route usually means someone's scanning you.)

Access is Google SSO, the same identity the team already uses, so there's no separate credential to manage or lose.

Watching the front of the house — not just "is it 200?"

This was the part of the brief we cared about most, because it's the part most monitoring gets lazily wrong. A website returning 200 OK is not the same as a website that works. An app shell can render, return 200, and be backed by a dead API. So we monitor in layers.

Layer 1 — is it reachable? A blackbox exporter probes the three public sites (flaunch.gg, takeover.fun, slab.cash) every 30 seconds with the http_2xx module, recording success, status code, response duration, and certificate expiry.

Layer 2 — is the API actually alive behind it? We don't just probe the homepage. We probe dedicated health endpoints — e.g. takeover.fun/api/health — as their own targets. The site being up and its API being up are two separate facts, alerted separately, so a backend outage can't hide behind a happy front end.

Layer 3 — is it returning the right data? This is the bit we're proudest of. For the Graph query layer — the data backbone — a plain 200 is meaningless; what matters is whether it's returning fresh, valid blockchain state. So we wrote a custom blackbox module, http_graph_response, that fires a real GraphQL query every 5 seconds against both the EU and US Graph endpoints and validates the body: that _meta.block comes back with a real hash, number, parent hash, and timestamp. If the Graph stops advancing or starts returning garbage, we know in seconds — long before a user notices stale data. That's the difference between "the server answered" and "the server is correct."
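
The real check lives in the blackbox exporter module, but the validation logic can be sketched in Python. Everything named here is an assumption for illustration: the exact query text, the `parentHash` field name, the 32-byte hash length check, and the `max_age_s` freshness limit of 120 seconds are not taken from the production config.

```python
# Hypothetical query; field names follow The Graph's _meta block type.
GRAPH_META_QUERY = "{ _meta { block { hash number parentHash timestamp } } }"


def _is_hash(value) -> bool:
    """True for a 0x-prefixed 32-byte hex string."""
    return isinstance(value, str) and value.startswith("0x") and len(value) == 66


def graph_meta_valid(body: dict, now_s: int, max_age_s: int = 120) -> bool:
    """Validate a decoded GraphQL response: all four fields present and the block recent."""
    try:
        block = body["data"]["_meta"]["block"]
    except (KeyError, TypeError):
        return False  # missing or null somewhere along the path
    if not isinstance(block, dict):
        return False
    hashes_ok = _is_hash(block.get("hash")) and _is_hash(block.get("parentHash"))
    number, timestamp = block.get("number"), block.get("timestamp")
    numbers_ok = (
        isinstance(number, int) and number > 0
        and isinstance(timestamp, int) and timestamp > 0
    )
    # Assumption: a block older than max_age_s means the indexer has stopped advancing.
    return hashes_ok and numbers_ok and (now_s - timestamp) <= max_age_s
```

Checking the timestamp against the clock is what separates "valid but stale" from "valid and current" in this sketch.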

Certificates, too. The same probes track SSL expiry, with alerts that warn at 7 days out and page at 24 hours. Nobody on this team will ever be the person who let a cert lapse on a production domain.
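
Those two thresholds reduce to a small decision function. A Python sketch, assuming the time left is computed elsewhere (for example from the blackbox exporter's `probe_ssl_earliest_cert_expiry` timestamp minus the current time); the function name is hypothetical:

```python
WARN_SECONDS = 7 * 24 * 3600   # warn at 7 days out
PAGE_SECONDS = 24 * 3600       # page at 24 hours


def cert_alert(seconds_left: float):
    """Return the alert severity for a certificate, or None if it is not near expiry."""
    if seconds_left <= PAGE_SECONDS:
        return "page"
    if seconds_left <= WARN_SECONDS:
        return "warning"
    return None
```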

Alerting: the right noise, to the right place, at the right severity

Metrics and dashboards are worthless if the right person doesn't get woken up. Alerts run through Alertmanager with three severities, deliberately routed to two tiers:

warning / critical → a Discord channel, mentioning the responsible team. The stuff you want to know about and look at, but that isn't on fire.
page → Discord and an SMS via a self-hosted Sachet receiver wired to Twilio. Reserved for the genuine "a customer-facing thing is down right now" alerts — a public site unreachable, an SSL cert hours from expiry, a service below quorum.
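
In practice this routing lives in Alertmanager config. As a plain illustration of the mapping, here is a Python sketch with hypothetical receiver names:

```python
def receivers_for(severity: str) -> list:
    """Map an alert severity to the channels it is delivered to."""
    if severity == "page":
        return ["discord", "sms"]  # SMS goes via the Sachet receiver to Twilio
    if severity in ("warning", "critical"):
        return ["discord"]
    raise ValueError(f"unknown severity: {severity!r}")
```

Rejecting unknown severities outright is a choice made for this sketch, so that a typo in a rule label fails loudly instead of being dropped.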

The rules encode real SLOs rather than vibes. A few representative ones:

5xx rate > 2% for 5 minutes → critical.
P95 latency > 1s (2s on the heavier Web2 API) for 10 minutes → warning.
Service below quorum (fewer than half the instances of a service up) → page. Losing one of four hosts is a warning; losing the service is an emergency, and the alerts know the difference.
Host capacity laddered: CPU > 85% warns, > 95% goes critical; disk > 85% warns, > 92% critical — so you get a nudge with time to act before you get a crisis.
RPC failure rate > 10%, or all RPC calls failing, → critical on the service that depends on the chain — because for that API a dead RPC is a dead API.
Queue backlog and stalls — jobs waiting with no active worker → critical; that's a launch pipeline grinding to a halt.
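
Two of those rules, quorum and the capacity ladder, can be sketched as Python functions. This is an illustration of the thresholds only: the real rules are Prometheus alert expressions, the "for N minutes" durations are not modelled, and treating any missing instance above quorum as a warning is an assumption drawn from the one-of-four example.

```python
def quorum_severity(up: int, total: int):
    """Severity for a service given how many of its instances are up."""
    if total <= 0:
        raise ValueError("total must be positive")
    if up * 2 < total:
        return "page"      # fewer than half the instances are up
    if up < total:
        return "warning"   # degraded, but still at or above quorum
    return None


def ladder(value: float, warn_above: float, critical_above: float):
    """Two-step threshold ladder, e.g. CPU (85, 95) or disk (85, 92)."""
    if value > critical_above:
        return "critical"
    if value > warn_above:
        return "warning"
    return None
```

For a four-host service, `quorum_severity(3, 4)` gives a warning and `quorum_severity(1, 4)` pages, which matches the distinction the rules are meant to draw.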

Every alert carries a service label that routes it to the team that owns it, so the noise lands where it's actionable and nowhere else.

What we'd underline

A few principles did the heavy lifting, and they're portable to almost any platform:

Keep observability sideband. If the monitoring plane dies, production must not. That single rule lets you centralise aggressively without adding risk.
Probe behaviour, not just availability. A 200 is the weakest possible signal. Health endpoints and content-validating synthetic checks are what actually tell you the system works.
Store cold data where egress is free. Logs and traces are write-heavy and read-rare. Cloudflare R2's zero egress turns retention from a budget conversation into a config value.
Invest in the joins. One trace ID, clickable from a log line into a full waterfall, is worth more than a dozen extra panels.
Match severity to consequence. Two tiers — Discord for "look at this," SMS for "wake up" — keeps people responsive instead of numb.
