The Hidden Cost of One-Size-Fits-All Monitoring Aggregation
Here's a number that probably makes your database administrator twitch: a single uptime monitor that pings once per hour, aggregated into minute-level buckets for an entire month, generates 43,200 rows — 1,440 per day, most of which represent the same information: "no ping arrived, as expected."
Multiply that by a few hundred monitors and you've got millions of rows of meaningless aggregates. And the cruel irony is that none of them are even useful: a minute-level chart of an hourly monitor is mostly empty, and the "expected pings per minute" value is a fraction that no user interface has ever rendered sensibly.
This is what happens when a monitoring system applies the same aggregation strategy to every monitor regardless of how often it actually reports. It's a problem we hit head-on while building OpShift, and the fix reshaped how we think about time-series rollups entirely.
Why Aggregation Matters in Monitoring
Before we get into the fix, it's worth being explicit about why you aggregate pings in the first place.
Raw pings are great for the last few hours: you can see individual successes and failures, compute exact response times, reconstruct timelines. But raw pings don't scale for charts. If you want to render a 30-day uptime chart with one point per day, you don't want your database to scan 2.5 million raw pings every time someone opens the dashboard.
So you pre-aggregate. You take the raw pings and compute rolled-up stats at various granularities — per minute, per hour, per day — and store them as cheap-to-query summaries. Charts hit the aggregates; forensics hit the raw data.
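As a minimal sketch of what "rolled-up stats" means here (the field names are illustrative, not OpShift's actual schema), a summary row is just counters computed once over the raw pings in a bucket:

```typescript
interface RawPing { at: Date; ok: boolean; responseMs: number; }

interface AggregateBucket {
  upPingCount: number;
  downPingCount: number;
  avgResponseMs: number | null; // null when the bucket has no successful pings
}

// Roll a set of raw pings up into one cheap-to-query summary row.
function summarize(pings: RawPing[]): AggregateBucket {
  const up = pings.filter(p => p.ok);
  return {
    upPingCount: up.length,
    downPingCount: pings.length - up.length,
    avgResponseMs: up.length
      ? up.reduce((s, p) => s + p.responseMs, 0) / up.length
      : null,
  };
}
```

Charts then read one row per bucket; only forensics ever touch the raw pings again.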
The standard approach is to build every level of aggregation for every monitor. Minute, 5-minute, hour, day — the full cascade, for all of them. It's tidy. It's uniform. And it wastes a staggering amount of storage and compute on monitors that don't need it.
The Fractional Expected-Pings Problem
The first sign that something was wrong in our original implementation wasn't a database bill. It was a rendering bug.
A user reported that their hourly monitor's chart was showing an "expected pings" value of 0.02. Which is technically correct — at 60-minute intervals, you expect 1/60th of a ping per minute bucket — but utterly useless to a human reading a dashboard. Worse, some views rounded it down to zero, which broke the uptime percentage calculation outright: divide the actual ping count by zero expected pings and you get nonsense.
The deeper problem wasn't the display. It was the conceptual one: minute-level aggregation is meaningless for a monitor that pings once an hour. There's no information there. We were computing statistics about buckets that would, by design, be empty 59 times out of 60.
Tiered Aggregation: Matching Granularity to Reality
The fix was to stop treating aggregation as a uniform cascade and start treating it as a decision based on the monitor's ping interval.
The rule is simple: don't aggregate at a finer granularity than the monitor reports. A monitor that pings every minute can have minute-level aggregates (there's something to summarize). A monitor that pings every hour cannot (the minute buckets are almost all empty).
Concretely, we define four tiers:
| Ping interval | Aggregates produced |
|---|---|
| ≤ 60 seconds | minute, 5-minute, hourly, daily |
| ≤ 300 seconds | 5-minute, hourly, daily |
| ≤ 3600 seconds | hourly, daily |
| > 3600 seconds | daily only |
A monitor pinging every 5 minutes skips minute buckets entirely. A monitor pinging every 6 hours gets only daily rollups. The aggregate exists if and only if there's something meaningful to summarize.
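The tier table translates directly into a small decision function. This is a sketch of the rule, not OpShift's actual code; the names are illustrative:

```typescript
type Tier = 'minute' | '5min' | 'hour' | 'day';

// Map a monitor's ping interval (in seconds) to the aggregate tiers it gets.
// Mirrors the tier table: never aggregate finer than the monitor reports.
function tiersForInterval(intervalSeconds: number): Tier[] {
  if (intervalSeconds <= 60) return ['minute', '5min', 'hour', 'day'];
  if (intervalSeconds <= 300) return ['5min', 'hour', 'day'];
  if (intervalSeconds <= 3600) return ['hour', 'day'];
  return ['day'];
}

tiersForInterval(300);   // 5-minute monitor → ['5min', 'hour', 'day']
tiersForInterval(21600); // 6-hour monitor   → ['day']
```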
The Numbers
Let's quantify what this actually saves for a representative fleet.
Take a monitor that pings every 6 hours, retained for 30 days:
- Old: 43,200 minute records + 8,640 5-minute records + 720 hourly records + 30 daily records = 52,590 rows.
- New: 30 daily records. ~99.9% reduction.
A monitor pinging every 5 minutes:
- Old: 43,200 + 8,640 + 720 + 30 = 52,590 rows.
- New: 8,640 + 720 + 30 = 9,390 rows. ~82% reduction.
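The row counts above fall straight out of one piece of arithmetic — 30 days of retention divided by each tier's bucket size:

```typescript
// Bucket sizes per tier, in minutes.
const BUCKET_MINUTES: Record<string, number> = {
  minute: 1, '5min': 5, hour: 60, day: 1440,
};

// Total aggregate rows produced over `days` of retention for a set of tiers.
function rowCount(tiers: string[], days: number): number {
  return tiers.reduce(
    (sum, tier) => sum + (days * 1440) / BUCKET_MINUTES[tier], 0);
}

rowCount(['minute', '5min', 'hour', 'day'], 30); // → 52590 (old uniform cascade)
rowCount(['day'], 30);                           // → 30 (6-hour monitor, tiered)
```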
Even a fast-pinging monitor (every minute) saves nothing individually — it legitimately needs every tier — but that's fine. The fleet-wide savings come from the long-tail monitors that don't.
Across a typical team's monitors, we saw storage drop by roughly 60–70%, and the aggregation cron jobs got 3–4x faster because each one stopped iterating over monitors with nothing to aggregate at its tier.
The Cron Side of the Change
The cron jobs that build these aggregates also got simpler logic. Instead of blindly iterating all monitors at each tier, each job filters first:
- Minute aggregation cron (every 1m): `interval <= 60`
- 5-minute aggregation cron (every 5m): `interval <= 300`
- Hourly aggregation cron (every 15m): `interval <= 3600`
- Daily aggregation cron (every 1h): all monitors
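The filter itself is one predicate per job. A sketch, assuming monitors carry an `interval` field in seconds (the helper name is hypothetical):

```typescript
interface Monitor { id: string; interval: number; } // interval in seconds

// Select only monitors whose ping interval is fine enough to populate a
// tier; a null threshold means the daily cron, which takes everyone.
function monitorsForTier(
  monitors: Monitor[],
  maxInterval: number | null
): Monitor[] {
  return maxInterval === null
    ? monitors
    : monitors.filter(m => m.interval <= maxInterval);
}

// minute cron:  monitorsForTier(all, 60)
// 5-min cron:   monitorsForTier(all, 300)
// hourly cron:  monitorsForTier(all, 3600)
// daily cron:   monitorsForTier(all, null)
```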
The hourly cron runs every 15 minutes rather than hourly — that keeps the current, in-progress hour bucket fresh so users viewing 24h charts see today's data update in near real-time. Same story for the daily cron running hourly.
And here's the detail that matters: the aggregations are hierarchical. The hourly job doesn't re-scan raw pings if 5-minute aggregates already exist for that hour. It sums the twelve 5-minute buckets. The daily job sums 24 hourly buckets. Each tier builds on the one below it, which is how a cron running every hour can keep 30 days of daily aggregates current without touching a single raw ping.
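A minimal sketch of that roll-up step, assuming each bucket carries additive counters (sums compose cleanly across tiers; derived values like uptime percentage are recomputed from the summed counters, never averaged):

```typescript
interface Bucket {
  upPingCount: number;
  downPingCount: number;
  uptimeSeconds: number;
  downtimeSeconds: number;
}

// Build one hourly bucket from its twelve 5-minute children without
// touching raw pings. The daily job does the same over 24 hourly buckets.
function rollUp(children: Bucket[]): Bucket {
  return children.reduce(
    (acc, b) => ({
      upPingCount: acc.upPingCount + b.upPingCount,
      downPingCount: acc.downPingCount + b.downPingCount,
      uptimeSeconds: acc.uptimeSeconds + b.uptimeSeconds,
      downtimeSeconds: acc.downtimeSeconds + b.downtimeSeconds,
    }),
    { upPingCount: 0, downPingCount: 0, uptimeSeconds: 0, downtimeSeconds: 0 }
  );
}
```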
The 100% Uptime Fix
While we were in there, we fixed another subtle bug that the old heartbeat-based calculation had been hiding.
Consider a 1-minute monitor. It pings at 10:00:30 (successfully). The bucket is 10:00:00 – 10:00:59. The old calculation would compute "uptime = 30 seconds" (from the first ping to the end of the bucket), yielding 50% uptime for that minute.
But the monitor was working perfectly. It was expected to ping once in that minute. It did. The expectation was fully met. The "missing" 30 seconds before the first ping is an artifact of where the clock happens to fall relative to when the ping arrived — not a monitoring failure.
The fix:
```typescript
// If all expected pings arrived and all succeeded, it's 100% uptime.
// Don't penalize for time before the first ping if expectations are met.
if (expectedPings > 0 && upPingCount >= expectedPings && downPingCount === 0) {
  heartbeatStats.status = 'up';
  heartbeatStats.uptimeSeconds = bucketSeconds;
  heartbeatStats.downtimeSeconds = 0;
  heartbeatStats.incidentCount = 0;
}
```
Simple rule: if we got every ping we expected and none of them failed, it's 100%. This override applies at every aggregate tier (minute, 5-minute, hourly), and daily inherits it since daily is computed from hourly.
Incomplete Periods: The Other Hidden Bug
One more subtlety worth mentioning, because it plagues most homegrown monitoring systems.
Imagine it's 12:06 PM and you're looking at the current-hour bucket (12:00 – 13:00). A 1-minute monitor has pinged successfully 6 times so far. What's the expected ping count for the bucket?
The naive answer is 60. The correct answer is 6.
If you use 60, you get 6/60 = 10% uptime for the current hour — an obviously wrong number that leaks into the dashboard. If you use 6, you get 6/6 = 100% — which is exactly what's true.
The fix is to detect incomplete periods and use elapsed time rather than full-bucket time:
```typescript
const now = new Date();
const isIncomplete = bucketEnd > now;
const expectedPings = isIncomplete
  ? Math.floor((now.getTime() - bucketStart.getTime()) / 1000 / monitor.interval)
  : Math.floor(bucketDurationSeconds / monitor.interval);
```
Applied at every tier. The current minute, current 5-minute window, and current hour all use elapsed time. Only completed buckets use the full duration.
The Retention Numbers
Once aggregation became interval-aware, we could also right-size retention. The table:
| Tier | Retention |
|---|---|
| Raw pings | 7 days |
| Minute | 1 hour |
| 5-minute | 48 hours |
| Hourly | 30 days |
| Daily | 90 days |
The minute aggregates only exist to serve the 1-hour live view — no reason to keep them longer. The 5-minute aggregates feed the 6-hour view. Hourly feeds the 24h/7d views. Daily feeds the 30d/90d views. Each tier's retention matches the longest chart it drives, no more.
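In code, the retention table reduces to a cutoff per tier; anything older is safe to delete. A sketch (the constant and helper names are hypothetical, and the actual cleanup would be a `DELETE ... WHERE bucket_start < cutoff` per tier):

```typescript
// Retention per tier, in hours. Mirrors the table above.
const RETENTION_HOURS: Record<string, number> = {
  raw: 7 * 24,     // raw pings: 7 days
  minute: 1,       // serves only the 1-hour live view
  '5min': 48,      // serves the 6-hour view
  hour: 30 * 24,   // serves the 24h/7d views
  day: 90 * 24,    // serves the 30d/90d views
};

// Timestamp before which a tier's rows can be deleted.
function retentionCutoff(tier: string, now: Date): Date {
  return new Date(now.getTime() - RETENTION_HOURS[tier] * 3600 * 1000);
}
```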
What Changes for the Frontend
The final piece is the chart UI. With interval-aware aggregation, not every monitor can render every period. A monitor pinging once a day has no meaningful 1-hour chart.
So the period selector filters dynamically based on the monitor's interval:
```typescript
import { getAvailablePeriods, getMinChartPeriod } from '@/lib/services/aggregation-strategy';

const periods = getAvailablePeriods(monitor.interval);
// fast monitor  -> ['1h', '6h', '24h', '7d', '30d', '90d']
// daily monitor -> ['7d', '30d', '90d']

const defaultPeriod = getMinChartPeriod(monitor.interval);
```
The backend enforces this too — requesting an unavailable period for a given monitor returns a 400 rather than rendering a broken chart. The frontend and backend agree on what's meaningful.
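One plausible shape for that shared logic — illustrative only; the real `getAvailablePeriods` may differ in detail — is the same interval thresholds as the aggregation tiers, plus a guard the backend runs before rendering:

```typescript
type Period = '1h' | '6h' | '24h' | '7d' | '30d' | '90d';

// Chart periods a monitor can meaningfully render, by ping interval
// (seconds). Hypothetical logic mirroring the aggregation tiers.
function availablePeriods(intervalSeconds: number): Period[] {
  if (intervalSeconds <= 60) return ['1h', '6h', '24h', '7d', '30d', '90d'];
  if (intervalSeconds <= 300) return ['6h', '24h', '7d', '30d', '90d'];
  if (intervalSeconds <= 3600) return ['24h', '7d', '30d', '90d'];
  return ['7d', '30d', '90d'];
}

// Backend guard: refuse periods the monitor can't render instead of
// returning a broken chart.
function assertPeriod(intervalSeconds: number, period: Period): void {
  if (!availablePeriods(intervalSeconds).includes(period)) {
    throw new Error(`400: period ${period} unavailable for this monitor`);
  }
}
```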
What We Learned
The meta-lesson here is that "uniform" isn't always "correct." Treating every monitor the same feels clean and consistent, but it ends up producing statistics about buckets that have no statistics to produce, rendering UIs for periods that can't be meaningfully rendered, and filling databases with rows that communicate nothing.
The corrective is to let the data's natural granularity drive the aggregation strategy, not the other way around. A monitor that pings hourly is a fundamentally different kind of thing from one that pings every second, and the system should reflect that — in storage, in compute, and in the UI.
If you're building or operating a monitoring system today, the question to ask is: does every row in our aggregates table communicate something true that couldn't be communicated more efficiently? If the answer is no — and for most systems doing blanket multi-tier aggregation, it's no — there's a 60% storage saving and a materially faster cron sitting right there.
