A monitor is an endpoint that expects your service to ping it on a fixed cadence. If a ping arrives, the monitor stays up. If a ping doesn't arrive within the configured grace window, OpShift flips it to down and raises an alert. This pattern catches the failures that external HTTP probes miss — cron jobs that silently exit, queue consumers that wedge, background workers that crash without restarting.
Anatomy of a monitor
| Field | Type | Description |
|---|---|---|
| name (required) | string | Human-readable label shown in the dashboard and alerts. |
| interval (required) | seconds | How often you expect to ping. The next-expected timestamp is computed from the last ping plus this value. |
| gracePeriod (required) | seconds | Slack on top of the interval before a missed ping fires an alert. Use this to absorb minor timing drift. |
| flapDetectionEnabled | boolean | When enabled, rapid up↔down oscillation within the flap window is suppressed and surfaced as a single 'flapping' state. |
| pingUrl | string | The unique URL-safe identifier appended to /api/ping/. Treat it as a public token; it is paired with your HMAC signature for authentication. |
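To make the timing fields concrete, here is a minimal sketch of how interval and gracePeriod combine into an alert deadline. The `MonitorConfig` interface and both helper functions are hypothetical names chosen for illustration; they mirror the table above and are not taken from the SDK.

```typescript
// Hypothetical shape mirroring the field table; not the SDK's published types.
interface MonitorConfig {
  name: string;
  interval: number; // seconds between expected pings
  gracePeriod: number; // extra seconds of slack before a missed ping alerts
  flapDetectionEnabled?: boolean;
  pingUrl?: string; // assumed here to be assigned by OpShift, hence optional
}

// Next expected ping: the last ping plus the interval.
function nextExpectedAt(lastPingMs: number, cfg: MonitorConfig): number {
  return lastPingMs + cfg.interval * 1000;
}

// Alert deadline: the next expected ping plus the grace period.
function alertDeadline(lastPingMs: number, cfg: MonitorConfig): number {
  return nextExpectedAt(lastPingMs, cfg) + cfg.gracePeriod * 1000;
}

// A daily job with 15 minutes of slack.
const nightlyBackup: MonitorConfig = {
  name: "nightly-backup",
  interval: 86400,
  gracePeriod: 900,
};
const lastPing = Date.UTC(2024, 0, 1, 2, 0, 0);
console.log(new Date(alertDeadline(lastPing, nightlyBackup)).toISOString());
// 2024-01-02T02:15:00.000Z
```

A job that last pinged at 02:00 UTC would, under these assumptions, be considered missed once 02:15 UTC the next day passes without a ping.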
The status state machine
Each monitor has a single status at any time. The dashboard surfaces these values directly:
- idle: created, but no ping has ever arrived. No alerts will fire yet.
- up: last ping arrived on time with status: "up".
- down: either a ping arrived with status: "down", or the grace period elapsed with no ping.
- paused: explicitly paused from the dashboard. Pings to a paused monitor return HTTP 404.
What triggers a transition
Transitions are emitted either by your explicit pings or by the background cron that watches for missed heartbeats:
- First ping ever: establishes the monitoring baseline.
- up → down via timeout: cron observed now > lastPing + interval + grace.
- up → down via explicit ping: your code reported a failure by sending status: "down".
- down → up via explicit ping: service recovered. Any open alert is marked recovered automatically.
- down → down with new reason: the reason field changed (e.g. timeout → db-timeout) so the timeline records the shift.
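The transitions above can be sketched as a small classifier plus the timeout check the cron performs. This is an illustration only: the type names, the transition labels, and the choice to ignore pings on a paused monitor are assumptions, not OpShift's internal implementation.

```typescript
type Status = "idle" | "up" | "down" | "paused";

interface MonitorState {
  status: Status;
  lastPingMs: number | null; // null until the first ping arrives
  reason?: string;
}

interface Ping {
  status?: "up" | "down";
  reason?: string;
}

// Returns a label for the transition a ping would cause, or null if nothing changes.
// The labels are hypothetical; they only name the cases in the list above.
function transitionForPing(state: MonitorState, ping: Ping): string | null {
  const next = ping.status ?? "up"; // an omitted status is treated as "up"
  if (state.status === "paused") return null; // paused monitors do not accept pings
  if (state.status === "idle") return "first-ping";
  if (state.status === "up" && next === "down") return "up->down (explicit)";
  if (state.status === "down" && next === "up") return "down->up (recovered)";
  if (state.status === "down" && next === "down" && ping.reason !== state.reason) {
    return "down->down (new reason)";
  }
  return null;
}

// The cron's timeout condition: now > lastPing + interval + grace (all in ms here).
function timedOut(
  state: MonitorState,
  intervalSec: number,
  graceSec: number,
  nowMs: number,
): boolean {
  return (
    state.status === "up" &&
    state.lastPingMs !== null &&
    nowMs > state.lastPingMs + (intervalSec + graceSec) * 1000
  );
}
```

Keeping the ping path and the timeout path as separate functions reflects the two sources of transitions the text describes: explicit pings and the background cron.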
Sending pings
The ping endpoint accepts an optional JSON body. The minimum viable ping is a POST with no body — that's treated as status: "up".
```typescript
import { OpShift } from "@opshift/sdk";

const opshift = new OpShift({ secret: process.env.OPSHIFT_SECRET! });

try {
  // ... run the job here ...

  // Successful run
  await opshift.ping("payments-worker", {
    status: "up",
    metadata: { processed: 1247, durationMs: 832 },
  });
} catch (err) {
  // Failure: flips the monitor to down and fires an alert
  await opshift.ping("payments-worker", {
    status: "down",
    reason: "stripe-api-timeout",
    metadata: { lastError: err instanceof Error ? err.message : String(err) },
  });
}
```

Optional fields
| Field | Type | Description |
|---|---|---|
| status | "up" or "down" | Defaults to "up" if omitted. Sending "down" raises an incident immediately. |
| reason | string | Short label (max 200 chars) that drives down→down deduplication. Changing the reason is treated as a new incident phase. |
| group_key | string | Routing hint used at alert time to group related failures into one incident. |
| metadata | Record<string, unknown> | Arbitrary key-value payload stored verbatim on the ping row. Surfaces in alerts and the activity timeline. |
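As an illustration of the optional fields, the sketch below types the ping body and applies the documented default and length limit on the client before sending. `PingBody` and `normalizePingBody` are hypothetical names. Whether the server rejects or truncates an over-long reason is not stated here, so throwing is an assumption made for the example.

```typescript
// Hypothetical type for the optional JSON body described in the table above.
interface PingBody {
  status?: "up" | "down";
  reason?: string;
  group_key?: string;
  metadata?: Record<string, unknown>;
}

const MAX_REASON_LENGTH = 200; // limit taken from the table above

// Fills in the documented default status and guards the reason length.
// Throwing on an over-long reason is a client-side choice, not documented server behavior.
function normalizePingBody(body: PingBody = {}): PingBody & { status: "up" | "down" } {
  if (body.reason !== undefined && body.reason.length > MAX_REASON_LENGTH) {
    throw new RangeError(`reason exceeds ${MAX_REASON_LENGTH} characters`);
  }
  return { ...body, status: body.status ?? "up" };
}

console.log(JSON.stringify(normalizePingBody()));
// {"status":"up"}
```

Validating locally keeps a malformed reason from surfacing only as a failed ping at the moment an incident is already in progress.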
Flap detection
Real-world services can oscillate during an incident — a flaky database connection, a slow leader election, a saturated dependency. Without suppression, every flip would page on-call. Enable flap detection on a monitor to collapse rapid up↔down transitions into a single flapping state and a single alert.
| Field | Type | Description |
|---|---|---|
| flapThreshold | number | How many transitions within the window count as flapping. Default is 4. |
| flapWindowMinutes | minutes | Sliding window over which transitions are counted. Default is 10. |
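The two settings can be read as a sliding-window count. The sketch below is one plausible interpretation, not OpShift's implementation: it assumes that reaching the threshold (rather than exceeding it) counts as flapping, and that the window is measured backwards from the current time. `FlapConfig` and `isFlapping` are hypothetical names.

```typescript
interface FlapConfig {
  flapThreshold: number; // transitions within the window that count as flapping
  flapWindowMinutes: number; // length of the sliding window
}

// Defaults taken from the table above.
const DEFAULT_FLAP_CONFIG: FlapConfig = { flapThreshold: 4, flapWindowMinutes: 10 };

// transitionTimesMs holds the timestamps of recent up<->down transitions.
function isFlapping(
  transitionTimesMs: number[],
  nowMs: number,
  cfg: FlapConfig = DEFAULT_FLAP_CONFIG,
): boolean {
  const windowStart = nowMs - cfg.flapWindowMinutes * 60000;
  // Count only transitions that fall inside the window ending at nowMs.
  const recent = transitionTimesMs.filter((t) => t > windowStart && t <= nowMs);
  return recent.length >= cfg.flapThreshold;
}
```

With the defaults, four transitions inside any ten-minute span would be collapsed into the flapping state under this reading, while three would still alert individually.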
The down-alert pipeline already handles active-alert dedup, recovered flagging, concurrent crons, and flap suppression. If you find yourself building your own ping coalescing in the caller, that's usually a signal that the interval or grace period needs tuning instead.
Rate limits and quotas
- Each monitor is rate-limited to 10 pings per minute per ping URL. Exceeding it returns HTTP 429.
- Heartbeats count against your team's monthly usage quota. If the quota is exceeded, pings return 429 with a structured reason.
- If ClickHouse is briefly unreachable during a heartbeat, the ping is still accepted with default state — by design. A backend blip should never silently drop your monitoring data.
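To show what the per-URL limit means in practice, here is a minimal sliding-window limiter. It is a sketch under assumptions: the document does not say whether OpShift counts pings in a fixed or sliding window, and the `PingRateLimiter` class is a hypothetical name used only for this example.

```typescript
const PING_LIMIT = 10; // pings allowed per window, from the list above
const WINDOW_MS = 60000; // one minute

class PingRateLimiter {
  // Timestamps of accepted pings, keyed by ping URL.
  private readonly hits = new Map<string, number[]>();

  // Returns true if a ping at nowMs is allowed; false corresponds to an HTTP 429.
  allow(pingUrl: string, nowMs: number): boolean {
    const recent = (this.hits.get(pingUrl) ?? []).filter((t) => t > nowMs - WINDOW_MS);
    const allowed = recent.length < PING_LIMIT;
    if (allowed) recent.push(nowMs);
    this.hits.set(pingUrl, recent);
    return allowed;
  }
}

const limiter = new PingRateLimiter();
for (let i = 0; i < 10; i++) limiter.allow("payments-worker", i);
console.log(limiter.allow("payments-worker", 10)); // false
```

Because the limit is tracked per ping URL, a noisy monitor in this model does not consume the budget of any other monitor.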