Monitors

Passive heartbeats that detect when your service stops checking in.

A monitor is an endpoint that expects your service to ping it on a fixed cadence. If a ping arrives, the monitor stays up. If a ping doesn't arrive within the configured grace window, OpShift flips it to down and raises an alert. This pattern catches the failures that external HTTP probes miss — cron jobs that silently exit, queue consumers that wedge, background workers that crash without restarting.

Anatomy of a monitor

Field	Type	Description
`name` (required)	`string`	Human-readable label shown in the dashboard and alerts.
`interval` (required)	`seconds`	How often you expect to ping. The next-expected timestamp is computed from the last ping plus this value.
`gracePeriod` (required)	`seconds`	Slack on top of the interval before a missed ping fires an alert. Use this to absorb minor timing drift.
`flapDetectionEnabled`	`boolean`	When enabled, rapid up↔down oscillation within the flap window is suppressed and surfaced as a single 'flapping' state.
`pingUrl`	`string`	The unique URL-safe identifier appended to /api/ping/. Treat it as a public token: it does not authenticate a ping on its own and is paired with your HMAC signature for authentication.
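
As a rough illustration of how `interval` and `gracePeriod` combine, the sketch below models a monitor's settings and computes when a missed ping would be flagged. The `MonitorConfig` type and the `nextExpected` and `alertDeadline` helpers are hypothetical names used only for illustration, not part of the OpShift SDK, and timestamps are assumed to be Unix epoch seconds.

```typescript
// Hypothetical shape mirroring the fields in the table above.
interface MonitorConfig {
  name: string;
  interval: number; // seconds between expected pings
  gracePeriod: number; // extra seconds of slack before alerting
  flapDetectionEnabled?: boolean;
}

// Next expected ping: the last ping plus the interval.
function nextExpected(lastPingAt: number, cfg: MonitorConfig): number {
  return lastPingAt + cfg.interval;
}

// Moment after which a still-missing ping counts as a timeout.
function alertDeadline(lastPingAt: number, cfg: MonitorConfig): number {
  return nextExpected(lastPingAt, cfg) + cfg.gracePeriod;
}

// Example: a daily job with 15 minutes of slack.
const nightlyBackup: MonitorConfig = {
  name: "nightly-backup",
  interval: 86400,
  gracePeriod: 900,
};
```

With these values, a ping at midnight means the monitor is only considered down if nothing has arrived by 00:15 the following day.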

The status state machine

Each monitor has a single status at any time. The dashboard surfaces these values directly:

idle — created, but no ping has ever arrived. No alerts will fire yet.
up — last ping arrived on time with status: "up".
down — either a ping arrived with status: "down", or the grace period elapsed with no ping.
paused — explicitly paused from the dashboard. Pings to a paused monitor return HTTP 404.

What triggers a transition

Transitions are emitted either by your explicit pings or by the background cron that watches for missed heartbeats:

First ping ever — establishes the monitoring baseline.
up → down via timeout — cron observed now > lastPing + interval + gracePeriod.
up → down via explicit ping — your code reported a failure by sending status: "down".
down → up via explicit ping — service recovered. Any open alert is marked recovered automatically.
down → down with new reason — the reason field changed (e.g. timeout → db-timeout) so the timeline records the shift.
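
The transitions above can be pictured as a small state machine. This is a simplified sketch under stated assumptions, not OpShift's actual implementation: `MonitorState`, `applyPing`, and `checkTimeout` are hypothetical names, timestamps are epoch seconds, and a ping to a paused monitor is modeled as a no-op rather than an HTTP 404.

```typescript
type Status = "idle" | "up" | "down" | "paused";

interface MonitorState {
  status: Status;
  lastPingAt: number | null; // epoch seconds
  reason: string | null;
  interval: number; // seconds
  gracePeriod: number; // seconds
}

interface Ping {
  status?: "up" | "down";
  reason?: string;
}

// Applies an explicit ping and returns a label for the transition,
// or null when nothing new would be recorded.
function applyPing(state: MonitorState, ping: Ping, now: number): string | null {
  if (state.status === "paused") return null; // paused monitors reject pings
  const prev = state.status;
  const prevReason = state.reason;
  const next = ping.status ?? "up"; // a ping without a status counts as "up"
  state.status = next;
  state.lastPingAt = now;
  state.reason = next === "down" ? (ping.reason ?? null) : null;
  if (prev === "idle") return "first-ping";
  if (prev !== next) return `${prev}->${next}`;
  if (next === "down" && prevReason !== state.reason) return "down->down (new reason)";
  return null;
}

// What the background cron checks: now > lastPing + interval + gracePeriod.
function checkTimeout(state: MonitorState, now: number): string | null {
  if (state.status !== "up" || state.lastPingAt === null) return null;
  if (now > state.lastPingAt + state.interval + state.gracePeriod) {
    state.status = "down";
    state.reason = "timeout";
    return "up->down (timeout)";
  }
  return null;
}
```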

Sending pings

The ping endpoint accepts an optional JSON body. The minimum viable ping is a POST with no body — that's treated as status: "up".

```typescript
import { OpShift } from "@opshift/sdk";

const opshift = new OpShift({ secret: process.env.OPSHIFT_SECRET! });

try {
  // ... run the job here ...

  // Successful run
  await opshift.ping("payments-worker", {
    status: "up",
    metadata: { processed: 1247, durationMs: 832 },
  });
} catch (err) {
  // Failure: flips the monitor to down and fires an alert
  await opshift.ping("payments-worker", {
    status: "down",
    reason: "stripe-api-timeout",
    metadata: { lastError: err instanceof Error ? err.message : String(err) },
  });
}
```

Optional fields

Field	Type	Description
`status`	`"up" | "down"`	Defaults to "up" if omitted. Sending "down" raises an incident immediately.
`reason`	`string`	Short label (max 200 chars) that drives down→down deduplication. Changing the reason is treated as a new incident phase.
`group_key`	`string`	Routing hint used at alert time to group related failures into one incident.
`metadata`	`Record<string, unknown>`	Arbitrary key-value payload stored verbatim on the ping row. Surfaces in alerts and the activity timeline.
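
To make the defaults concrete, here is a minimal sketch of the payload shape and a client-side check before sending. The `PingPayload` type and `normalizePing` helper are hypothetical and not part of the SDK. How the server treats a `reason` longer than 200 characters is not stated above, so throwing early is an assumption made for this example.

```typescript
// Hypothetical type mirroring the optional fields in the table above.
interface PingPayload {
  status?: "up" | "down";
  reason?: string;
  group_key?: string;
  metadata?: Record<string, unknown>;
}

const MAX_REASON_LENGTH = 200;

// Fills in the default status and rejects an over-long reason locally.
function normalizePing(input: PingPayload = {}): PingPayload & { status: "up" | "down" } {
  if (input.reason !== undefined && input.reason.length > MAX_REASON_LENGTH) {
    throw new RangeError(`reason exceeds ${MAX_REASON_LENGTH} characters`);
  }
  return { ...input, status: input.status ?? "up" };
}
```

Calling `normalizePing()` with no argument yields a payload with `status: "up"`, matching the behavior of a POST with no body.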

Flap detection

Real-world services can oscillate during an incident — a flaky database connection, a slow leader election, a saturated dependency. Without suppression, every flip would page on-call. Enable flap detection on a monitor to collapse rapid up↔down transitions into a single flapping state and a single alert.

Field	Type	Description
`flapThreshold`	`number`	How many transitions within the window count as flapping. Default is 4.
`flapWindowMinutes`	`minutes`	Sliding window over which transitions are counted. Default is 10.
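
One way to picture the two settings is a sliding-window counter over transition timestamps. The sketch below is illustrative only: `FlapDetector` is a hypothetical class, timestamps are in milliseconds, and it assumes a monitor counts as flapping once the number of transitions inside the window reaches `flapThreshold`. The exact comparison OpShift uses is not stated here.

```typescript
class FlapDetector {
  private transitions: number[] = []; // timestamps (ms) of recent transitions
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(flapThreshold = 4, flapWindowMinutes = 10) {
    this.threshold = flapThreshold;
    this.windowMs = flapWindowMinutes * 60 * 1000;
  }

  // Records one up<->down transition and reports whether the monitor
  // now counts as flapping.
  record(atMs: number): boolean {
    const cutoff = atMs - this.windowMs;
    // Drop transitions that have slid out of the window.
    this.transitions = this.transitions.filter((t) => t > cutoff);
    this.transitions.push(atMs);
    return this.transitions.length >= this.threshold;
  }
}
```

With the defaults, four transitions one minute apart trip the detector on the fourth, while the same four spread over an hour never do.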

Don't add a second dedup layer

The down-alert pipeline already handles active-alert dedup, recovered flagging, concurrent crons, and flap suppression. If you find yourself building your own ping coalescing in the caller, that's usually a signal that the interval or grace period needs tuning instead.

Rate limits and quotas

Each ping URL is rate-limited to 10 pings per minute. Requests over the limit return HTTP 429.
Heartbeats count against your team's monthly usage quota. If the quota is exceeded, pings return HTTP 429 with a structured reason.
If ClickHouse is briefly unreachable during a heartbeat, the ping is still accepted with default state. This is by design: a backend blip should never silently drop your monitoring data.
