Monitors

Passive heartbeats that detect when your service stops checking in.

A monitor is an endpoint that expects your service to ping it on a fixed cadence. If a ping arrives, the monitor stays up. If a ping doesn't arrive within the configured grace window, OpShift flips it to down and raises an alert. This pattern catches the failures that external HTTP probes miss — cron jobs that silently exit, queue consumers that wedge, background workers that crash without restarting.

Anatomy of a monitor

| Field | Type | Description |
|---|---|---|
| name (required) | string | Human-readable label shown in the dashboard and alerts. |
| interval (required) | seconds | How often you expect to ping. The next-expected timestamp is computed from the last ping plus this value. |
| gracePeriod (required) | seconds | Slack on top of the interval before a missed ping fires an alert. Use this to absorb minor timing drift. |
| flapDetectionEnabled | boolean | When enabled, rapid up↔down oscillation within the flap window is suppressed and surfaced as a single 'flapping' state. |
| pingUrl | string | The unique URL-safe identifier appended to /api/ping/. Treat it as a public token; it is paired with your HMAC signature for authentication. |
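
To make the timing fields concrete, here is a minimal sketch of how the alert deadline could be derived from interval and gracePeriod. The `MonitorConfig` type and both helper functions are illustrative names, not part of the OpShift SDK; only the field names come from the table above.

```typescript
// Illustrative shape only: field names follow the table above,
// but this type is an assumption, not an SDK export.
interface MonitorConfig {
  name: string;
  interval: number;    // seconds between expected pings
  gracePeriod: number; // extra seconds before a missed ping alerts
  flapDetectionEnabled?: boolean;
}

// Next-expected timestamp: last ping plus the interval.
function nextExpectedAt(lastPingMs: number, cfg: MonitorConfig): number {
  return lastPingMs + cfg.interval * 1000;
}

// Alert deadline: next-expected timestamp plus the grace period.
function alertDeadline(lastPingMs: number, cfg: MonitorConfig): number {
  return nextExpectedAt(lastPingMs, cfg) + cfg.gracePeriod * 1000;
}

const paymentsWorker: MonitorConfig = {
  name: "payments-worker",
  interval: 300,
  gracePeriod: 60,
};
const lastPing = Date.UTC(2024, 0, 1, 12, 0, 0);
console.log(new Date(alertDeadline(lastPing, paymentsWorker)).toISOString());
// 2024-01-01T12:06:00.000Z
```

With a 300-second interval and a 60-second grace period, a monitor last pinged at 12:00:00 would not be considered overdue until after 12:06:00.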

The status state machine

Each monitor has a single status at any time. The dashboard surfaces these values directly:

  • idle — created, but no ping has ever arrived. No alerts will fire yet.
  • up — last ping arrived on time with status: "up".
  • down — either a ping arrived with status: "down", or the grace period elapsed with no ping.
  • paused — explicitly paused from the dashboard. Pings to a paused monitor return HTTP 404.

What triggers a transition

Transitions are emitted either by your explicit pings or by the background cron that watches for missed heartbeats:

  • First ping ever — establishes the monitoring baseline.
  • up → down via timeout — cron observed now > lastPing + interval + grace.
  • up → down via explicit ping — your code reported a failure by sending status: "down".
  • down → up via explicit ping — service recovered. Any open alert is marked recovered automatically.
  • down → down with new reason — the reason field changed (e.g. timeout → db-timeout) so the timeline records the shift.
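
The transitions above can be sketched as a small pure function. This is a simplified model for reasoning about the rules, not OpShift's actual implementation: the `MonitorState` and `MonitorEvent` types, the `step` function, and the use of "timeout" as the reason recorded on a missed heartbeat are all assumptions made for illustration.

```typescript
type Status = "idle" | "up" | "down" | "paused";

interface MonitorState {
  status: Status;
  reason?: string;
  lastPingMs?: number;
}

// A "ping" is sent by your code; a "tick" models the background cron check.
type MonitorEvent =
  | { kind: "ping"; status: "up" | "down"; reason?: string; atMs: number }
  | { kind: "tick"; nowMs: number; intervalMs: number; graceMs: number };

interface StepResult {
  next: MonitorState;
  transition: string | null; // null when nothing changed
}

function step(state: MonitorState, ev: MonitorEvent): StepResult {
  // Paused monitors ignore everything in this model.
  if (state.status === "paused") return { next: state, transition: null };

  if (ev.kind === "tick") {
    // Timeout rule: now > lastPing + interval + grace.
    const overdue =
      state.status === "up" &&
      state.lastPingMs !== undefined &&
      ev.nowMs > state.lastPingMs + ev.intervalMs + ev.graceMs;
    if (!overdue) return { next: state, transition: null };
    return {
      next: { ...state, status: "down", reason: "timeout" }, // assumed reason label
      transition: "up->down (timeout)",
    };
  }

  const next: MonitorState = {
    status: ev.status,
    reason: ev.status === "down" ? ev.reason : undefined,
    lastPingMs: ev.atMs,
  };
  let transition: string | null = null;
  if (state.status === "idle") transition = "first ping";
  else if (state.status !== ev.status) transition = `${state.status}->${ev.status}`;
  else if (ev.status === "down" && state.reason !== ev.reason) transition = "down->down (new reason)";
  return { next, transition };
}

// Walk through one incident: first ping, missed heartbeat, new reason, recovery.
const r1 = step({ status: "idle" }, { kind: "ping", status: "up", atMs: 0 });
const r2 = step(r1.next, { kind: "tick", nowMs: 400000, intervalMs: 300000, graceMs: 60000 });
const r3 = step(r2.next, { kind: "ping", status: "down", reason: "db-timeout", atMs: 410000 });
const r4 = step(r3.next, { kind: "ping", status: "up", atMs: 420000 });
console.log([r1, r2, r3, r4].map((r) => r.transition));
```

A repeated ping with the same status and the same reason yields no transition in this model, which matches the list above: only a change of status or of reason is recorded.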

Sending pings

The ping endpoint accepts an optional JSON body. The minimum viable ping is a POST with no body — that's treated as status: "up".

```typescript
import { OpShift } from "@opshift/sdk";

const opshift = new OpShift({ secret: process.env.OPSHIFT_SECRET! });

try {
  // ... your job logic runs here ...

  // Successful run
  await opshift.ping("payments-worker", {
    status: "up",
    metadata: { processed: 1247, durationMs: 832 },
  });
} catch (err) {
  // Failure: flips the monitor to down and fires an alert
  await opshift.ping("payments-worker", {
    status: "down",
    reason: "stripe-api-timeout",
    metadata: { lastError: err instanceof Error ? err.message : String(err) },
  });
}
```

Optional fields

| Field | Type | Description |
|---|---|---|
| status | "up" \| "down" | Defaults to "up" if omitted. Sending "down" raises an incident immediately. |
| reason | string | Short label (max 200 chars) that drives down→down deduplication. Changing the reason is treated as a new incident phase. |
| group_key | string | Routing hint used at alert time to group related failures into one incident. |
| metadata | Record<string, unknown> | Arbitrary key-value payload stored verbatim on the ping row. Surfaces in alerts and the activity timeline. |
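
If you build ping bodies by hand, a small client-side check can catch problems before the request is sent. The sketch below is an assumption-laden illustration: the `PingPayload` type and `normalizePing` helper are hypothetical names, and this section does not say how the server handles an over-long reason, so the helper simply refuses to build one.

```typescript
// Hypothetical type mirroring the optional fields in the table above.
interface PingPayload {
  status?: "up" | "down";
  reason?: string;
  group_key?: string;
  metadata?: Record<string, unknown>;
}

const MAX_REASON_LENGTH = 200; // limit stated in the table above

// Applies the documented default and enforces the reason length locally.
function normalizePing(input: PingPayload = {}): PingPayload & { status: "up" | "down" } {
  const status = input.status ?? "up"; // omitted status is treated as "up"
  if (input.reason !== undefined && input.reason.length > MAX_REASON_LENGTH) {
    throw new RangeError(`reason exceeds ${MAX_REASON_LENGTH} characters`);
  }
  return { ...input, status };
}

console.log(JSON.stringify(normalizePing()));
// {"status":"up"}
console.log(JSON.stringify(normalizePing({ status: "down", reason: "db-timeout" })));
// {"status":"down","reason":"db-timeout"}
```

Keeping reason values short and stable (for example "db-timeout" rather than a full error message) also keeps down→down deduplication predictable; put variable detail in metadata instead.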

Flap detection

Real-world services can oscillate during an incident — a flaky database connection, a slow leader election, a saturated dependency. Without suppression, every flip would page on-call. Enable flap detection on a monitor to collapse rapid up↔down transitions into a single flapping state and a single alert.

| Field | Type | Description |
|---|---|---|
| flapThreshold | number | How many transitions within the window count as flapping. Default is 4. |
| flapWindowMinutes | minutes | Sliding window over which transitions are counted. Default is 10. |
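
As a rough illustration of how the two settings interact, the sketch below counts transitions inside a sliding window and compares the count to the threshold. The `FlapConfig` type and `isFlapping` function are hypothetical, and the exact boundary handling (whether a transition exactly at the window edge counts) is an assumption; OpShift's internal logic may differ.

```typescript
interface FlapConfig {
  flapThreshold: number;     // transitions that count as flapping
  flapWindowMinutes: number; // sliding window length
}

// Defaults taken from the table above.
const DEFAULT_FLAP_CONFIG: FlapConfig = { flapThreshold: 4, flapWindowMinutes: 10 };

// transitionTimesMs holds the timestamps of up<->down transitions, in any order.
function isFlapping(
  transitionTimesMs: number[],
  nowMs: number,
  cfg: FlapConfig = DEFAULT_FLAP_CONFIG,
): boolean {
  const windowStart = nowMs - cfg.flapWindowMinutes * 60 * 1000;
  // Count only transitions inside (windowStart, nowMs].
  const recent = transitionTimesMs.filter((t) => t > windowStart && t <= nowMs);
  return recent.length >= cfg.flapThreshold;
}

const minute = 60 * 1000;
const transitions = [1, 3, 5, 7].map((m) => m * minute);

console.log(isFlapping(transitions, 8 * minute));  // true: 4 transitions in the last 10 minutes
console.log(isFlapping(transitions, 12 * minute)); // false: only 3 remain inside the window
```

Raising flapThreshold or shortening flapWindowMinutes makes the monitor less eager to enter the flapping state; the reverse makes suppression kick in sooner.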

Rate limits and quotas

  • Pings are rate-limited to 10 per minute per ping URL. Exceeding the limit returns HTTP 429.
  • Heartbeats count against your team's monthly usage quota. If the quota is exceeded, pings return 429 with a structured reason.
  • If ClickHouse is briefly unreachable during a heartbeat, the ping is still accepted with default state — by design. A backend blip should never silently drop your monitoring data.
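
To stay under the per-URL limit, a client can throttle itself before sending. The class below is a minimal sketch, assuming a sliding one-minute window on the client side; `PingThrottle` is a hypothetical helper, not part of the SDK, and this section does not specify whether the server counts with a sliding or a fixed window.

```typescript
// Client-side guard: allow at most `limit` pings per `windowMs` for one ping URL.
class PingThrottle {
  private sent: number[] = [];
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number = 10, windowMs: number = 60 * 1000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the ping if it fits in the window; false otherwise.
  tryAcquire(nowMs: number = Date.now()): boolean {
    // Drop timestamps that have aged out of the window.
    this.sent = this.sent.filter((t) => nowMs - t < this.windowMs);
    if (this.sent.length >= this.limit) return false;
    this.sent.push(nowMs);
    return true;
  }
}

const throttle = new PingThrottle();
const results: boolean[] = [];
for (let i = 0; i < 11; i++) results.push(throttle.tryAcquire(0));
console.log(results.filter(Boolean).length); // 10: the eleventh attempt is refused
console.log(throttle.tryAcquire(60 * 1000)); // true: the window has rolled over
```

One throttle instance per ping URL mirrors the documented limit. Skipping a heartbeat when `tryAcquire` returns false is usually harmless as long as your ping cadence stays well inside interval plus gracePeriod.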