Stop Monitoring Everything Every 30 Seconds
The default instinct when setting up uptime monitoring is to check everything as frequently as possible: 30-second intervals, immediate alerts on the first failure, no tolerance for anything other than a 200 OK.
This approach generates noise. Lots of noise. A momentary network hiccup, a brief DNS resolution delay, or a garbage collection pause in your application can all trigger false positives that pull engineers away from real work.
Good monitoring isn't about checking as often as possible. It's about checking at the right interval, with the right failure threshold, and with the right grace period for each service.
Choosing the Right Monitoring Interval
The monitoring interval should match the criticality and expected behavior of the service:
| Service Type | Recommended Interval | Rationale |
|---|---|---|
| Payment/checkout flow | 30-60 seconds | Revenue-critical, immediate impact |
| Production API | 60 seconds | User-facing, but can tolerate brief blips |
| Marketing website | 300 seconds (5 min) | Important but not real-time critical |
| Internal tools | 300-600 seconds | Used during business hours, lower urgency |
| Status page | 300 seconds | Meta-monitoring, moderate frequency |
| Staging/dev environments | 600-3600 seconds | Non-critical, just need trend data |
| Batch job endpoints | 3600 seconds (1 hr) | Only need to verify they're alive |
The key principle: shorter intervals generate more data, more aggregate records, and more potential false positives. Only use short intervals for services where every minute of downtime matters.
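As a concrete starting point, the table above can be expressed directly as configuration. This is a minimal sketch in Python; the monitor names, structure, and the 300-second fallback are illustrative, not tied to any particular monitoring tool.

```python
# Interval recommendations from the table above, expressed as data.
# Monitor names and structure are illustrative, not tied to any specific tool.
CHECK_INTERVALS_SECONDS = {
    "payment-checkout": 30,    # revenue-critical, immediate impact
    "production-api": 60,      # user-facing, tolerates brief blips
    "marketing-site": 300,     # important, not real-time critical
    "status-page": 300,        # meta-monitoring, moderate frequency
    "internal-tools": 600,     # business hours, lower urgency
    "staging": 1800,           # non-critical, trend data only
    "batch-jobs": 3600,        # just verify they're alive
}

def interval_for(monitor_name: str) -> int:
    """Look up a monitor's interval, defaulting to a conservative 300 seconds."""
    return CHECK_INTERVALS_SECONDS.get(monitor_name, 300)
```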
Understanding Failure Thresholds
A failure threshold defines how many consecutive failures must occur before an alert fires. This is your primary defense against false positives.
Threshold of 1 (alert on first failure):
- Use for: Payment systems, authentication services
- Risk: High false positive rate from transient issues
- Best paired with: Short interval (30-60s) for fast recovery detection
Threshold of 2-3 (alert after 2-3 consecutive failures):
- Use for: Most production services
- Risk: Delays alerting by 1-2 extra intervals compared to a threshold of 1 (60-120 seconds at a 60s interval)
- Best paired with: Standard interval (60s), balances speed and accuracy
Threshold of 5+ (alert after 5+ consecutive failures):
- Use for: Services with known intermittent issues, external dependencies
- Risk: Significant delay before alerting (5+ minutes at 60s interval)
- Best paired with: Shorter interval to compensate for the higher threshold
The math is straightforward: time to alert = interval x threshold. A 60-second interval with a threshold of 3 means you'll be alerted within 3 minutes of a real outage.
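To make the mechanics concrete, here is a minimal sketch of consecutive-failure counting and the time-to-alert calculation. The class and function names are illustrative, not any particular tool's API.

```python
# Minimal sketch of consecutive-failure thresholding; names are illustrative.
class FailureThreshold:
    def __init__(self, threshold: int):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return True when an alert should fire."""
        if check_passed:
            self.consecutive_failures = 0  # any success resets the streak
            return False
        self.consecutive_failures += 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self.consecutive_failures == self.threshold


def time_to_alert(interval_seconds: int, threshold: int) -> int:
    """Seconds from the start of an outage until the alert fires."""
    return interval_seconds * threshold


print(time_to_alert(60, 3))  # 180 seconds: alerted within 3 minutes
```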
Grace Periods: Preventing Alert Flapping
Alert flapping occurs when a service bounces between up and down states rapidly, generating a storm of "down" and "recovered" notifications. This is common with:
- Services behind load balancers (some instances healthy, some not)
- Services with aggressive timeout settings
- Services experiencing high load (intermittent timeouts)
- Services with DNS propagation issues
A grace period defines how long a service must stay in its new state before the state change is confirmed. A 60-second grace period means:
- Monitor detects failure (threshold reached)
- Grace period starts — system waits 60 seconds
- If service recovers during grace period, no alert fires
- If service is still down after grace period, alert fires
This eliminates the "down for 30 seconds, back up, down again for 15 seconds" pattern that generates three or four notifications for what's really a single brief instability.
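A grace period can be implemented as a simple hold-down timer: start the clock when the failure threshold is reached, cancel on recovery, and only alert once the clock runs out. The sketch below assumes the caller supplies timestamps; the class and method names are illustrative.

```python
# Sketch of a grace period (hold-down timer); names and structure are illustrative.
class GracePeriod:
    def __init__(self, grace_seconds: float):
        self.grace_seconds = grace_seconds
        self.pending_since = None   # when the failure threshold was reached

    def on_threshold_reached(self, now: float) -> None:
        if self.pending_since is None:
            self.pending_since = now          # start the grace period

    def on_recovery(self) -> None:
        self.pending_since = None             # recovered in time: no alert

    def should_alert(self, now: float) -> bool:
        return (
            self.pending_since is not None
            and now - self.pending_since >= self.grace_seconds
        )


gp = GracePeriod(grace_seconds=60)
gp.on_threshold_reached(now=0.0)
print(gp.should_alert(now=30.0))   # False: still inside the grace period
gp.on_recovery()                   # service came back at t=45s
print(gp.should_alert(now=90.0))   # False: the brief dip never alerted
```

A production version would also record that an alert has been sent, so a sustained outage fires a single notification rather than one per check.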
Recommended Configurations by Service Type
Here are battle-tested configurations for common service types:
Payment and Checkout Services
- Interval: 30 seconds
- Failure threshold: 1
- Grace period: 0 (immediate alert)
- Alert severity: Critical
- Rationale: Every second of payment downtime is lost revenue
Production APIs
- Interval: 60 seconds
- Failure threshold: 3
- Grace period: 60 seconds
- Alert severity: High
- Rationale: Balances fast detection with false positive prevention
Marketing and Content Sites
- Interval: 300 seconds
- Failure threshold: 2
- Grace period: 120 seconds
- Alert severity: Medium
- Rationale: Downtime is bad for SEO and brand but not immediately revenue-impacting
Internal Tools and Dashboards
- Interval: 600 seconds
- Failure threshold: 3
- Grace period: 300 seconds
- Alert severity: Low
- Rationale: Used during business hours, users can work around brief outages
Third-Party Dependencies
- Interval: 300 seconds
- Failure threshold: 5
- Grace period: 300 seconds
- Alert severity: Medium
- Rationale: You can't fix third-party outages; high threshold reduces noise for issues outside your control
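For reference, here are the same recommendations captured as data, along with the worst-case time from outage to a confirmed alert (interval x threshold, plus the grace period). Field names are illustrative; adapt them to whatever your monitoring tool expects.

```python
# Recommended configurations above as data; field names are illustrative.
RECOMMENDED = {
    "payment-checkout": {"interval": 30,  "threshold": 1, "grace": 0,   "severity": "critical"},
    "production-api":   {"interval": 60,  "threshold": 3, "grace": 60,  "severity": "high"},
    "marketing-site":   {"interval": 300, "threshold": 2, "grace": 120, "severity": "medium"},
    "internal-tools":   {"interval": 600, "threshold": 3, "grace": 300, "severity": "low"},
    "third-party-deps": {"interval": 300, "threshold": 5, "grace": 300, "severity": "medium"},
}

for name, cfg in RECOMMENDED.items():
    # Worst case: every check in the streak fails, then the grace period runs out.
    worst_case = cfg["interval"] * cfg["threshold"] + cfg["grace"]
    print(f"{name}: confirmed alert within ~{worst_case}s of a real outage")
```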
Advanced: Interval-Aware Data Aggregation
One often-overlooked aspect of monitoring intervals is how they affect data aggregation and reporting.
A monitor checking every 60 seconds generates 1,440 data points per day. A monitor checking every 5 minutes generates 288. A monitor checking every hour generates 24.
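The arithmetic is just seconds per day divided by the interval:

```python
def points_per_day(interval_seconds: int) -> int:
    """Data points generated per day at a given check interval."""
    return 86_400 // interval_seconds

print(points_per_day(60), points_per_day(300), points_per_day(3600))  # 1440 288 24
```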
Smart monitoring systems adjust their aggregation strategy based on the interval:
- 30-60 second intervals: Minute-level, 5-minute, hourly, and daily aggregates
- 1-5 minute intervals: 5-minute, hourly, and daily aggregates (skip minute-level)
- 5-60 minute intervals: Hourly and daily aggregates only
- 1+ hour intervals: Daily aggregates only
This prevents the absurdity of creating minute-level aggregates for a monitor that only pings once per hour — which would show "0.0167 expected pings per minute" and confuse everyone looking at the dashboard.
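One way to express that policy is a small lookup from interval to aggregation levels. The tier boundaries below mirror the list above; the level names and function are otherwise illustrative.

```python
# Pick which rollups to materialize based on the check interval.
# Tier boundaries mirror the list above; the level names are illustrative.
def aggregation_levels(interval_seconds: int) -> list[str]:
    if interval_seconds <= 60:
        return ["minute", "5-minute", "hourly", "daily"]
    if interval_seconds <= 300:
        return ["5-minute", "hourly", "daily"]   # skip minute-level
    if interval_seconds < 3600:
        return ["hourly", "daily"]
    return ["daily"]                             # 1+ hour intervals

print(aggregation_levels(60))    # ['minute', '5-minute', 'hourly', 'daily']
print(aggregation_levels(3600))  # ['daily'] -- no minute-level rows for an hourly ping
```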
Common Anti-Patterns to Avoid
Anti-pattern: Uniform intervals for all monitors
Don't set every monitor to 30 seconds. Your staging environment doesn't need the same check frequency as your production payment system.
Anti-pattern: Threshold of 1 everywhere
One failed health check is not an outage. Transient failures happen. Use thresholds of 2-3 for most services.
Anti-pattern: No grace period
Without grace periods, services that flap during deployments or auto-scaling events generate storms of alternating up/down notifications.
Anti-pattern: Ignoring timeout configuration
A 30-second timeout on a 30-second interval means a slow response (not a failure) consumes the entire interval. Set timeouts to 25-50% of the interval.
Anti-pattern: Monitoring only the health check endpoint
Health check endpoints (/health, /ping) often return 200 even when the application is broken. Monitor actual user-facing endpoints or critical flows when possible.
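The last two anti-patterns combine naturally: size the timeout to a fraction of the interval, and point the check at a page real users load rather than /health. A rough sketch, assuming the `requests` library and a placeholder URL and keyword:

```python
import requests

# Check a user-facing page instead of /health, with the timeout sized to
# 25% of the check interval. The URL and keyword are placeholders.
INTERVAL_SECONDS = 60
TIMEOUT_SECONDS = INTERVAL_SECONDS * 0.25

def check_pricing_page() -> bool:
    try:
        resp = requests.get("https://example.com/pricing", timeout=TIMEOUT_SECONDS)
    except requests.RequestException:
        return False  # connection errors and timeouts count as failed checks
    # A 200 from a broken app is common; also assert real content rendered.
    return resp.status_code == 200 and "per month" in resp.text

print(check_pricing_page())
```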
Getting Your Monitoring Right
The goal isn't maximum monitoring — it's effective monitoring. Match your intervals to service criticality, use failure thresholds to eliminate transient noise, and add grace periods to prevent flapping.
OpShift's monitoring system supports configurable intervals, failure thresholds, and grace periods. It also uses interval-aware aggregation, so your dashboard shows meaningful data regardless of check frequency. Available at $14/month for up to 50 users, with no per-seat pricing. Get started at opshift.io.
