Stop Monitoring Everything Every 30 Seconds
The default instinct when setting up uptime monitoring is to check everything as frequently as possible: 30-second intervals, immediate alerts on the first failure, no tolerance for anything other than a 200 OK.
This approach generates noise. Lots of noise. A momentary network hiccup, a brief DNS resolution delay, or a garbage collection pause in your application can all trigger false positives that pull engineers away from real work.
Good monitoring isn't about checking as often as possible. It's about checking at the right interval, with the right failure threshold, and with the right grace period for each service.
Choosing the Right Monitoring Interval
The monitoring interval should match the criticality and expected behavior of the service:
| Service Type | Recommended Interval | Rationale |
|---|---|---|
| Payment/checkout flow | 30-60 seconds | Revenue-critical, immediate impact |
| Production API | 60 seconds | User-facing, but can tolerate brief blips |
| Marketing website | 300 seconds (5 min) | Important but not real-time critical |
| Internal tools | 300-600 seconds | Used during business hours, lower urgency |
| Status page | 300 seconds | Meta-monitoring, moderate frequency |
| Staging/dev environments | 600-3600 seconds | Non-critical, just need trend data |
| Batch job endpoints | 3600 seconds (1 hr) | Only need to verify they're alive |
The key principle: shorter intervals generate more data, more aggregate records, and more potential false positives. Only use short intervals for services where every minute of downtime matters.
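As a concrete starting point, the table above can be expressed directly as configuration. This is a minimal sketch in Python; the monitor names, structure, and the 300-second fallback are illustrative, not tied to any particular monitoring tool.

```python
# Interval recommendations from the table above, expressed as data.
# Monitor names and structure are illustrative, not tied to any specific tool.
CHECK_INTERVALS_SECONDS = {
    "payment-checkout": 30,    # revenue-critical, immediate impact
    "production-api": 60,      # user-facing, tolerates brief blips
    "marketing-site": 300,     # important, not real-time critical
    "status-page": 300,        # meta-monitoring, moderate frequency
    "internal-tools": 600,     # business hours, lower urgency
    "staging": 1800,           # non-critical, trend data only
    "batch-jobs": 3600,        # just verify they're alive
}

def interval_for(monitor_name: str) -> int:
    """Look up a monitor's interval, defaulting to a conservative 300 seconds."""
    return CHECK_INTERVALS_SECONDS.get(monitor_name, 300)
```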
Understanding Failure Thresholds
A failure threshold defines how many consecutive failures must occur before an alert fires. This is your primary defense against false positives.
Threshold of 1 (alert on first failure):
- Use for: Payment systems, authentication services
- Risk: High false positive rate from transient issues
- Best paired with: Short interval (30-60s) for fast recovery detection
Threshold of 2-3 (alert after 2-3 consecutive failures):
- Use for: Most production services
- Risk: Delays alerting by 1-2 extra intervals compared to a threshold of 1 (60-120 seconds at a 60s interval)
- Best paired with: Standard interval (60s), balances speed and accuracy
Threshold of 5+ (alert after 5+ consecutive failures):
- Use for: Services with known intermittent issues, external dependencies
- Risk: Significant delay before alerting (5+ minutes at 60s interval)
- Best paired with: Shorter interval to compensate for the higher threshold
The math is straightforward: time to alert = interval x threshold. A 60-second interval with a threshold of 3 means you'll be alerted within 3 minutes of a real outage.
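To make the mechanics concrete, here is a minimal sketch of consecutive-failure counting and the time-to-alert calculation. The class and function names are illustrative, not any particular tool's API.

```python
# Minimal sketch of consecutive-failure thresholding; names are illustrative.
class FailureThreshold:
    def __init__(self, threshold: int):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return True when an alert should fire."""
        if check_passed:
            self.consecutive_failures = 0  # any success resets the streak
            return False
        self.consecutive_failures += 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self.consecutive_failures == self.threshold


def time_to_alert(interval_seconds: int, threshold: int) -> int:
    """Seconds from the start of an outage until the alert fires."""
    return interval_seconds * threshold


print(time_to_alert(60, 3))  # 180 seconds: alerted within 3 minutes
```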
Grace Periods: Preventing Alert Flapping
Alert flapping occurs when a service bounces between up and down states rapidly, generating a storm of "down" and "recovered" notifications. This is common with:
- Services behind load balancers (some instances healthy, some not)
- Services with aggressive timeout settings
- Services experiencing high load (intermittent timeouts)
- Services with DNS propagation issues
A grace period defines how long a service must stay in its new state before the state change is confirmed. A 60-second grace period means:
- Monitor detects failure (threshold reached)
- Grace period starts — system waits 60 seconds
- If service recovers during grace period, no alert fires
- If service is still down after grace period, alert fires
This eliminates the "down for 30 seconds, back up, down again for 15 seconds" pattern that generates three or four notifications for what's really a single brief instability.
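A grace period can be implemented as a simple hold-down timer: start the clock when the failure threshold is reached, cancel on recovery, and only alert once the clock runs out. The sketch below assumes the caller supplies timestamps; the class and method names are illustrative.

```python
# Sketch of a grace period (hold-down timer); names and structure are illustrative.
class GracePeriod:
    def __init__(self, grace_seconds: float):
        self.grace_seconds = grace_seconds
        self.pending_since = None   # when the failure threshold was reached

    def on_threshold_reached(self, now: float) -> None:
        if self.pending_since is None:
            self.pending_since = now          # start the grace period

    def on_recovery(self) -> None:
        self.pending_since = None             # recovered in time: no alert

    def should_alert(self, now: float) -> bool:
        return (
            self.pending_since is not None
            and now - self.pending_since >= self.grace_seconds
        )


gp = GracePeriod(grace_seconds=60)
gp.on_threshold_reached(now=0.0)
print(gp.should_alert(now=30.0))   # False: still inside the grace period
gp.on_recovery()                   # service came back at t=45s
print(gp.should_alert(now=90.0))   # False: the brief dip never alerted
```

A production version would also record that an alert has been sent, so a sustained outage fires a single notification rather than one per check.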
Recommended Configurations by Service Type
Here are battle-tested configurations for common service types:
Payment and Checkout Services
- Interval: 30 seconds
- Failure threshold: 1
- Grace period: 0 (immediate alert)
- Alert severity: Critical
- Rationale: Every second of payment downtime is lost revenue
Production APIs
- Interval: 60 seconds
- Failure threshold: 3
- Grace period: 60 seconds
- Alert severity: High
- Rationale: Balances fast detection with false positive prevention
Marketing and Content Sites
- Interval: 300 seconds
- Failure threshold: 2
- Grace period: 120 seconds
- Alert severity: Medium
- Rationale: Downtime is bad for SEO and brand but not immediately revenue-impacting
Internal Tools and Dashboards
- Interval: 600 seconds
- Failure threshold: 3
- Grace period: 300 seconds
- Alert severity: Low
- Rationale: Used during business hours, users can work around brief outages
Third-Party Dependencies
- Interval: 300 seconds
- Failure threshold: 5
- Grace period: 300 seconds
- Alert severity: Medium
- Rationale: You can't fix third-party outages; high threshold reduces noise for issues outside your control
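For reference, here are the same recommendations captured as data, along with the worst-case time from outage to a confirmed alert (interval x threshold, plus the grace period). Field names are illustrative; adapt them to whatever your monitoring tool expects.

```python
# Recommended configurations above as data; field names are illustrative.
RECOMMENDED = {
    "payment-checkout": {"interval": 30,  "threshold": 1, "grace": 0,   "severity": "critical"},
    "production-api":   {"interval": 60,  "threshold": 3, "grace": 60,  "severity": "high"},
    "marketing-site":   {"interval": 300, "threshold": 2, "grace": 120, "severity": "medium"},
    "internal-tools":   {"interval": 600, "threshold": 3, "grace": 300, "severity": "low"},
    "third-party-deps": {"interval": 300, "threshold": 5, "grace": 300, "severity": "medium"},
}

for name, cfg in RECOMMENDED.items():
    # Worst case: every check in the streak fails, then the grace period runs out.
    worst_case = cfg["interval"] * cfg["threshold"] + cfg["grace"]
    print(f"{name}: confirmed alert within ~{worst_case}s of a real outage")
```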
Advanced: Interval-Aware Data Aggregation
One often-overlooked aspect of monitoring intervals is how they affect data aggregation and reporting.
A monitor checking every 60 seconds generates 1,440 data points per day. A monitor checking every 5 minutes generates 288. A monitor checking every hour generates 24.
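The arithmetic is just seconds per day divided by the interval:

```python
def points_per_day(interval_seconds: int) -> int:
    """Data points generated per day at a given check interval."""
    return 86_400 // interval_seconds

print(points_per_day(60), points_per_day(300), points_per_day(3600))  # 1440 288 24
```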
Smart monitoring systems adjust their aggregation strategy based on the interval:
- 30-60 second intervals: Minute-level, 5-minute, hourly, and daily aggregates
- 1-5 minute intervals: 5-minute, hourly, and daily aggregates (skip minute-level)
- 5-60 minute intervals: Hourly and daily aggregates only
- 1+ hour intervals: Daily aggregates only
This prevents the absurdity of creating minute-level aggregates for a monitor that only pings once per hour — which would show "0.0167 expected pings per minute" and confuse everyone looking at the dashboard.
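One way to express that policy is a small lookup from interval to aggregation levels. The tier boundaries below mirror the list above; the level names and function are otherwise illustrative.

```python
# Pick which rollups to materialize based on the check interval.
# Tier boundaries mirror the list above; the level names are illustrative.
def aggregation_levels(interval_seconds: int) -> list[str]:
    if interval_seconds <= 60:
        return ["minute", "5-minute", "hourly", "daily"]
    if interval_seconds <= 300:
        return ["5-minute", "hourly", "daily"]   # skip minute-level
    if interval_seconds < 3600:
        return ["hourly", "daily"]
    return ["daily"]                             # 1+ hour intervals

print(aggregation_levels(60))    # ['minute', '5-minute', 'hourly', 'daily']
print(aggregation_levels(3600))  # ['daily'] -- no minute-level rows for an hourly ping
```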
Common Anti-Patterns to Avoid
Anti-pattern: Uniform intervals for all monitors
Don't set every monitor to 30 seconds. Your staging environment doesn't need the same check frequency as your production payment system.
Anti-pattern: Threshold of 1 everywhere
One failed health check is not an outage. Transient failures happen. Use thresholds of 2-3 for most services.
Anti-pattern: No grace period
Without grace periods, services that flap during deployments or auto-scaling events generate storms of alternating up/down notifications.
Anti-pattern: Ignoring timeout configuration
A 30-second timeout on a 30-second interval means a slow response (not a failure) consumes the entire interval. Set timeouts to 25-50% of the interval.
Anti-pattern: Monitoring only the health check endpoint
Health check endpoints (/health, /ping) often return 200 even when the application is broken. Monitor actual user-facing endpoints or critical flows when possible.
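The last two anti-patterns combine naturally: size the timeout to a fraction of the interval, and point the check at a page real users load rather than /health. A rough sketch, assuming the `requests` library and a placeholder URL and keyword:

```python
import requests

# Check a user-facing page instead of /health, with the timeout sized to
# 25% of the check interval. The URL and keyword are placeholders.
INTERVAL_SECONDS = 60
TIMEOUT_SECONDS = INTERVAL_SECONDS * 0.25

def check_pricing_page() -> bool:
    try:
        resp = requests.get("https://example.com/pricing", timeout=TIMEOUT_SECONDS)
    except requests.RequestException:
        return False  # connection errors and timeouts count as failed checks
    # A 200 from a broken app is common; also assert real content rendered.
    return resp.status_code == 200 and "per month" in resp.text

print(check_pricing_page())
```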
Getting Your Monitoring Right
The goal isn't maximum monitoring — it's effective monitoring. Match your intervals to service criticality, use failure thresholds to eliminate transient noise, and add grace periods to prevent flapping.
OpShift's monitoring system supports configurable intervals, failure thresholds, and grace periods. It also uses interval-aware aggregation, so your dashboard shows meaningful data regardless of check frequency. Available at $14/month for up to 50 users, with no per-seat pricing. Get started at opshift.io.
