Building On-Call Schedules That Don't Burn Out Your Team
Ask a senior engineer why they left their last job and "on-call" will show up in the answer more often than compensation. Not because being on-call is inherently terrible — most engineers accept that running production means sharing the pager — but because the way most companies design their schedules turns a manageable responsibility into a slow-burn quality-of-life problem.
The interesting thing is that the fix almost never involves better pagers or smarter routing or fancier escalation. It involves rethinking the schedule as a social contract, not a spreadsheet.
What Fair On-Call Actually Means
"Fair on-call" isn't about equal distribution of hours. Two engineers who each take one week a month on paper can have wildly different experiences: one gets paged six times at 3 AM, the other gets paged twice during business hours. Equal schedule ≠ equal burden.
Fairness, in practice, is a function of:
- Predictability. Can I plan my evenings? Can I commit to a date night?
- Recovery time. If I get paged tonight, am I off tomorrow?
- Support. If I'm stuck, who backs me up — fast — without guilt?
- Control. Can I swap shifts when life happens, without asking a manager?
- Transparency. Do I know what I'm signing up for before I sign up for it?
A schedule that delivers on these five is one people don't dread. Skip any of them and you've got a retention problem disguised as an operations problem.
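To make "equal schedule ≠ equal burden" concrete, here is a minimal sketch that scores pages by when they land rather than just counting them. The weight of 3.0 for off-hours pages, the 09:00 to 18:00 weekday window, and the engineer names are all illustrative assumptions, not measured values; pick numbers your team agrees on.

```python
from collections import defaultdict
from datetime import datetime


def burden_by_engineer(pages, off_hours_weight=3.0, day_start=9, day_end=18):
    """Sum a weighted page count per engineer.

    pages: iterable of (engineer, local datetime of the page).
    Weekday pages inside [day_start, day_end) count 1.0; every other
    page counts off_hours_weight (an assumed, tunable penalty).
    """
    scores = defaultdict(float)
    for engineer, when in pages:
        business_hours = when.weekday() < 5 and day_start <= when.hour < day_end
        scores[engineer] += 1.0 if business_hours else off_hours_weight
    return dict(scores)


# Hypothetical week: "ana" is paged six times at 3 AM,
# "ben" twice on weekday afternoons.
pages = (
    [("ana", datetime(2024, 7, day, 3, 0)) for day in range(1, 7)]
    + [("ben", datetime(2024, 7, day, 14, 0)) for day in (2, 3)]
)
scores = burden_by_engineer(pages)
```

Both engineers held "one week" on paper, but under these assumed weights their scores differ by a wide margin, which is the gap a raw hours count hides.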
The Five Design Levers
Every on-call schedule is some combination of the same five levers. Get them right and the rest tends to follow.
1. Rotation Length
The default at most companies is "weekly." It's the wrong default for most teams.
A weekly rotation on a quiet service is fine. A weekly rotation on a noisy service is a one-way ticket to burnout — seven straight days of disrupted sleep with no chance to recover between pages. By day five, you're making worse decisions; by day seven, you're making dangerous ones.
Better starting points:
- Quiet service (< 1 page/week on average): a 1-week rotation is fine.
- Moderate service (1–3 pages/week): 1 week, but with a secondary engineer who shares daytime load.
- Noisy service (> 3 pages/week): Split shifts. Consider 24-hour rotations, or day/night splits.
- Very noisy service: Fix the noise before you fix the schedule. A schedule change is a palliative, not a cure.
The trap is thinking rotation length is purely a scheduling problem. It's not. If you keep needing shorter rotations, that's a signal that something else is wrong with your service or your alerting.
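The starting points above can be written down as a small lookup, which is handy if you want to review every service's rotation against its page volume. This is a sketch: the article gives no number for "very noisy", so the `very_noisy_threshold` default of 10 pages a week is purely an assumption to tune.

```python
def suggest_rotation(pages_per_week, very_noisy_threshold=10.0):
    """Map average weekly page volume to a starting-point rotation shape.

    very_noisy_threshold is an assumed cutoff, not a value from the text.
    """
    if pages_per_week < 0:
        raise ValueError("pages_per_week must be non-negative")
    if pages_per_week < 1:
        return "1-week rotation"
    if pages_per_week <= 3:
        return "1-week rotation with a secondary sharing daytime load"
    if pages_per_week < very_noisy_threshold:
        return "split shifts: 24-hour rotations or day/night splits"
    return "fix the alert noise before changing the schedule"
```

Treat the output as a conversation starter for the team, not a rule the schedule enforces.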
2. Handoff Time
This one is boring and almost always wrong at companies that haven't thought about it.
Most schedules hand off at midnight. This is terrible. It means the outgoing engineer is handing over active incidents while already exhausted, and the incoming engineer is catching up on context when they should be asleep.
Better: hand off during business hours. 10 AM Monday, or end-of-day Friday, with the outgoing engineer explicitly walking through open investigations, known flakes, and any recent deploys that might cause trouble during the incoming shift.
The handoff is a meeting. It's short — 15 minutes — but it's a meeting. Not an automated calendar flip at midnight.
3. Escalation Depth
The primary gets paged. The primary doesn't respond in N minutes. Now what?
This is where most schedules quietly break. The secondary might be defined but unknown to the actual secondary. The tertiary might be the CTO, who's in a different time zone and won't see it until morning. The "final" fallback might be a group chat that nobody owns.
A working escalation policy needs:
- A primary, a secondary, and a tertiary. Three levels. Not two (too fragile), not five (no one knows who they are).
- Explicit timeouts at each level. 5 minutes for the primary to ack, 10 more for the secondary, then the tertiary. These are defaults; tune per service.
- All three levels staffed from the same team. Escalating to someone outside the team is a sign your team is too small to own the service. Fix that first.
- A way to signal "I've got it." An ack is not a fix — it's a claim. "I've seen the page, I'm working it, don't escalate yet."
One pattern we've seen repeatedly: escalation policies written once and never revisited, so the secondary is someone who left the company eighteen months ago. Audit these quarterly.
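A three-level policy with per-level timeouts, plus the quarterly audit, can be sketched in a few lines. This assumes each timeout starts counting when that level is paged (so the tertiary is paged 15 minutes after the alert with the defaults above) and that nobody has acked. The `Level` type, the function names, and the engineer names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    engineer: str
    ack_timeout_min: float  # minutes this level gets to ack before the next is paged


def paged_so_far(policy, minutes_since_alert):
    """Return who has been paged by now, in order, assuming nobody has acked."""
    paged = []
    next_page_at = 0.0  # minute at which the next level gets paged
    for level in policy:
        if minutes_since_alert < next_page_at:
            break
        paged.append(level.engineer)
        next_page_at += level.ack_timeout_min
    return paged


def stale_levels(policy, current_team):
    """Return engineers named in the policy who are no longer on the team."""
    return [level.engineer for level in policy if level.engineer not in current_team]


# The last level's timeout is unused here because nothing sits above it.
policy = [Level("riley", 5), Level("dana", 10), Level("sam", 0)]
```

Running `stale_levels` against the current team roster on a schedule is one cheap way to catch the "secondary left eighteen months ago" failure before a page does.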
4. PTO and Time Off
Here's where a schedule lives or dies.
Engineers take vacation. They get sick. They have kids. Their parents have surgery. The schedule has to handle all of this gracefully, or it stops being a schedule and starts being a source of resentment.
The baseline:
- PTO is a first-class input. The schedule should know about approved time off and route around it automatically. No one should have to manually "find cover."
- Swaps are cheap. Two engineers agreeing to swap shifts should take two clicks, not a Slack thread and a manager's approval.
- Blackout dates. Holidays, company offsites, team members' weddings. These should be propagated across the rotation ahead of time, not discovered at the last minute.
- Shift buffer. Give people at least 48 hours of notice before a rotation change. A Thursday-afternoon "can you cover tomorrow?" request is how you lose people.
In OpShift, PTO requests integrate with the on-call schedule: approved time off automatically removes an engineer from the rotation for those dates, and the next engineer in the escalation chain moves up to cover. It's boring infrastructure, and that's the point: boring infrastructure is what keeps the schedule from breaking every time someone takes a weekend off.
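To illustrate what "PTO as a first-class input" and the 48-hour buffer might look like in code, here is a minimal sketch. It is not OpShift's implementation or API: the function names, the data shapes, and the "fall through to the next engineer in rotation order" rule are assumptions chosen for the example.

```python
from datetime import date, datetime, timedelta


def on_pto(engineer, day, approved_pto):
    """True if `day` falls inside any approved (start, end) range, inclusive."""
    return any(start <= day <= end for start, end in approved_pto.get(engineer, []))


def effective_primary(rotation, scheduled_index, day, approved_pto):
    """Return the scheduled engineer, or the next one in order who is not on PTO."""
    for offset in range(len(rotation)):
        candidate = rotation[(scheduled_index + offset) % len(rotation)]
        if not on_pto(candidate, day, approved_pto):
            return candidate
    raise LookupError("everyone in the rotation is on approved PTO that day")


def has_enough_notice(change_made_at, shift_start, minimum=timedelta(hours=48)):
    """True if a rotation change gives at least `minimum` notice before the shift."""
    return shift_start - change_made_at >= minimum


# Hypothetical rotation and PTO calendar.
rotation = ["ana", "ben", "chen"]
approved_pto = {"ana": [(date(2024, 7, 1), date(2024, 7, 5))]}
```

The useful property is that nobody has to "find cover": the lookup routes around approved time off, and the notice check can reject a Thursday-afternoon change for a Friday shift.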
5. Follow-the-Sun (vs. Not)
For teams with engineers in multiple time zones, follow-the-sun rotations are a powerful tool. When done right, nobody is ever on-call during their own nighttime hours.
When done wrong, they become a way to silently pass around the worst shifts. "The APAC team is on-call for their own hours" sounds fair until you realize APAC is covering the quiet hours while the EMEA team inherits all the traffic-peak incidents during their afternoon.
Guidelines:
- Three regions minimum. Two regions usually means one of them covers weekends.
- Balance incident volume, not just hours. If APAC handles 20% of incidents while covering 33% of the clock, that's fine. If EMEA handles 60% of incidents in its 33% share, you have an imbalance.
- Handoffs at region boundaries, with warm overlap. 30 minutes of overlap between regions lets the outgoing engineer hand off live incidents rather than abandoning them.
- Reserve cross-region coverage for actual emergencies. If APAC is pulled into EMEA hours, that's an event, not a recurring pattern.
Follow-the-sun only works as well as your weakest region. If one region is understaffed, the whole model collapses back onto whoever is left.
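The incident-share versus clock-share comparison in the guidelines can be checked mechanically. In this sketch a ratio of 1.0 means a region's share of incidents matches its share of covered hours. The third region ("AMER"), the incident counts, and the 1.5 tolerance are assumptions for illustration only.

```python
def load_ratios(incidents, hours_covered):
    """Return incident share divided by clock share for each region."""
    total_incidents = sum(incidents.values())
    total_hours = sum(hours_covered.values())
    if total_incidents == 0 or total_hours == 0:
        raise ValueError("need at least one incident and one covered hour")
    return {
        region: (incidents.get(region, 0) / total_incidents)
        / (hours_covered[region] / total_hours)
        for region in hours_covered
    }


def overloaded_regions(ratios, tolerance=1.5):
    """Regions whose incident share exceeds their clock share by `tolerance`x."""
    return sorted(region for region, ratio in ratios.items() if ratio > tolerance)


# Hypothetical quarter: each region covers 8 of every 24 hours.
incidents = {"APAC": 20, "EMEA": 60, "AMER": 20}
hours_covered = {"APAC": 8, "EMEA": 8, "AMER": 8}
ratios = load_ratios(incidents, hours_covered)
```

With these made-up numbers EMEA carries 60% of incidents in a third of the clock, so it is the only region flagged; APAC's 20% in the same share is under the line.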
Common Anti-Patterns
After watching a lot of teams try to build on-call schedules, the same failure modes keep showing up.
Anti-pattern: "The senior engineer is always the final escalation." What it looks like: every serious page eventually ends up in the same person's inbox. They're tired, they're resentful, and their judgement on the thirtieth page of the month isn't what it was on the first. Fix: rotate the top of the escalation stack. Seniority belongs in the room, not on the pager.
Anti-pattern: "We'll figure out escalation when it happens." What it looks like: no written policy, just tribal knowledge. When the primary is on a flight, nobody knows who covers. Fix: write it down. A paragraph is enough. "If I don't ack in 10 minutes, the system pages Dana. If Dana doesn't ack in 10 more, it pages Sam."
Anti-pattern: "Everyone is always on-call, informally." What it looks like: the official rotation says one person, but in practice everyone watches the alerts channel because they don't trust the rotation. Fix: find out why they don't trust it. Usually it's because the primary has missed pages before. That's a people problem, not a schedule problem, and papering over it with informal coverage burns out your whole team.
Anti-pattern: "The schedule is immutable for fairness." What it looks like: no swaps allowed. No shift exchanges. Because "it wouldn't be fair." Fix: fairness means equal aggregate burden, not identical dates. Let people swap. The totals even out over time.
Anti-pattern: "Pages go to a chat channel, not a person." What it looks like: every alert pings #oncall-alerts. Nobody is specifically responsible. In practice, the person who happens to be online handles it, or nobody does. Fix: route to individuals. The channel is for visibility; the page is for responsibility.
The OpShift Approach
OpShift's scheduling turns all of the above into concrete features:
- Schedules are templates, assignments are instances. A weekly rotation template generates specific shifts with specific engineers, which can then be overridden for PTO, swaps, or one-offs without modifying the template.
- PTO is a global block. Approved time off removes the engineer from any schedule covering those dates. No per-schedule re-configuration needed.
- Escalation policies are per-schedule, not per-alert. The schedule knows the primary, secondary, and tertiary. Alerts route through the schedule's policy automatically.
- Swaps are a two-party agreement. Engineer A offers, Engineer B accepts, schedule updates. No manager approval step (they're peers agreeing to trade shifts, not asking for time off).
- Blackout dates at the team level. Holidays and company events are set once and propagate to every schedule the team owns.
The theme: defaults that don't require intervention, overrides that don't require permission.
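The template/instance split is a general data-modeling pattern, and a small sketch shows why overrides stay cheap. This is not OpShift's data model or API: `RotationTemplate`, `Shift`, and `apply_overrides` are hypothetical names, and keying overrides by shift start date is an assumption made to keep the example short.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Shift:
    start: date
    end: date  # exclusive
    engineer: str


@dataclass
class RotationTemplate:
    engineers: list
    first_start: date
    shift_days: int = 7

    def generate(self, count):
        """Expand the template into `count` concrete shifts, round-robin."""
        shifts = []
        for i in range(count):
            start = self.first_start + timedelta(days=i * self.shift_days)
            end = start + timedelta(days=self.shift_days)
            shifts.append(Shift(start, end, self.engineers[i % len(self.engineers)]))
        return shifts


def apply_overrides(shifts, overrides):
    """Return new shifts with per-instance replacements; the template is untouched."""
    return [Shift(s.start, s.end, overrides.get(s.start, s.engineer)) for s in shifts]


template = RotationTemplate(["ana", "ben"], date(2024, 7, 1))
shifts = template.generate(3)
# One-off swap for the second week only.
adjusted = apply_overrides(shifts, {date(2024, 7, 8): "chen"})
```

Because the override touches one generated instance, the rotation resumes its normal order the following week without anyone editing the template back.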
What Good Looks Like
A well-designed on-call schedule has the following properties. Audit yours against them:
- Every engineer can tell you exactly when they're next on-call, without looking it up.
- Every engineer knows who their secondary and tertiary are, and those names don't change without the engineer being told.
- Nobody dreads their rotation. They may not love it, but it doesn't ruin their week in advance.
- PTO never requires a schedule conversation. You request time off and the schedule handles it.
- Swaps happen without drama. Two engineers agree, it's done.
- Incident count is approximately equal across engineers over a quarter.
- The team can describe their escalation policy in one sentence from memory.
If your schedule doesn't produce these, the fix isn't more scheduling software. It's stepping back and asking which of the five levers is misaligned.
One Last Thing
The best on-call schedule is the one you rarely notice — the one where the pager stays quiet because the service is stable, the handoffs happen smoothly because they're rehearsed, and the escalation path triggers rarely enough that when it does, it actually means something.
The schedule's job isn't to distribute pain fairly. It's to prevent pain from accumulating in the first place. Every design choice — rotation length, handoff timing, escalation depth, PTO integration, regional balance — is in service of that. A schedule that's merely "fair" on a spreadsheet isn't enough if the underlying service is noisy or the escalation is broken or the handoffs happen at midnight.
Treat the schedule as a human system, not a scheduling problem, and your team will stay. Treat it as a calendar, and they'll leave — politely, gradually, and without telling you the real reason.
