Your Best Incident Detector Is Already Running: Turning Slack Messages Into Alerts
There's a pattern you've probably seen. Customer support posts in #support: "hey, anyone else getting 500s on checkout?" A few minutes later, another customer: "can confirm, my cart won't load." An engineer in that channel reads it, opens the dashboard, and sees — yeah, something's off.
Meanwhile, the monitoring system has been silently processing synthetic health checks the whole time. Everything is green. No alerts. No pages. The incident has been live for ten minutes and the actual detection happened in a chat channel, by accident, because someone was paying attention.
This isn't a failure of monitoring. It's a gap in what monitoring counts as a signal. Dashboards track what they're configured to track. Customers track everything else.
Slack channel listeners close that gap.
The 10-Minute Gap
Most production outages have a characteristic timeline:
- Something breaks (T=0).
- Users notice and start complaining in whatever channel they complain in — support, sales, social media, internal Slack (T+1 to T+5).
- Synthetic monitoring detects the symptom — a failed check, a rising error rate, a degraded latency metric (T+5 to T+15).
- Someone pages the on-call (T+10 to T+20).
The gap between step 2 and step 3 is the "user knows but the system doesn't" window. For some classes of bugs, it's short. For others — anything involving a small cohort of users, or a code path that synthetic checks don't exercise — it can be 30 minutes or more.
That window is where channel listeners earn their keep. The signal is already there. It's arriving as text, in a channel, written by a human. The monitoring system just isn't reading it.
Why Humans Detect Faster Than Machines
It's worth being specific about why human-reported incidents arrive before automated ones:
- Humans trigger the actual code paths your synthetic monitors don't. Your health check hits `/health` and gets a 200. Your customer hits `/api/checkout/submit` with a specific product in a specific currency on a specific mobile browser, and gets a 500. You can't write a synthetic for every real code path.
- Humans notice degraded behavior before it crosses thresholds. "This page is loading weird today" is a sentence that precedes a P99 latency alert by 10 minutes.
- Humans are better at anomaly detection in the small. One user reporting an unusual error is information. Your monitoring system sees one error out of ten thousand requests and classifies it as noise.
- Humans correlate across sources. "Checkout is slow AND the emails aren't arriving AND the dashboard shows stale data" is a correlation a human draws in seconds. A machine needs three distinct alerting rules.
These aren't arguments for replacing synthetic monitoring. They're arguments for augmenting it with the signal your users are generating for free.
What a Channel Listener Does
A channel listener, at its core, is simple:
- Connect to a Slack channel (or any chat source).
- Stream messages as they arrive.
- Match each message against a set of rules.
- For matches, create an alert with the message as context.
The plumbing is trivial with Slack's Events API. The hard part is the rules — specifically, writing rules that catch real problems without generating a thousand false positives from jokes, links, unrelated threads, and the inevitable "error" someone types while explaining an entirely different topic.
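The four steps above can be sketched as a small pipeline. Everything here is illustrative: the `Message` and `Alert` shapes and the `RULES` list are assumptions, not Slack's API, and in a real integration the messages would arrive through the Events API rather than being constructed by hand.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Message:
    channel: str
    author: str
    text: str

@dataclass
class Alert:
    title: str
    context: Message  # the triggering message travels with the alert

# Hypothetical rule set: any of these keywords in a watched channel.
RULES = [
    {"channel": "#support", "keywords": ["500", "outage", "down"]},
]

def matches(msg: Message) -> bool:
    # Step 3: match each message against the rule set.
    lowered = msg.text.lower()
    return any(
        msg.channel == rule["channel"]
        and any(kw in lowered for kw in rule["keywords"])
        for rule in RULES
    )

def handle(msg: Message) -> Optional[Alert]:
    # Step 4: for matches, create an alert with the message as context.
    if matches(msg):
        return Alert(title=f"Possible incident reported in {msg.channel}", context=msg)
    return None
```

In a real deployment, `handle` would sit inside the Events API callback and push the resulting alert into the alerting pipeline instead of returning it.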
Keyword Matching Isn't Enough
The naive version of a listener is keyword matching: "alert when error, down, or outage appears." This works for about six hours. Then:
- "The error message is really helpful, good work" → false alert.
- "The docs site was down last Tuesday, is it fixed?" → false alert.
- "We need to prepare for the outage tomorrow at 2 AM (planned)" → false alert.
Every false alert erodes trust, engineers start ignoring the channel, and you're back where you started.
The fix is the same fix we apply to webhook noise: layered filter rules. A listener shouldn't fire on any mention of a keyword. It should fire when a combination of signals suggests something is actually wrong.
Filter Rules for Listeners
A production-ready listener rule has three parts:
1. Keyword match (the trigger). The message must contain one of the alert-eligible phrases. Think of this as a coarse filter — it's the first pass.
2. Context match (the qualifier). The message must also come from a specific channel, user group, or thread. A customer in #support saying "error" is meaningful. An engineer in #dev-random saying "error" while debugging is not.
3. Exclusion list (the filter). The message must not contain signals that indicate it's not a real incident. Phrases like "fixed," "resolved," "planned," "false alarm" — if those appear, skip the alert.
A real filter might look like:
TRIGGER:
channel: #support-priority
keywords: ["down", "outage", "500", "can't access", "not working"]
REQUIRE:
- message.author NOT IN (internal_bot_list)
- message.thread_depth < 3 // ignore deep replies
EXCLUDE:
- message contains "planned"
- message contains "yesterday"
- message contains "fixed"
- message contains "resolved"
DEDUPE:
- don't alert if another alert for the same channel fired in the last 10 minutes
Run this over a week of real support messages and you'll catch the 3 messages that were actually reporting incidents while filtering out the 47 that weren't.
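Translated into code, those layers evaluate in order, cheapest first. This is a sketch: the bot list is an assumption, and the in-memory dedupe dict would need to be persisted in a real system.

```python
import time

INTERNAL_BOTS = {"statusbot", "deploybot"}  # assumed bot list
KEYWORDS = ["down", "outage", "500", "can't access", "not working"]
EXCLUSIONS = ["planned", "yesterday", "fixed", "resolved"]
DEDUPE_WINDOW = 600  # 10 minutes, in seconds

_last_alert = {}  # channel -> timestamp of last alert (in-memory for the sketch)

def should_alert(channel, author, text, thread_depth, now=None):
    now = time.time() if now is None else now
    lowered = text.lower()
    # TRIGGER: coarse keyword match, the first pass.
    if not any(kw in lowered for kw in KEYWORDS):
        return False
    # REQUIRE: context qualifiers (not a bot, not a deep reply).
    if author in INTERNAL_BOTS or thread_depth >= 3:
        return False
    # EXCLUDE: phrases that indicate this isn't a live incident.
    if any(phrase in lowered for phrase in EXCLUSIONS):
        return False
    # DEDUPE: at most one alert per channel per window.
    if now - _last_alert.get(channel, float("-inf")) < DEDUPE_WINDOW:
        return False
    _last_alert[channel] = now
    return True
```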
Real Use Cases
Here are the listener configurations we've seen work reliably in production.
Customer-reported outages
Watch #support, #customer-success, and any customer-facing channels for outage-shaped language. Dedupe aggressively (one alert per channel per 10-minute window) to avoid firing on each individual complaint.
The signal you're looking for: any complaint from a customer-facing source, because one complaint usually means ten unreported ones.
Third-party service degradation
Watch vendor status channels (Slack bots from Stripe, Twilio, AWS, etc.) for messages about incidents on services you depend on. This lets you react to a Stripe outage at the moment the status update lands, rather than 20 minutes later when your own error rate crosses the alerting threshold.
Security-relevant chatter
Keywords like "phishing," "leaked," "credential," "suspicious email" in internal channels. These are reports you want escalated immediately, even if nobody filed a formal ticket.
Deploy correlation
Watch #deploys for deploy announcements, correlate timestamps with alerts firing in the next 15 minutes. Creates an implicit "did the last deploy break something" signal that shows up as metadata on alerts rather than as standalone noise.
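The correlation itself is a windowed lookup. A minimal sketch, assuming a deploy log of `{service, at}` records (the structure is an assumption, the 15-minute window comes from the text):

```python
from datetime import datetime, timedelta

CORRELATION_WINDOW = timedelta(minutes=15)

def recent_deploys(alert_time, deploy_log):
    """Return deploys announced in the window before an alert fired,
    so they can be attached to the alert as metadata."""
    return [
        d for d in deploy_log
        if timedelta(0) <= alert_time - d["at"] <= CORRELATION_WINDOW
    ]
```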
Feedback loop on reliability language
More experimental, but useful: listen for phrases like "this is the third time this week" or "I keep seeing this" — these are signals that a bug is recurring even if the individual occurrences are being closed. An alert gets created with the specific phrase highlighted, so the reliability team can pick up patterns humans noticed before dashboards did.
The Integration Architecture
A channel listener that works well shares an architecture with the rest of your alerting pipeline. It shouldn't be a bolt-on with its own notification logic.
The pattern:
- Listener ingests messages from Slack (or Teams, or Discord, or IRC — the source is abstracted).
- Filter rules evaluate against each message. Non-matching messages are dropped.
- Matched messages create alerts through the same alert pipeline as webhooks and monitors.
- Alerts go through the same grouping logic as everything else — so three complaints in a row become one alert with three occurrences, not three separate pages.
- Notifications fire through the same channels — Slack, PagerDuty, email — based on the alert's severity and routing rules.
The key insight: once a listener has produced an alert, it's just an alert. It gets grouped like any other alert, escalated like any other alert, resolved like any other alert. The source (Slack vs. synthetic monitor vs. webhook) is metadata, not a separate lifecycle.
This matters because incidents rarely originate from a single source. A real outage looks like: synthetic monitor fires → customer complaint arrives in Slack → webhook from error tracker spikes → on-call engineer ties them together. If all four create separate, unrelated alerts, the on-call is reading tea leaves. If they all attach to one incident record with four occurrence types, the story reads itself.
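The "one incident record, four occurrence types" merge can be sketched as a keyed ingest. The grouping key and the `Occurrence` shape here are illustrative assumptions, not a specific product's data model:

```python
from dataclasses import dataclass, field

@dataclass
class Occurrence:
    source: str  # "slack", "webhook", "monitor", ...
    detail: str

@dataclass
class GroupedAlert:
    key: str
    occurrences: list = field(default_factory=list)

def ingest(open_alerts, key, occ):
    """Attach a new occurrence to the open alert with the same key,
    or open a new one. The source is metadata, not a separate lifecycle."""
    alert = open_alerts.setdefault(key, GroupedAlert(key=key))
    alert.occurrences.append(occ)
    return alert
```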
Best Practices for Channel Selection
Not every Slack channel is a good listener source. The rule of thumb: listen to channels where humans describe problems, not channels where humans solve them.
Good sources:
- Customer-facing channels (support, success, community)
- Public vendor status channels
- Company-wide #announcements-style channels where outages get cross-posted
- Security incident response channels (with appropriate access)
Bad sources:
- Engineering channels where debugging language is normal and not indicative of incidents
- Random/social channels (high false-positive rate)
- Channels you don't have explicit permission to monitor (this one is non-negotiable)
The filtering principle: a channel is listener-worthy if the base rate of "message means something actually broke" is at least 1 in 20. Below that, your alert stream will be 95% noise no matter how good your filters are.
Privacy and Consent
One thing you have to get right: tell people the channel is being monitored. There's a meaningful difference between "the team uses this tool to catch outages" and "someone is running a bot that reads every message."
Concretely:
- Document which channels are listener sources.
- Surface the listener in Slack's app directory so members can see the integration.
- Limit listener scope to public channels or channels where the integration has been explicitly invited.
- Don't use listeners for performance monitoring of individuals. Ever.
This is less about legal compliance and more about trust. A team that feels surveilled will stop using the channel, and then the listener has nothing to listen to.
Wrapping Up
Channel listeners aren't a replacement for synthetic monitoring, error tracking, or any other piece of your observability stack. They're a cheap, high-signal augmentation that closes the gap between "users know" and "your dashboards know."
The recipe:
- Pick 2-3 channels where humans describe real problems.
- Write filter rules that combine keyword triggers with context and exclusion.
- Pipe matched messages into your existing alert pipeline.
- Let occurrence grouping handle duplicate complaints.
- Tune over time — listeners improve with feedback.
You already have the best incident detection system you're going to build, and it's the humans reading your chat. The only question is whether your monitoring system is reading what they write.
