Operations · 8 min read

Incident communication is a product feature: a status page playbook

A structured guide to running the status page during an incident: what to say when, how often to update, and why ambiguous language costs more than the outage itself.

The Brily team
Founders

An incident is a two-sided thing. On one side, your engineers are fighting the fire. On the other side, your users are deciding whether they still trust you. How the incident shows up on your status page is roughly half of how the incident is remembered.

This is the playbook we use for our own incidents and recommend to everyone we onboard. It's short because good incident comms is about discipline, not cleverness.

The four-state lifecycle, and why it matters

  • Investigating — we see a problem. We're looking. Root cause unknown. ETA unknown.
  • Identified — we know what broke. Fix is underway or deployed. Affected components are named. A rough ETA can be given.
  • Monitoring — fix is in. We're watching to confirm recovery before declaring resolved.
  • Resolved — everything is green. Users are unblocked. A post-mortem is coming within 72 hours for anything above a P2.

The reason to use four fixed states instead of freestyle updates: your on-call engineers and your customer-success team need to be able to predict what a status page entry means without reading it carefully. Four states are low-cognition; a novella is high-cognition.
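The four-state lifecycle can be made machine-checkable so tooling rejects updates that skip a state. This is a minimal sketch, not a real API — the names (`IncidentState`, `transition`) and the choice to allow Monitoring to step back to Identified on a caught regression are our assumptions:

```python
from enum import Enum

class IncidentState(Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"

# Forward-only transitions, with one deliberate exception:
# a regression caught while Monitoring reopens as Identified.
ALLOWED = {
    IncidentState.INVESTIGATING: {IncidentState.IDENTIFIED},
    IncidentState.IDENTIFIED: {IncidentState.MONITORING},
    IncidentState.MONITORING: {IncidentState.RESOLVED, IncidentState.IDENTIFIED},
    IncidentState.RESOLVED: set(),
}

def transition(current: IncidentState, target: IncidentState) -> IncidentState:
    """Return the new state, or raise if the move skips a step."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Encoding the lifecycle this way keeps the "predictable meaning" property: a status entry can only ever be in one of four states, in one of a few orders.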

What to write, at each stage

Investigating

One or two sentences. Name the symptom in user-visible terms ("some users are seeing 502 errors on the dashboard"), not in engineering terms ("the authorization service is returning elevated error rates"). Don't speculate about cause. Don't promise an ETA you haven't earned.

Identified

Three elements: what broke in plain language, which components it affects, and a rough ETA for the fix. Error-bar the ETA generously — "we expect resolution within the hour" is safer than "resolution in 10 minutes." Set the expectation, then beat it.

Monitoring

Say the fix is deployed and you're verifying. Give the verification window: "monitoring for 30 minutes before marking resolved." This buys you the room to catch a regression without having to pull back an overconfident "resolved" post.

Resolved

Short. Confirm the user-facing symptom is gone. Name the component that's now healthy. Promise the post-mortem and put a calendar date on it — then actually deliver.
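The per-stage guidance above can be captured as fill-in templates, so an on-call engineer at 3 a.m. only supplies facts, not prose. The template strings and field names here are illustrative, not a prescribed format:

```python
# Hypothetical per-state message templates; fields are the facts
# the on-call engineer must supply at each stage.
TEMPLATES = {
    "investigating": "We are investigating {symptom}. Next update within 30 minutes.",
    "identified": ("Cause identified: {cause}. Affected: {components}. "
                   "Expected resolution: {eta}."),
    "monitoring": "A fix is deployed. Monitoring for {window} before marking resolved.",
    "resolved": ("{symptom} is resolved and {components} is healthy. "
                 "Post-mortem to follow by {postmortem_date}."),
}

def render(state: str, **fields) -> str:
    """Render a status update; raises KeyError if a required fact is missing."""
    return TEMPLATES[state].format(**fields)
```

The useful property is the failure mode: forgetting a required fact (the ETA, the affected component) raises an error instead of publishing a vague update.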

Update cadence

The update cadence you commit to is the main contract with your users during an incident. Our recommendation:

  • Initial update within 15 minutes of the incident being confirmed.
  • Follow-up every 30 minutes until "Identified".
  • Every 30-60 minutes in Identified, or immediately on material state change.
  • One update on entering Monitoring, one on Resolved.

"No update" feels like abandonment. If nothing has changed, post a 30-minute "no change, still working" note. It's a low-cost signal that costs you nothing and buys you enormous patience from users.

Language to avoid

Some phrases make incidents worse:

  • "Some users may be experiencing..."— this is hedging language that irritates users who are definitely experiencing the issue. Say "users are seeing" and name the symptom.
  • "A third-party dependency..."— sometimes true, always reads as blame-shifting. State it neutrally or don't.
  • "We apologise for any inconvenience" — corporate boilerplate that means nothing. Specific apologies after the fact beat generic apologies during the event.

The post-mortem

A post-mortem on the status page is a trust multiplier that most companies skip because it feels humbling. Do it anyway. The format that works best:

  • What happened, in three sentences, in user terms.
  • Root cause, in engineering terms. Be specific. "A misconfigured cache" is useless; "the Redis TTL was set to 0 on a hot path by mistake" is useful.
  • What we're doing about it — three concrete actions with owners.
  • Timeline, minute-by-minute.

Skip the customer-success-team language. The people who read post-mortems are technical leads at your customer companies. They will respect honesty more than polish.
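The four-part format can double as a publishing checklist. This is a sketch of that idea — the class name and the "three concrete actions" threshold come from the format above, everything else is an assumption:

```python
from dataclasses import dataclass

@dataclass
class PostMortem:
    what_happened: str                       # three sentences, user terms
    root_cause: str                          # engineering terms, specific
    actions: list[tuple[str, str]]           # (action, owner) pairs
    timeline: list[tuple[str, str]]          # (timestamp, event), minute-by-minute

    def is_publishable(self) -> bool:
        """Every section present, with at least three owned actions."""
        return bool(
            self.what_happened
            and self.root_cause
            and len(self.actions) >= 3
            and self.timeline
        )
```

A gate like this won't make a post-mortem honest, but it does stop the common failure of shipping one with no owners attached to the follow-ups.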

Components, not one giant red bar

If your status page shows a single "Website" component, every incident looks like the whole thing is down. That's bad for partial-degradation incidents and bad for the eventual post-mortem. Break the product into meaningful subsystems — API, Web, Login, Email delivery, Scheduled jobs — and mark incidents against the narrowest component that's actually affected.

Your uptime history will be more honest, and your users will trust the page more, because it doesn't panic when one small thing is degraded.
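A component model like the one described can be a few lines of data. This sketch assumes four severity levels and a "worst component wins" page summary — both are our conventions, not a standard:

```python
from dataclasses import dataclass

# Ordered from healthy to worst; the page shows the worst active status.
STATUSES = ("operational", "degraded", "partial_outage", "major_outage")

@dataclass
class Component:
    name: str
    status: str = "operational"

def page_summary(components: list[Component]) -> tuple[str, list[str]]:
    """Overall page status plus the affected components, so one
    degraded subsystem doesn't read as a full outage."""
    worst = max(components, key=lambda c: STATUSES.index(c.status))
    affected = [c.name for c in components if c.status != "operational"]
    return worst.status, affected
```

With this shape, an incident against "Email delivery" leaves "API" and "Web" visibly green, which is exactly the honesty the section argues for.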

Testing the page itself

Once a quarter, do a tabletop exercise: a mock incident, a status page update through all four states, a post-mortem. The first time you do this, you'll discover three things about your status page that don't work under pressure. Fix them before the real incident.