
Beyond the green check: what a meaningful uptime monitor validates

A 200 OK is the weakest signal an uptime monitor can give you. Here's a practical guide to monitors that actually tell you whether your product is working.

The Brily team
Founders

Every uptime monitoring vendor demos the same scenario: your site goes down, their tool pages you, you fix it, you're a hero. It's compelling on a sales call and almost irrelevant to the 95% of outages that actually happen in production.

Real outages in 2026 look like this: the homepage returns a 200, the API returns a 200, the login form submits successfully, and something behind the login wall is broken for 40% of paid users. A monitor checking whether the homepage returns 200 tells you nothing. Everything is fine, except what your users are actually doing.

The reason 200 OK is misleading

HTTP 200 means the server returned a response. It says nothing about what's in the response. A catastrophic backend failure that causes your framework's error page to render — still a 200. A cached stale response served by CDN while the origin is on fire — still a 200. A login page that loads perfectly but rejects every credential — 200.

This is why assertions matter more than status codes. The cheap version is a substring check: assert that the response body contains a known string ("Log out" on an authenticated page, a specific CTA on the homepage). The better version is schema validation on API responses. The best version is a scripted check that does something — logs in, loads a page behind auth, asserts on the DOM.
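The first two tiers can be sketched in a few lines. This is an illustrative sketch, not any particular monitoring tool's API; the function names and the required-field shapes are assumptions:

```python
# Sketch of the first two assertion tiers, weakest to strongest.
# Names and shapes here are illustrative, not a real monitoring API.

def assert_substring(body: str, needle: str) -> bool:
    """Cheap tier: the response body must contain a known string."""
    return needle in body

def assert_schema(payload: dict, required: dict) -> bool:
    """Better tier: every required field must exist with the right type."""
    return all(
        key in payload and isinstance(payload[key], expected_type)
        for key, expected_type in required.items()
    )

# A framework error page can still ship with HTTP 200 -- the substring
# assertion catches what the status code cannot.
error_page = "<html><h1>Something went wrong</h1></html>"
healthy_page = '<html><a href="/logout">Log out</a></html>'
assert not assert_substring(error_page, "Log out")
assert assert_substring(healthy_page, "Log out")

# API check: require an integer id and a string status.
assert assert_schema({"id": 42, "status": "active"}, {"id": int, "status": str})
assert not assert_schema({"id": "42", "status": "active"}, {"id": int, "status": str})
```

The third tier, a scripted browser check, is what the rest of this post's synthetic journeys are built on.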

The monitor pyramid

For most products, here's the order in which I'd build monitors, highest priority first:

  1. One synthetic check per revenue-critical user journey. If it's a SaaS product: sign up, log in, core action. If it's e-commerce: browse, add to cart, checkout.
  2. One check per critical API endpoint with response schema assertion and latency budget.
  3. TLS certificate expiry. Cheap to monitor, catastrophic to miss.
  4. Heartbeat monitors on scheduled jobs. Every backup, every nightly aggregation, every webhook consumer should check in on a cadence. Silence means failure.
  5. DNS / CNAME health for customer-facing domains. Especially if you offer status pages or custom domains to your users.
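Item 4 deserves a concrete shape, because "silence means failure" is an inversion of how most monitors work. A minimal sketch of the server-side decision, assuming each job checks in after every run and each heartbeat is configured with a cadence plus a grace period:

```python
from datetime import datetime, timedelta

def heartbeat_overdue(last_checkin: datetime, now: datetime,
                      cadence: timedelta, grace: timedelta) -> bool:
    """A heartbeat monitor alerts on silence: if the job hasn't checked in
    within its expected cadence plus a grace period, treat it as failed."""
    return now - last_checkin > cadence + grace

# Nightly backup expected every 24h, with a 30-minute grace period.
last = datetime(2026, 1, 1, 2, 0)
assert not heartbeat_overdue(last, datetime(2026, 1, 2, 2, 15),
                             timedelta(hours=24), timedelta(minutes=30))
assert heartbeat_overdue(last, datetime(2026, 1, 2, 3, 0),
                         timedelta(hours=24), timedelta(minutes=30))
```

The grace period matters: a backup that starts on time but runs long should not page anyone.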

Multi-region isn't a luxury

A monitor that runs from one region tells you whether the monitor's network link to your server is working. That's mostly uninteresting. Multi-region monitoring — or at minimum, two-region — lets you distinguish "this one probe's ISP is flaky" from "the service is actually down".

The practical rule we recommend: never declare an incident on a single probe's failure. Require at least two regions to agree within the same window. On Brily this is called quorum, and it's configurable per monitor; the default is 2-of-3.
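The quorum rule itself is a few lines of logic. This is an illustrative sketch of the idea, not Brily's implementation:

```python
def declare_incident(region_results: dict[str, bool], quorum: int = 2) -> bool:
    """region_results maps region -> probe success within the same window.
    Declare an incident only when at least `quorum` regions saw a failure."""
    failures = sum(1 for ok in region_results.values() if not ok)
    return failures >= quorum

# One flaky probe is not an incident; two regions agreeing is.
assert not declare_incident({"us-east": False, "eu-west": True, "ap-south": True})
assert declare_incident({"us-east": False, "eu-west": False, "ap-south": True})
```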

Latency budgets: worth setting, painful to tune

A latency budget is an assertion like "p95 response time must be under 800ms". It's one of the highest-value signals for a healthy product, and it's also the most common source of alert noise, because your actual p95 is not a straight line — it wobbles with traffic, CDN warmth, background jobs.

Practical guidance:

  • Set the budget comfortably above your last-30-days p95 baseline. If your actual p95 is 600ms, the alerting threshold should be 900ms or higher.
  • Always combine latency thresholds with duration windows. "p95 over 900ms for 10 minutes" is signal. "one slow request" is noise.
  • Review latency budgets every quarter. If you've been alerting on them, tune them. If you haven't been alerting on them, tighten them.
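The first two bullets combine into a small amount of logic: derive the threshold from a baseline, then alert only on a sustained breach. A sketch, with a nearest-rank percentile and a simple consecutive-minutes window (both are assumptions; your monitoring tool's exact math may differ):

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a latency sample set."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(len(ordered) * 0.95))]

def sustained_breach(minute_p95s: list[float], threshold: float,
                     minutes: int) -> bool:
    """Alert only when per-minute p95 exceeds the threshold for
    `minutes` consecutive minutes. One slow minute is noise."""
    run = 0
    for value in minute_p95s:
        run = run + 1 if value > threshold else 0
        if run >= minutes:
            return True
    return False

baseline = [500.0] * 95 + [600.0] * 5   # 30-day sample; p95 is 600ms
threshold = p95(baseline) * 1.5          # alert at 900ms, per the guidance above
assert not sustained_breach([950, 400, 980, 400], threshold, minutes=3)
assert sustained_breach([950, 960, 970, 400], threshold, minutes=3)
```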

The thing you should monitor that nobody does

Password reset. Every SaaS I've worked with has had at least one 4-hour outage caused by a broken password reset flow that nobody noticed until support tickets piled up, because password reset is not a common path — so it doesn't show up in your error dashboard as anomalous.

Add a synthetic monitor that triggers password reset to a dedicated test inbox (Mailosaur, or an internal email account), clicks the link, and confirms the new password works. Run it every 15 minutes. The first time it catches a broken reset flow, you'll be grateful.

Monitoring-as-code

Past a certain team size (four engineers, ish), creating monitors by clicking through a UI stops scaling. You want your monitors in source control, diff-able, reviewable, and redeployable on a new environment. Brily's API ships with a simple declarative format that you can keep in the same repo as your application code.
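The core of any monitoring-as-code deploy step is a reconcile: diff the monitors declared in the repo against what's actually deployed. This is a generic sketch of that diff, not Brily's declarative format; the monitor fields are made up for illustration:

```python
def diff_monitors(desired: dict[str, dict], deployed: dict[str, dict]):
    """Return (to_create, to_update, to_delete) by monitor name, comparing
    the repo's desired state against the currently deployed state."""
    to_create = [n for n in desired if n not in deployed]
    to_update = [n for n in desired if n in deployed and desired[n] != deployed[n]]
    to_delete = [n for n in deployed if n not in desired]
    return to_create, to_update, to_delete

desired = {
    "login-journey": {"type": "browser", "interval": "5m"},
    "api-health":    {"type": "http", "interval": "1m"},
}
deployed = {
    "api-health":  {"type": "http", "interval": "5m"},   # drifted interval
    "old-monitor": {"type": "http", "interval": "10m"},  # removed from repo
}
create, update, delete = diff_monitors(desired, deployed)
assert create == ["login-journey"]
assert update == ["api-health"]
assert delete == ["old-monitor"]
```

Run this in CI on every merge and drift between the repo and the dashboard disappears.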

This also solves the "who owns this monitor" problem, which comes up reliably around month nine of any team's monitoring journey. Monitors in git have an owner; monitors in the UI have a mystery.

What to cut

You should have fewer monitors than you think. Every monitor is a maintenance burden: it can break, flap, or drift out of date as the product changes. A team with 40 monitors and no tuning discipline has worse signal than a team with 10 monitors and good thresholds.

Once a month, prune. Monitors that haven't alerted in 90 days are either perfectly tuned or checking something uninteresting. Be honest about which.