Engineering · 7 min read

How we dogfood Brily: what broke, what we learned

We run Brily on Brily. Here's an honest account of what the first month of dogfooding found — the monitors we didn't have, the alerts that flapped, the status page we weren't ready to publish.

The Brily team
Founders

The first principle we committed to as a team was that Brily would run on Brily. No separate monitoring stack for ourselves. Our status page is our own product. NPS from our early-access users is collected through our own widget. If something breaks in our product, the break is visible to us the same way it would be to a customer.

This is the polite version people put in "about" pages. The impolite version — what actually happened in the first month of dogfooding — is worth publishing, because it revealed a pile of gaps we would not otherwise have caught before GA.

What broke immediately

We had 12 monitors. Eight of them were misconfigured in ways that were invisible until we tried to use them in anger:

  • Probe region coverage. We had default regions enabled, but hadn't verified any of them were actually reachable from our service IPs in staging. One region couldn't reach us at all; we'd been ignoring its false positives for a week.
  • Alert routing. Everything was going to a single Slack channel, including noise from dev-environment monitors that shouldn't have been firing. Three days into on-call, the genuine signal was buried.
  • TLS expiry monitor. Configured for the production domain, not the staging domain. The staging certificate expired unnoticed and caused a round of embarrassment during an investor demo. (A corrected config is sketched after this list.)
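
For concreteness, here's roughly what the corrected setup looks like as declarative config. This is a minimal sketch with illustrative field names and domains, not Brily's actual schema:

```typescript
// Hypothetical monitor definitions. Field names and domains are
// illustrative only, not Brily's real configuration schema.
interface MonitorConfig {
  name: string;
  url: string;
  environment: "production" | "staging" | "dev";
  regions: string[];     // probe regions, each verified reachable
  alertChannel: string;  // route by environment, not one shared channel
  tlsExpiryDays?: number; // alert this many days before cert expiry
}

const monitors: MonitorConfig[] = [
  {
    name: "api-prod",
    url: "https://api.example.com/health",
    environment: "production",
    regions: ["us-east", "eu-west", "ap-south"], // verified against service IPs
    alertChannel: "#alerts-prod",
    tlsExpiryDays: 14,
  },
  {
    name: "api-staging",
    url: "https://staging.example.com/health", // the domain we forgot to cover
    environment: "staging",
    regions: ["us-east"],
    alertChannel: "#alerts-staging", // staging noise stays out of the prod channel
    tlsExpiryDays: 14,
  },
];
```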

What broke on the first real incident

Three weeks in, our primary database had a 20-minute connection pool issue. The incident revealed three things about our incident tooling we hadn't predicted:

  • We didn't have a runbook for "database connection pool saturated." The on-call engineer had to diagnose from first principles at 2am.
  • The "promote alert to status page incident" feature in our own product had a 4-second latency on the first click because the component dropdown loaded lazily. Didn't notice in tests. Extremely noticeable at 2am.
  • The post-mortem we wanted to publish on our own status page required a feature we hadn't built yet: timeline attachments. We published a text-only version and put the feature on the roadmap.
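
The dropdown mitigation is a generic pattern rather than anything Brily-specific: start loading the data when the dialog opens, not when the dropdown is clicked. A sketch, with hypothetical function names and endpoint:

```typescript
// Hypothetical sketch: kick off the component fetch as soon as the
// promotion dialog opens, so the dropdown click awaits an in-flight
// (or already resolved) request instead of starting a cold one.
let componentsPromise: Promise<string[]> | null = null;

async function fetchComponents(): Promise<string[]> {
  const res = await fetch("/api/status-page/components"); // illustrative endpoint
  return res.json();
}

function onPromoteDialogOpen(): void {
  // Start the request early, exactly once per dialog lifetime.
  componentsPromise ??= fetchComponents();
}

async function onDropdownOpen(): Promise<string[]> {
  // Falls back to a fresh fetch if the dialog hook never ran.
  componentsPromise ??= fetchComponents();
  return componentsPromise;
}
```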

This is the main value of dogfooding: the gaps that are invisible in happy-path testing are revealed in the first real incident. No amount of reviewing mocks reveals them.

The NPS we ran on ourselves

We ran our own NPS survey on the 40-person early-access cohort after the first milestone release. The result was humbling in a productive way.

Aggregate NPS: 32. Not disastrous, not great. The verbatim comments, though, told a clear story: people loved the three modules individually; the integration between them was underbaked. Specifically, the cross-module features we thought were our differentiator — one-click promotion of alerts to status page incidents, release markers visible on NPS charts — were rough or undiscovered by half the cohort.
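
For anyone who doesn't live in survey tooling: NPS is the percentage of promoters (scores 9-10) minus the percentage of detractors (0-6), so small cohorts move the number a lot. A minimal computation, with illustrative response counts:

```typescript
// Standard NPS: percent promoters (9-10) minus percent detractors (0-6);
// passives (7-8) count toward the denominator only.
function nps(scores: number[]): number {
  const promoters = scores.filter((s) => s >= 9).length;
  const detractors = scores.filter((s) => s <= 6).length;
  return Math.round(((promoters - detractors) / scores.length) * 100);
}

// Illustrative numbers only: if 25 of the 40 invitees responded, with
// 13 promoters, 7 passives, and 5 detractors, that's (13 - 5) / 25 = 32.
const scores = [
  ...Array(13).fill(9),
  ...Array(7).fill(7),
  ...Array(5).fill(3),
];
console.log(nps(scores)); // => 32
```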

We rewrote the onboarding to make those features visible in the first-session flow. The next cohort's NPS climbed to 48, and the shift in verbatim comments confirmed it was the integration features that moved the needle. Without cohort comparisons tied to releases, we would have argued about whether the climb was real. With them, we could see it was driven by the users who had actually been exposed to the new flow, and we stopped arguing.

The thresholds we're still tuning

We advise customers to do quarterly threshold reviews. We ran our first one at month two and found:

  • Three monitors that had never fired — turns out they were checking endpoints we'd deprecated. Deleted.
  • Two monitors that had fired constantly — turns out we had aggressive latency budgets set during a load test and never reverted them. Loosened.
  • One monitor that had fired on a real incident and alerted us correctly — good signal, left untouched.

The share of genuine signal in our initial alert stream was about 40%. Not good. After tuning, it's around 85%. This is exactly the conversation we think every customer will have in their first quarter — and the fact that we had it ourselves makes us more honest when we advise it.
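
Mechanically, the review itself is simple. Here's a sketch of the bucketing, assuming each alert in the quarter's history has been hand-tagged as actionable or noise (the record shape is ours to invent here, not Brily's schema):

```typescript
// Hypothetical quarterly-review helper: bucket each monitor by how it
// actually behaved over the quarter.
interface AlertRecord {
  monitor: string;
  firedAt: Date;
  actionable: boolean; // tagged during review: real signal or noise
}

type Verdict = "never-fired" | "mostly-noise" | "good-signal";

function reviewMonitors(
  monitorNames: string[],
  alerts: AlertRecord[],
): Map<string, Verdict> {
  const verdicts = new Map<string, Verdict>();
  for (const name of monitorNames) {
    const fired = alerts.filter((a) => a.monitor === name);
    if (fired.length === 0) {
      verdicts.set(name, "never-fired"); // deletion candidate
      continue;
    }
    const signal = fired.filter((a) => a.actionable).length / fired.length;
    verdicts.set(name, signal >= 0.5 ? "good-signal" : "mostly-noise");
  }
  return verdicts;
}
```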

What we changed in the product because of it

Dogfooding made four product changes obvious that hadn't been on the roadmap:

  • Quorum defaulting to 2-of-3 — because our own initial defaults were too noisy (see the sketch after this list).
  • Per-channel Slack routing by severity — because our team was drowning in a single channel.
  • Staging-aware alert routing — because we kept getting paged by dev-environment blips.
  • Post-mortem attachments on status pages — because we wanted to publish a real post-mortem and couldn't.
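
Quorum is the one worth spelling out. Under a 2-of-3 default, a single unreachable probe region, like the one that fed us false positives for a week, can't page anyone on its own. A minimal sketch of the decision, with illustrative types:

```typescript
// Minimal quorum check: page only if at least `quorum` of the enabled
// probe regions report the check as failing. Types are illustrative,
// not Brily's actual internals.
interface ProbeResult {
  region: string;
  ok: boolean;
}

function shouldAlert(results: ProbeResult[], quorum = 2): boolean {
  const failures = results.filter((r) => !r.ok).length;
  return failures >= quorum;
}

// One flaky region out of three stays silent; two failing regions page.
shouldAlert([
  { region: "us-east", ok: false },
  { region: "eu-west", ok: true },
  { region: "ap-south", ok: true },
]); // => false

shouldAlert([
  { region: "us-east", ok: false },
  { region: "eu-west", ok: false },
  { region: "ap-south", ok: true },
]); // => true
```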

What we're still willing to admit is broken

In the spirit of dogfooding honesty: our onboarding flow is still too long. Our docs are behind the code. Our roadmap is optimistic on the agency tier timeline — we're targeting V0.2 in three months, which is aggressive.

If you're in the early-access cohort and any of these are blocking you, tell us. The meta-lesson of dogfooding is that the product you ship is always a little worse than the demo, and the only way to close the gap is to live in it with your users.