How I Know My System Is Healthy, And What It Costs To Keep It That Way

Cost-to-serve, scaling, reliability, throughput. A CTO's measurement framework for any software product, with six months of real production data.

Last week I sat down to put together the technology and cost-to-serve review for the first half of 2026. It is something I do quarterly. The first time it felt like a chore. By the second time it had become a habit. By the third it had become a discipline.

When you are the only engineer in the company, no one else is going to tell you whether the system you built is healthy. There is no peer review of your dashboard. There is no second pair of eyes on your alert thresholds. There is no team standup where someone says “have you looked at this number lately?”. You build the mirror, or you fly blind.

Earlier this year I wrote about the first three months of building Atherio in numbers. This is the H1 update. Broader, deeper, and from a different angle. Less about throughput and velocity, more about what I measure, why I measure it that way, and what the numbers told me to do next.

It’s Not About The Numbers, But About Deciding Which Numbers To Look At

A team has natural reality checks. People disagree about what is healthy. Engineers complain when a deploy is painful. Customer success surfaces friction the architecture didn’t see coming. The system is observed from many angles, and the angles correct each other.

For now, I don’t have this luxury at Atherio.

What you do not measure, you cannot see. What you cannot see, you cannot reason about. And what you cannot reason about quietly becomes the thing that breaks. Not loudly. Not in a way the dashboard would catch. Just slowly, in the gap between what the system actually does and what you remember it doing.

So the first job is not to look at numbers. It is to decide which numbers to look at. To choose deliberately rather than by default. A dashboard built from defaults will tell you what the platform vendor wants you to see, which is rarely what you actually need to know.

In my experience, five dimensions matter for a technology startup at this stage. What it costs to serve a customer. How it scales when more arrive. Whether it works when they use it. What the system is actually producing. And what is going to bite later. Each one is its own discipline. The rest is texture.

Measurement is how the system you built tells you what it has actually become. It is the feedback that a Cycle-Driven Engineering loop runs on.

The Anatomy Of Cost-To-Serve

A definition first, since the term is not universal. Cost-to-serve is the all-in cost of running the platform that delivers the product to customers. Infrastructure plus tooling. The bill that arrives whether or not anyone shipped code this month. It excludes the cost of building the product, which is a separate line entirely. Cost-to-serve answers a different question: if engineering stopped tomorrow, what would the platform still cost to keep running for the customers we already have?

Atherio’s cost-to-serve peaked at around £404 per month in April. That is the all-in infrastructure and tooling spend across production and test environments. Most months it sat closer to £290.

Broken down by service across the H1 window:

Azure App Service (including the Static Web App): £398
SQL Database: £209
Foundry Models, which is Azure OpenAI: £199
Log Analytics, which absorbs Application Insights ingestion: £39
Azure Monitor alert rules: £4
Key Vault, despite ~792 operations over 3.7 months: £0.01 (we cache secrets once read, so incoming API requests do not hit Key Vault each time)

The rest is rounding error.

App Service and SQL are essentially fixed within their tiers. A single P1v3 instance costs the same whether CPU sits at 5% or 95%. SQL S2 costs the same whether DTU usage is at 1% or saturated. They step when capacity steps, either because autoscale-out adds instances or because a tier change is triggered. They do not move with load.

Log Analytics and Azure Monitor scale, but slowly and predictably with telemetry volume.

Foundry is different. Foundry scales linearly with how much customers use the product. Every communication event we process for sentiment, every Athi chat token, every DEE calculation, lands here. It is the only line in the cost-to-serve breakdown that moves with usage rather than with capacity.

That makes Foundry simultaneously the most important number to watch and the only one that can damage margin at scale.
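The split between capacity-driven and usage-driven cost can be sketched as a small model. This is a minimal illustration, not the actual billing logic: the class name, its fields, and every price in the example are hypothetical placeholders, chosen only to show that the fixed part steps with instances while the variable part moves with every event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    """Monthly cost-to-serve split into capacity-driven and usage-driven parts."""

    fixed_per_instance: float      # hypothetical flat monthly price of one compute instance
    fixed_database: float          # hypothetical flat monthly price of the database tier
    variable_per_1k_events: float  # hypothetical AI spend per 1,000 processed events

    def monthly_cost(self, instances: int, events: int) -> float:
        # Capacity cost changes only when the instance count (or tier) changes.
        capacity = self.fixed_per_instance * instances + self.fixed_database
        # Usage cost moves with every event processed.
        usage = self.variable_per_1k_events * events / 1000
        return capacity + usage


# Placeholder prices, not taken from any real bill.
model = CostModel(fixed_per_instance=100.0, fixed_database=50.0, variable_per_1k_events=0.10)
quiet = model.monthly_cost(instances=1, events=100_000)
busy = model.monthly_cost(instances=1, events=1_000_000)
print(f"quiet month: £{quiet:.2f}, busy month: £{busy:.2f}")
```

With the instance count held constant, the whole difference between the two months comes from the usage term, which is why that line deserves the closest watch.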

I picked Azure OpenAI rather than going to OpenAI directly for three reasons. Billing predictability, since it flows through the same Azure cost management as everything else. Regional residency, since the rest of the platform sits in UK South. And consistent model tiers, so I am not chasing version-skew across two providers. The cost premium for going through Azure is not zero. It is also not large enough to outweigh the operational coherence.

Some Basic Unit Economics

Aggregate costs lie. They smooth over the variance that actually matters. A platform that costs £290 per month to serve looks the same on paper whether it is serving 47 managers or 470. The shape of the unit economics only emerges when you divide.
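The division itself is trivial, which is rather the point. A minimal sketch using the illustrative £290 bill and the 47 versus 470 managers from the paragraph above; the function name is hypothetical:

```python
def cost_per_unit(monthly_cost: float, units: int) -> float:
    """Spread a monthly bill over the number of units served."""
    if units <= 0:
        raise ValueError("units must be positive")
    return monthly_cost / units


# Same bill, two very different denominators.
for managers in (47, 470):
    print(f"{managers} managers: £{cost_per_unit(290.0, managers):.2f} each")
```

The bill is identical in both cases; only the denominator changes, and the per-unit figure shifts by a factor of ten.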

Two units are useful in our case. Cost per licensed manager, which is what we sell. And cost per live direct report, which is what the system actually serves.

Per licensed manager, the trajectory over the half:

Feb: £13.55
Mar: £16.38
Apr: £7.22
May (partial): £5.90

Per live direct report:

Feb: £1.88
Mar: £2.28
Apr: £1.30
May (partial): £1.22

The drop between March and April is not magic. It is the fixed cost amortising over a larger denominator as more managers came online. The platform did not become cheaper. The base spread thinner.

But the per-direct-report number is the truer one. The reason is in the variance between tenants. Across May, the heaviest tenant in our customer base processed roughly 1,006 communication events per live direct report. The lightest processed 348. The heaviest tenant earlier in the half ran above 2,300 events per direct report before tapering.

Translated into cost, a comms-heavy customer is roughly four times more expensive to serve than a quiet one. Same product. Same infrastructure. Same per-event cost. Different usage profile.

That number is not a problem. It is an input. It feeds into how we size deals, price tiers, and identify which customer profiles are healthiest for the business. Without it you are flying with a single average and pretending all customers cost the same.

I separate the two units deliberately. Cost per licensed manager is the headline number that lives in commercial conversations. Cost per direct report times communication intensity is the operating reality that lives in mine.

The aggregate makes the business look uniform. The unit shows you that no two customers cost the same to serve.

Capacity Is Bought Reactively, Not In Anticipation

There is a comfortable temptation in engineering to buy capacity ahead of need. It feels like prudence. It looks like foresight. It is, more often, a margin you gave away to feel safe.

I apply a single rule. We do not step the SQL tier or the App Service Plan until sustained P95 utilisation crosses around 45%. Below that, the existing tier has headroom and the spend is unjustified.

Today the numbers are nowhere near that threshold. Across a recent 7-day sample, with 5-minute granularity:

SQL S2 DTU: median 0%, P95 1%
App Service Plan P1v3 CPU: median 6%, P95 9%
App Service Plan P1v3 memory: P95 42%

This means SQL S2 has DTU headroom out to roughly 2,000 managers before the tier matters. The App Service Plan steps once, P1v3 to P2v3, somewhere around 300 to 500 managers, and again to P3v3 around 1,000. Even at 1,000 managers, infra plus tooling lands at roughly £1,345 per month. The curve is dominated by the linear variable, around £0.16 per direct report per month, not by tier steps.

A note on what is and isn’t the real risk in that forecast. DTU has room. The likelier first SQL trigger is storage, since the database is currently capped at 10 GB and data accumulates. A storage cap raise or an S3 step is cheap. Worth monitoring, not worth preempting.

The trap with capacity is that engineers like to bump tiers when something feels close. P95 memory at 42% feels close. It is not close. It is half the tier’s capacity. A rule based on observed sustained utilisation rather than vibe is the only honest way to spend on capacity.
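The rule is simple enough to write down as code. A minimal sketch, assuming utilisation arrives as percentage samples grouped into windows (for example, one list of 5-minute samples per week). Treating "sustained" as "P95 above the threshold in every recent window" is my assumption for the sketch, and the function names are hypothetical.

```python
import statistics

STEP_THRESHOLD_PCT = 45.0  # the "around 45%" rule described above


def p95(samples: list[float]) -> float:
    """95th percentile of utilisation samples (needs at least two samples)."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(samples, n=20, method="inclusive")[-1]


def should_step_tier(windows: list[list[float]], threshold: float = STEP_THRESHOLD_PCT) -> bool:
    """True only if P95 exceeds the threshold in every window, i.e. it is sustained."""
    # An empty history is never a reason to spend money.
    return bool(windows) and all(p95(w) > threshold for w in windows)


# Illustrative samples only: one calm window, one window with memory near 42%.
calm_cpu = [6.0] * 95 + [9.0] * 5
memory = [42.0] * 100
print(should_step_tier([calm_cpu, memory]))
```

A single spike above the threshold does not trigger a step under this sketch; only consistent pressure across windows does.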

Reliability Is What Users Experience

The headline reliability number for the H1 window is straightforward. Across all business API endpoints, the 5xx rate sat at approximately 0.01%. The exact count: 11 server errors in March, zero in all other months, against a denominator that grew from 1,468 requests in February to 11,481 in May.

Functional success, which I define as 2xx responses divided by total requests excluding 401 and 403, traced this curve over the half:

Feb: 95.96%
Mar: 98.15%
Apr: 98.53%
May (partial): 99.27%

That trajectory matters because volume grew roughly eight times across the same window. Reliability did not degrade under load. It improved.

The more interesting number is what I deliberately excluded.

401 responses are the SPA’s normal “am I logged in?” check against the auth endpoint. They are protocol, not failure. 403 responses are RBAC permission denials on endpoints the user is not authorised to call. They are expected, not faults. Including them in the denominator would have made the dashboard a liar. A user who is not logged in does not consume the system in the way a 200 does. Lumping the two together is statistical malpractice.
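The functional success definition above can be sketched in a few lines. The function name and the sample status codes are hypothetical; the exclusion of 401 and 403 from the denominator follows the definition in the text.

```python
from collections import Counter
from typing import Iterable, Optional

EXCLUDED = {401, 403}  # auth probes and RBAC denials: protocol, not failure


def functional_success_rate(status_codes: Iterable[int]) -> Optional[float]:
    """2xx responses divided by total requests, excluding 401 and 403."""
    counts = Counter(status_codes)
    considered = sum(n for code, n in counts.items() if code not in EXCLUDED)
    if considered == 0:
        return None  # nothing measurable in this window
    successes = sum(n for code, n in counts.items() if 200 <= code <= 299)
    return successes / considered


# Made-up traffic mix for illustration.
sample = [200] * 95 + [401] * 20 + [403] * 5 + [404] * 4 + [500] * 1
print(f"{functional_success_rate(sample):.2%}")
```

Had the 25 auth responses stayed in the denominator, the same traffic would report 76% rather than 95%, which is the distortion the exclusion avoids.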

The same discipline applies to exceptions. Across the window, the exception count rose from 126 in February to 3,748 in May. That looks bad on a sparkline. It is not bad. Roughly 93% of those exceptions are third-party API transients, dominated by two sources: Graph throwing OData errors (4,895 across the window, mostly recoverable) and Azure OpenAI’s ClientResult transients (2,363). The retry plus backoff layer absorbs them, and they always happen during the nightly service runs, never on user requests.
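For readers unfamiliar with the pattern, a retry layer of this kind might look like the sketch below. It is a generic illustration, not Atherio's implementation: the exception class, the function name, the attempt count, and the choice of exponential backoff with jitter are all assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for a recoverable third-party failure."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            # Wait a random time between 0 and base_delay * 2^(attempt - 1) seconds.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    raise ValueError("max_attempts must be at least 1")


# Usage: an operation that fails twice, then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary upstream error")
    return "ok"


result = call_with_retry(flaky, sleep=lambda seconds: None)  # skip real waiting in the demo
print(result, calls["n"])
```

Each absorbed failure still shows up as a logged exception, which is how a rising exception count can coexist with a flat 5xx rate.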

There is one honest gap in this picture. I do not run synthetic availability tests. I cannot, today, report a true uptime percentage. The 5xx rate is a strong proxy that I think is enough for the current stage of the system.

Reliability is what users experience. Everything else is decoration.

Throughput Is The Heartbeat

There is one operational number that ties all of the above together. Throughput.

Across the H1 window, Atherio processed approximately 1.75 million communication events. Every email, every Teams message, every interaction that the platform analyses for sentiment, signal, or coaching opportunity is one event. They are dated by when they actually occurred, not by when we ingested them.

Divided by the Foundry spend, the variable AI cost per 1,000 communication events comes in at roughly £0.11.
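The arithmetic behind that figure, as a short sketch. It assumes the £199 Foundry line from the service breakdown and the approximate 1.75 million events cover the same H1 window; both inputs are rounded in this article, so the result is approximate too.

```python
foundry_spend_gbp = 199.0     # Foundry line from the H1 service breakdown
events_processed = 1_750_000  # approximate H1 communication events

cost_per_1k_events = foundry_spend_gbp / events_processed * 1000
print(f"£{cost_per_1k_events:.2f} per 1,000 events")  # prints £0.11 per 1,000 events
```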

That number is not impressive. That is the point. It is small enough that the unit economics work at scale. It is stable enough that I can forecast confidently. And it is the single metric that moves when the business moves. More managers onboarding means more direct reports. More direct reports means more events. More events means more Foundry spend, more SQL writes, more inference, more value delivered.

Find the one number that moves when the business moves. Watch it weekly. Everything else is downstream.

The Discipline, Not The Dashboard

There is a temptation, especially solo, to treat measurement as proof of effort. To accumulate dashboards. To wave at telemetry as if volume of data were the same as understanding.

It is not. A dashboard you do not read is worse than no dashboard, because it lets you believe you are observing the system. A metric you do not act on is decoration. A number you collect but never interrogate is overhead.

The discipline of being a CTO at this stage is not building the dashboard, but choosing the five or six numbers that actually matter, refusing the dozens that flatter, and being honest about the ones you do not yet know how to capture.

The system we built is healthy by the measures I chose. Some of those measures will be wrong. I will discover that, and replace them. That is also part of the discipline.

Measurement is not a report card. It is the conversation between the CTO and the system they built. The numbers do not make the decisions. They just refuse to let you pretend.

FAQ

What is cost-to-serve?

Cost-to-serve is the all-in cost of running the platform that delivers the product to customers, infrastructure plus tooling, the bill that arrives whether or not anyone shipped code that month. It excludes the cost of building the product, which is a separate line entirely. It answers a single question: if engineering stopped tomorrow, what would the platform still cost to keep running for the customers we already have?

Which numbers actually matter for a startup’s system health?

Five dimensions: what it costs to serve a customer, how it scales when more arrive, whether it works when they use it, what the system is actually producing, and what is going to bite later. Each one is its own discipline. The first job is not to look at numbers but to decide which numbers to look at, rather than accepting whatever a vendor dashboard shows you by default.

Why track cost per direct report instead of just total monthly cost?

Aggregate costs hide the variance that matters. The same platform can serve very different customers: across one month the heaviest tenant processed roughly 1,006 communication events per live direct report against 348 for the lightest, which makes a comms-heavy customer about four times more expensive to serve. That per-unit number is an input into how you size deals, price tiers, and spot which customer profiles are healthiest for the business.

When should you upgrade an infrastructure tier?

Not on a hunch. The rule here is to hold the current tier until sustained P95 utilisation crosses around 45 percent; below that, the tier still has headroom and the spend is not justified. Buying capacity ahead of need usually is not prudence, it is margin you gave away to feel safe.

Why exclude 401 and 403 responses from reliability numbers?

Because they are not failures. A 401 is the app’s normal “am I logged in?” check and a 403 is an expected permission denial, so counting them as errors would make the dashboard lie. Reliability is measured on what users actually experience, with the 5xx rate, around 0.01 percent across the half, as the honest signal.