
5 mins to read • AI
The Real AI Inference Cost Behind Your AI-Native Cloud Bill

Director, Cloud & AI Engineering
Engineering • inference economics • AI unit economics
Building AI products is exciting. Paying for them is not. Every founder we meet in 2026 is shipping fast, growing fast, and quietly worried about one slide in the deck. The slide labeled gross margin. The one nobody wants to project past year three.
At ATCON, we work with engineering and finance leaders who are trying to make the math work. The cloud bill is the symptom. The real story is the cost of running AI. And it shows up everywhere: in the spread between input and output tokens, in the chips you rent, in the model you pick. CFOs are starting to ask. Most engineering leaders do not have a clean answer.
The new shape of AI inference cost
Old-school cloud cost work was about idle servers and oversized databases. That playbook does not help here. When 70% of your variable cost is tokens, chasing idle Azure VMs is a rounding error.
Two AI products built on the same model can have very different unit economics. The difference is not in the Azure portal. It is in how long your prompts are. How long the answers are. How often a power user hits retry. That is where the money goes.
The numbers tell the story. OpenAI's profit margin slipped from 40% to 33% across 2025 as inference costs quadrupled. Cursor, one of the fastest-growing AI products of the year, was reportedly paying about $650 million a year to Anthropic on roughly $500 million in revenue. That is a profit margin of negative 30%. That is not a software business. That is a tolling deal with a model provider.
The FinOps Foundation's 2026 report says 98% of practitioners now actively manage AI cost. Two years ago that number was 31%. Teams who treat this as procurement are the ones still surprised by the bill.
AI-native is not a product category. It is a money problem. And most teams using the label have not done the math.
Choosing how to build: RAG, fine-tuning, or distillation
How you build the AI is a money decision before it is an engineering one. We tell clients to put it on a whiteboard, in dollars per active user per month, before opening a notebook.
There are basically three plays. Each one has a price tag.
RAG on a frontier API: Cheap to start. Fast to ship. But you pay a tax on every single user action, forever. Fine for a six-month prototype. Brutal at 50,000 daily users.
A fine-tuned small open model with RAG on top: The standard 2026 stack. On narrow tasks it runs 10 to 100 times cheaper than a frontier API. Quality holds up for the workflows enterprises actually pay for. This is where Mistral AI's open weights earn their keep, since teams can host them on Azure AI Foundry or in a sovereign EU region.
Distill and shrink: Not buzzwords. Line items on your bill. A smaller version of your tuned model can cut serving cost another 4 to 6 times with almost no loss in quality on your top intents, if you have the testing discipline to prove it.
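The dollars-per-active-user framing above can be put on an actual whiteboard. A minimal sketch, where every price and usage figure is an illustrative assumption rather than a quote, but the 10-to-100x and further 4-to-6x spreads from the list carry through:

```python
# Back-of-the-envelope cost per active user per month for the three plays.
# All prices and usage figures below are illustrative assumptions.

ACTIONS_PER_USER_PER_MONTH = 300      # assumed product usage
TOKENS_PER_ACTION = 2_000             # prompt + completion, assumed

# Assumed blended $ per 1M tokens for each play.
PRICE_PER_M_TOKENS = {
    "rag_on_frontier_api": 10.00,          # frontier API ballpark
    "fine_tuned_small_open_model": 0.40,   # ~25x cheaper, self-hosted
    "distilled_model": 0.10,               # a further ~4x from distillation
}

def cost_per_user(price_per_m: float) -> float:
    """Dollars per active user per month at a given token price."""
    tokens = ACTIONS_PER_USER_PER_MONTH * TOKENS_PER_ACTION
    return tokens / 1_000_000 * price_per_m

for play, price in PRICE_PER_M_TOKENS.items():
    print(f"{play:30s} ${cost_per_user(price):7.2f} / user / month")
```

Under these assumptions the frontier-API play costs $6.00 per user per month and the distilled play six cents. Swap in your own usage numbers; the point is that the spread between the plays, not any single price, drives the margin slide.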
The trap is treating any of these as a one-time pick. Model prices keep dropping. Open weights keep catching up. The right call in Q1 is a refactor by Q4. Plan for portability or plan to pay the bill twice.

AI app profit margins are improving. They are still 25 to 30 points behind mature SaaS. Source: a16z, 2026.
GPU strategy is now a CFO question
For any team running serious AI volume, the chip question has moved from infra to finance. Buy, rent, or co-locate is a board topic.
On-demand B200 capacity costs about $5.62 an hour. H100 is about $3.43. That is a 64% premium on the sticker. But the sticker lies. SemiAnalysis benchmarks show B200 delivering about 2 cents per million tokens on a popular open model, 4 to 5 times cheaper per token than H100 on heavy AI workloads. The expensive chip is sometimes the cheap one.
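The sticker-versus-token arithmetic is worth making explicit: cost per token is the hourly price divided by throughput, so a chip that rents for more can still serve each token for less. A minimal sketch, where the throughput figures are illustrative assumptions chosen to reproduce the roughly 2-cents and 4-to-5x numbers above, not benchmark results:

```python
# Why the pricier chip can be the cheaper one: $/token = $/hour ÷ tokens/hour.
# Hourly rates are from the text; throughput figures are assumptions.

def dollars_per_m_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Serving cost in dollars per million tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

h100 = dollars_per_m_tokens(3.43, tokens_per_second=10_000)   # assumed throughput
b200 = dollars_per_m_tokens(5.62, tokens_per_second=75_000)   # assumed throughput

print(f"H100: ${h100:.3f} / M tokens")
print(f"B200: ${b200:.3f} / M tokens")   # ~2 cents despite the 64% hourly premium
```

At these assumed throughputs the B200 lands near $0.021 per million tokens against roughly $0.095 for the H100, a 4.6x gap per token even though it costs 64% more per hour.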
The right answer for a 50-person SaaS is not the right answer for a 5,000-seat enterprise rollout. Reserved capacity and co-location pencil out once usage is predictable for 12 to 18 months. Bursty workloads on premium chips burn cash faster than they burn watts. European operators running in GAIA-X-aligned regions, or building under EU AI Act obligations, have to factor sovereignty into the same spreadsheet. That is one reason SAP and ING have moved parts of their inference footprint onto co-located capacity.
And then there is DeepSeek-V4. It arrived at about one-sixth the cost of frontier closed models and reset the global price floor. A contract that looked smart in January can look expensive by July.
What survives contact with margin
After a year of helping engineering and finance teams unwind these bills together, here is the short list of what actually works.
Your chip contract, your model contract, and your region choice are the same decision now. Operators who treat them separately are the ones writing apology emails to the CFO.
Three numbers your board should see every month
Tokens per active user. Cost per active user. Profit margin per workflow, not per company. If your dashboard still reports total AI spend without these three lenses, your board is flying blind. The shift from per-seat pricing to per-token, per-feature, per-user is the new ARR-per-customer math. Teams who built the tracking early are the ones still raising at clean multiples.
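The three numbers fall out of the same event stream most teams already log. A minimal sketch, where the event fields and every figure are hypothetical, showing why per-workflow margin is the lens that matters:

```python
# The three board numbers, computed from a hypothetical month of usage events.
# Field names and all figures are assumptions for illustration.
from collections import defaultdict

events = [
    # (user_id, workflow, tokens, cost_usd, revenue_usd)
    ("u1", "draft_email",   40_000, 0.40, 2.00),
    ("u1", "summarize_doc", 90_000, 0.90, 1.00),
    ("u2", "draft_email",   20_000, 0.20, 2.00),
]

users = {user for user, *_ in events}
total_tokens = sum(e[2] for e in events)
total_cost = sum(e[3] for e in events)

print(f"tokens per active user: {total_tokens / len(users):,.0f}")
print(f"cost per active user:   ${total_cost / len(users):.2f}")

# Margin per workflow, not per company: one workflow can be healthy
# while another quietly eats the whole margin.
by_workflow = defaultdict(lambda: [0.0, 0.0])   # workflow -> [cost, revenue]
for _, workflow, _, cost, revenue in events:
    by_workflow[workflow][0] += cost
    by_workflow[workflow][1] += revenue
for workflow, (cost, revenue) in by_workflow.items():
    print(f"margin for {workflow}: {(revenue - cost) / revenue:+.0%}")
```

With these made-up numbers the company-level margin looks fine while one workflow runs at a 10% margin and the other at 85%; a single blended number would hide exactly the thing the board needs to see.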
The model-portfolio approach
Stop picking a model. Pick a portfolio. Send easy questions to a small tuned model, perhaps a Mistral or an Aleph Alpha variant. Send hard questions to a frontier API on Azure OpenAI Service. Route translation-heavy work to DeepL. Audit the routing every week. Treat the model as a thing you buy, not a thing you fall in love with. Assume your year-one profit margin is wrong by 10 points in either direction. Build the cost-tracking layer before you need it.
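The portfolio idea can be sketched as a routing function. A deliberately naive version, where the model names are stand-ins for whatever sits behind each tier and the difficulty heuristic is an assumption you would replace with a real classifier:

```python
# A minimal sketch of portfolio routing: cheap tuned model by default,
# frontier API for hard prompts, a translation specialist for translation.
# Tier names and the difficulty heuristic are illustrative assumptions.

def route(prompt: str) -> str:
    text = prompt.lower()
    if "translate" in text:
        return "deepl"                 # translation-heavy work
    if len(prompt) > 500 or "step by step" in text:
        return "frontier-api"          # hard or long-context questions
    return "small-tuned-model"         # everything else takes the cheap path

print(route("Translate this contract to German"))   # the specialist tier
print(route("What is our refund policy?"))          # the cheap default
```

In practice the heuristic becomes a small classifier and the audit is a weekly report on what fraction of traffic each tier absorbed and at what cost, which is also the number that tells you when a contract has stopped making sense.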
If you are five quarters from a funding round and your unit economics still live in a Notion doc, talk to us before the diligence call does the talking for you. At ATCON, we help engineering and finance leaders model the real cost of an AI roadmap (the inference, the chips, the routing, the refactor) and build the dashboards a board will actually trust.
