
Open models vs proprietary APIs: which choice for sovereign AI?

The RagNight team · 12 min read · January 13, 2026

Self-hosted open models or closed APIs? 2026 landscape (Llama 4, Mistral, Qwen 3, DeepSeek, Gemma), five trade-off axes, a per-use decision table, and a sensitivity-based routing strategy.

"Open models or proprietary APIs?" Framed that way, the question is a trap: it assumes a side to pick once and for all. Reality is more pragmatic. The real question is not ideological but operational: which tool for which use, and how much control do you require over your data? A mature company often uses both, routing between them by context.

Let's walk through the trade-offs, without dogma, so you can make an informed decision.

"Open" does not mean what you think

First misunderstanding to clear up: open weights ≠ open source. Most "open" models publish their weights under a license allowing commercial use, sometimes with restrictions (active-user thresholds, prohibited-use clauses, attribution requirements). That is not the same as free software in the strict sense, where training code and data would also be released. In practice, you get an artifact — the weights — that you can run on your own infrastructure, but you have neither the training recipe nor any guarantee you could reproduce it. Read the licenses line by line, especially if your product exceeds a few million users or you operate in a regulated sector.

This nuance has an important operational corollary: "open" gives you the right to run the model wherever you want, and therefore control over where data is processed. That lever is precisely what matters for a sovereignty strategy — far more than the philosophical purity of the license.

The 2026 landscape of open-weight models

Today's selection is strong and varied. Each family has its sweet spot:

  • Llama 4 (Meta) — the generalist reference. Massive ecosystem (inference tooling, fine-tuning, quantization, community). Often the default when you want a versatile, well-documented model with maximum third-party tooling. Ideal for a first self-hosted production deployment where you don't want to be the one hitting the bugs.
  • Mistral Large 2 and Mistral Small 3 — European roots, excellent multilingual quality (French, German, Spanish, Italian). Large 2 for demanding generation and reasoning; Small 3 when you want an aggressive quality/cost ratio, deployable on a single GPU, perfect for high-volume RAG answer generation. A bonus for a European company: a provider under EU law, also available via an EU-hosted API.
  • Mistral NeMo — a compact model (around 12 billion parameters) co-developed with NVIDIA, built for efficiency. A good candidate for well-scoped tasks (extraction, classification, query rewriting) where a large model would be overkill.
  • Qwen 3 (Alibaba) — very strong, among the best open models overall, with excellent multilingual abilities and a wide range of sizes (from small embeddable models to very large ones). Relevant if you have multilingual needs beyond Europe (Chinese, Arabic, Asian languages) or want to match the model size closely to your GPU budget.
  • DeepSeek-V3 and DeepSeek-R1 — V3 is a capable generalist with a mixture-of-experts architecture, economical at inference relative to its size. R1 specializes in reasoning: it "thinks" before answering, which makes it relevant for multi-step analysis, verification, or decomposition of complex problems — at the cost of higher latency and token consumption.
  • Gemma 3 (Google) — compact, efficient models designed to run on modest hardware. A good choice for edge deployments, lightweight assistance tasks, or as a first-filter model in a cascade.

The performance gap with the best closed APIs has largely narrowed for most enterprise uses. On well-scoped RAG tasks, where the model synthesizes a context you supply rather than reasoning in a vacuum, a good open model often matches them.

Keep one principle in mind: size is chosen by task, not by reflex. A 7-to-24-billion-parameter model is frequently enough to produce a correct RAG answer from good context; reserve the very large models for tasks that genuinely justify them.
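As a rough sizing aid, the memory taken by the weights can be estimated as parameter count times bytes per parameter. The sketch below covers weights only and deliberately ignores KV cache, activations, and runtime overhead, so real requirements are higher. The function name and the precisions shown are illustrative assumptions, not a recommendation.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


# Sizes from the 7-24B range discussed above; precisions are common choices.
for size in (7, 24):
    for bits in (16, 8, 4):
        print(f"{size}B params at {bits}-bit: ~{weight_memory_gb(size, bits):.1f} GB")
```

Under these assumptions, a 24-billion-parameter model needs about 48 GB for its weights at 16-bit precision and about 12 GB at 4-bit, which is one reason quantized mid-sized models can fit on a single GPU.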

What closed APIs really offer

Proprietary APIs (Claude by Anthropic, GPT / o-series by OpenAI, Gemini 2.x by Google) keep real strengths:

  • Top raw performance, often leading on complex reasoning, agentic tasks, and edge cases.
  • Zero infrastructure: no GPU to provision, no inference server to optimize, no MLOps. You call a URL and pay per use.
  • Continuous updates with no effort: you benefit from new versions without retraining or redeploying anything.
  • Immediate elasticity: a traffic spike is absorbed by the provider, where your GPU cluster would have saturated.

In return: vendor dependency (pricing, model deprecation, availability), per-token cost that climbs linearly with volume, and above all your data leaving your perimeter, which is critical for sovereignty. Even with contractual non-retention and no-training commitments, prompts transit through third-party infrastructure, often outside the EU. For certain kinds of processing (health data, sensitive customer data, trade secrets), that fact alone can be a deal-breaker regardless of cost.

Five axes of trade-off

To decide, score each option on five dimensions. None settles the matter alone: it's their combination that dictates the choice.

  1. Performance — is the model good enough for the intended use? The right question is not "is it the best in the world" but "does it pass my golden set of reference questions with an acceptable rate of faithful answers." Measure with a reproducible method (e.g. RAGAS: faithfulness, answer relevancy, context precision) rather than by feel.
  2. Total cost (TCO) — per token for APIs; GPU + operations + amortization + supervision for self-hosting. The tipping point depends on volume (see below). Classic trap: comparing an API's per-token price to a GPU's hourly cost alone, forgetting ops, redundancy, and engineering time.
  3. Control / sovereignty — who controls the model, its availability, its evolution? With an API, a model can be deprecated or change behavior overnight; with open weights, you freeze a version for as long as needed (useful for reproducibility and audits).
  4. Data confidentiality — do your prompts and documents leave your infrastructure? And if so, to which jurisdiction, under which contractual guarantees? This is where the EU AI Act and GDPR weigh in: for a high-risk system, traceability of processing is not optional.
  5. Operational effort — do you have the GPU/MLOps skills to self-host sustainably, not just for a demo? Serving a model in production means handling scaling, updates, supervision, and incidents at night.
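For axis 1, the golden-set check can be reduced to a pass rate against a faithfulness bar. This is a minimal sketch: the `GoldenResult` type, the 0.8 threshold, and the scores are hypothetical, and in practice the scores would come from your evaluation tool rather than being typed in by hand.

```python
from dataclasses import dataclass


@dataclass
class GoldenResult:
    question: str
    faithfulness: float  # score in [0, 1] from your evaluator (made-up values here)


def pass_rate(results: list[GoldenResult], min_faithfulness: float = 0.8) -> float:
    """Share of golden-set questions whose answer meets the faithfulness bar."""
    if not results:
        raise ValueError("golden set is empty")
    passed = sum(1 for r in results if r.faithfulness >= min_faithfulness)
    return passed / len(results)


# Illustrative scores only; real ones come from an evaluation run.
results = [
    GoldenResult("What is the notice period?", 0.95),
    GoldenResult("Who approves refunds?", 0.60),
    GoldenResult("Which plan includes SSO?", 0.88),
    GoldenResult("What is the data retention policy?", 0.91),
]
rate = pass_rate(results)
print(f"pass rate: {rate:.0%}")
```

Running the same golden set against each candidate model gives a like-for-like comparison, which is the point of measuring rather than judging by feel.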

Across the five axes, the closed API often wins on performance and operational effort (elasticity included); self-hosted open wins on control, confidentiality, and TCO at high volume. Your decision comes down to weighting these axes by use case.
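One way to make that weighting explicit is a simple weighted average per option. The sketch below assumes 1-to-5 scores where higher is better on every axis; all scores and weights are hypothetical placeholders for a use case handling regulated data, to be replaced with your own assessment.

```python
AXES = ("performance", "tco", "control", "confidentiality", "ops_effort")


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-axis scores (higher is better on every axis)."""
    total_weight = sum(weights[a] for a in AXES)
    return sum(scores[a] * weights[a] for a in AXES) / total_weight


# Hypothetical 1-5 scores for one use case.
options = {
    "closed_api": {"performance": 5, "tco": 3, "control": 2,
                   "confidentiality": 2, "ops_effort": 5},
    "self_hosted_open": {"performance": 4, "tco": 4, "control": 5,
                         "confidentiality": 5, "ops_effort": 2},
}
# Hypothetical weights: confidentiality dominates for regulated data.
weights = {"performance": 2, "tco": 1, "control": 2,
           "confidentiality": 4, "ops_effort": 1}

ranking = sorted(options, key=lambda name: weighted_score(options[name], weights),
                 reverse=True)
for name in ranking:
    print(f"{name}: {weighted_score(options[name], weights):.2f}")
```

With these placeholder numbers the self-hosted option ranks first; shifting weight toward operational effort or performance can reverse the order, which is why the exercise is worth repeating per use case.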

The TCO tipping point: an order of magnitude

This is the recurring question: at what volume does self-hosting become cost-effective? There is no universal number, but there is a simple logic and a conservative order of magnitude.

On the API side, cost is variable: it grows with every token processed. On the self-hosted side, cost is largely fixed: a GPU rented from a sovereign European host costs a roughly constant amount per month, whether you use it at 10% or 90%. The tipping-point calculation is therefore about finding the volume at which the monthly fixed cost of your infrastructure drops below the variable API bill for the same work.

Reasoning in orders of magnitude, deliberately conservative:

  • Assume a mid/high-end dedicated GPU at an EU host of around €1,500 to €2,500 per month, able to serve a mid-sized open model (7–24 billion parameters) continuously.
  • On that GPU, you can reasonably process tens of millions of tokens per day once inference is optimized (batching, quantization).
  • A capable API charges for the equivalent work at a per-million-token price that, aggregated over that volume, quickly exceeds that monthly rent once you run continuously.

Practical conclusion: as long as your usage is sporadic or low-volume, the API almost always wins (you pay only for what you consume, with no idle GPU). Once you reach sustained, predictable traffic that would reasonably saturate a GPU for several hours a day, self-hosting tips in your favor — and the gap then widens to your advantage as volume grows.
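The tipping-point logic can be written as a one-line calculation: the break-even volume is the monthly fixed cost divided by the API price per token. In this sketch the GPU rent sits inside the range given above, while the API price is a made-up blended figure and not a quote from any provider.

```python
def break_even_tokens_per_month(fixed_cost_eur: float,
                                api_price_per_million_eur: float) -> float:
    """Monthly token volume at which the API bill equals the fixed self-hosting cost."""
    if api_price_per_million_eur <= 0:
        raise ValueError("API price must be positive")
    return fixed_cost_eur / api_price_per_million_eur * 1_000_000


gpu_rent_eur = 2_000   # within the 1,500-2,500 EUR range used above
api_price_eur = 2.0    # ASSUMPTION: blended price per million tokens, illustrative only

monthly = break_even_tokens_per_month(gpu_rent_eur, api_price_eur)
print(f"break-even: {monthly / 1e6:.0f}M tokens/month, "
      f"about {monthly / 30 / 1e6:.1f}M tokens/day")
```

With these illustrative inputs the break-even lands near one billion tokens per month, roughly 33 million per day. The threshold moves in proportion to the API price you actually pay, and this version counts GPU rent only.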

The decisive factor is not raw volume but your hardware's utilization rate. A GPU at 15% occupancy costs more per useful token than an API. A well-filled GPU costs a fraction of the price. Before buying, measure your real load over a few weeks.

Don't forget to add the invisible line items to the fixed cost: redundancy for high availability (often a second GPU), engineering operations time, and a fast-moving market that may render your model obsolete before amortization. The right threshold includes these, not just the GPU rent.
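To see how utilization and the invisible line items change the picture, the cost per million useful tokens can be computed from the full fixed cost. Every figure below is an assumption chosen for illustration: two GPUs for redundancy, a flat monthly operations allowance, and a daily capacity for the serving GPU.

```python
def cost_per_million_tokens(monthly_fixed_eur: float, capacity_tokens_per_day: float,
                            utilization: float, days: int = 30) -> float:
    """Self-hosting cost per million useful tokens at a given utilization (0 < u <= 1)."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    useful_tokens = capacity_tokens_per_day * utilization * days
    return monthly_fixed_eur / useful_tokens * 1_000_000


# ASSUMPTIONS: two GPUs at 2,000 EUR each plus 1,500 EUR of ops time per month.
monthly_fixed = 2 * 2_000 + 1_500
capacity = 50_000_000  # tokens/day the serving GPU could handle at full load (assumed)

for u in (0.15, 0.50, 0.90):
    cost = cost_per_million_tokens(monthly_fixed, capacity, u)
    print(f"utilization {u:.0%}: ~{cost:.2f} EUR per million tokens")
```

Under these assumptions the same hardware is six times more expensive per useful token at 15% utilization than at 90%, which is why measuring real load comes before any purchase.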

Decision table by use

| Use | Default recommendation |
| --- | --- |
| Fast prototyping, low volume | Closed API — maximum velocity, no infra |
| Very high, stable volume | Open self-hosted — TCO tips in its favor |
| Ultra-sensitive / regulated data | Open self-hosted or EU API with contractual non-retention |
| Demanding European multilingual | Open Mistral / Qwen, very competitive |
| Occasional complex reasoning | Closed API or self-hosted DeepSeek-R1 |
| Scoped, high-volume tasks (extraction, classification, query rewriting) | Small open model (Mistral Small 3, NeMo, Gemma 3) — cheap and sufficient |
| Unpredictable traffic spikes | Closed API or overflow to the API beyond cluster capacity |

There is no single answer: there is a per-use trade-off. That's exactly why routing is the winning strategy.

The hybrid approach: route by sensitivity

The best architecture often combines both worlds via routing. The idea: don't pick one model for the whole company, but steer each request to the destination that offers the best trade-off for it. Concretely, you place a router upstream that classifies the request and decides where to send it.

A realistic routing strategy comes down to a few rules, evaluated in order:

  1. Classify data sensitivity. If the prompt or retrieved context contains personal, health, or trade-secret data, the request does not leave your perimeter: it is served by the self-hosted open model, full stop. This is a compliance rule, not an optimization.
  2. Estimate task difficulty. Direct factual answer from good context? A mid-sized open model is enough. Multi-step reasoning, synthesis of conflicting sources, agentic task? Allow escalation to a more capable model.
  3. Check the load. If the GPU cluster is near saturation and the request is non-sensitive, overflow to a closed API rather than degrading latency for everyone. This is the overflow pattern: the API acts as a spike absorber, not the main engine.
  4. Measure quality afterward. Sample answers, score them (LLM-as-judge, golden set), and adjust thresholds. A request poorly handled by the small model can be re-sent to the large one — this is the cascade pattern (try the cheapest first, escalate when confidence is insufficient).

Here is the skeleton of such routing logic:

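from typing import Callable

def route(request: str,
          contains_sensitive_data: Callable[[str], bool],
          is_complex_reasoning_task: Callable[[str], bool],
          gpu_cluster_saturated: Callable[[], bool]) -> str:
    # The three checks are placeholders to implement for your own stack.
    # Rules are evaluated in order; the sensitivity check always comes first.
    if contains_sensitive_data(request):
        return "local_open_model"     # never externalized

    if is_complex_reasoning_task(request):
        return "closed_api"           # premium justified

    if gpu_cluster_saturated():
        return "closed_api"           # overflow

    return "local_open_model"         # default, high volume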
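The skeleton covers rules 1 to 3; the cascade pattern from rule 4 can be sketched separately. This is a minimal illustration in which a model is any callable returning an answer and a confidence score. The stub models, the 0.7 threshold, and the way confidence is produced are all hypothetical.

```python
from typing import Callable

# A model is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], tuple[str, float]]


def cascade(prompt: str, small: Model, large: Model,
            min_confidence: float = 0.7) -> tuple[str, str]:
    """Try the cheap model first; escalate only when its confidence is too low."""
    answer, confidence = small(prompt)
    if confidence >= min_confidence:
        return answer, "small"
    answer, _ = large(prompt)
    return answer, "large"


# Stub models standing in for real inference calls (made-up behavior).
def small_model(prompt: str) -> tuple[str, float]:
    return ("short answer", 0.9 if len(prompt) < 40 else 0.4)


def large_model(prompt: str) -> tuple[str, float]:
    return ("detailed answer", 0.95)


print(cascade("What is our refund window?", small_model, large_model))
print(cascade("Reconcile the conflicting clauses across these three contracts.",
              small_model, large_model))
```

If the large model is a closed API, the sensitivity check from rule 1 still has to run before the cascade, so that escalation never sends protected data outside.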

The result: nothing critical leaves, you serve the bulk of the volume at near-zero marginal cost on your own infrastructure, and you pay the API premium only where it adds real value — peak performance or spike absorption. This decoupling is also an insurance policy: if a vendor deprecates a model or changes pricing, you reroute non-sensitive traffic without rewriting your application.

The hidden cost of self-hosting

Self-hosting is not free just because the weights are. You need:

  • GPUs (purchased or rented), often under-utilized off-peak or saturated during peaks — hence the importance of sizing and batching;
  • availability: high availability, scaling, supervision, a recovery plan;
  • inference expertise (quantization, inference servers, memory optimization, context-length management);
  • continuous tracking of new model releases, which come fast: this quarter's open state of the art will be surpassed next quarter, and migrating has a cost.

For a team without MLOps culture, "free" open models can cost more than an API — in time, reliability, and opportunity cost. Conversely, a team that already masters inference amortizes those skills across several use cases and drives the per-token cost down. The right choice depends on your maturity, not a principle.

Conclusion

Sovereignty does not require going all-open. It requires knowing what leaves your perimeter and where it is processed. A self-hosted open model maximizes control and crushes cost at high volume; a closed API maximizes velocity, raw performance, and elasticity. In between, routing by sensitivity and difficulty offers the best compromise for most companies: it keeps the critical inside, serves volume cheaply, and reserves the premium for the cases that deserve it. Choose per use, measure your real TCO on your real load, and keep control of what truly matters: your data.

