Merius — open models, served fast on our own GPUs

Model catalogue

Open models, on our own GPUs.

Every model runs on our dedicated B200/B300 fleet at FP8, reachable through one OpenAI-compatible API. Filter by what you’re building. Prices are per 1M tokens, in USD.


Qwen3-30B-A3B

qwen/qwen3-30b-a3b

B200

Fast MoE model (3B active). Coding, reasoning, tool use, 32K context.

Throughput: 350 tok/s

Context: 32K

Input / 1M: $0.14

Output / 1M: $0.50

tools · reasoning · json mode · structured outputs

FP8 · EU · US

GLM-4.5-Air

z-ai/glm-4.5-air

B200

Lightweight GLM MoE (12B active). Reasoning, tool use, 128K context.

Throughput: 325 tok/s

Context: 128K

Input / 1M: $0.20

Output / 1M: $0.85

tools · reasoning · json mode · structured outputs

FP8 · EU · US

Mistral-Small-3.2-24B

mistralai/mistral-small-3.2-24b-instruct

B200

Mistral 24B dense instruct. Tool use, reasoning, 128K context.

Throughput: 179 tok/s

Context: 128K

Input / 1M: $0.10

Output / 1M: $0.30

tools · reasoning · json mode · structured outputs

FP8 · EU · US

Mistral-Nemo

mistralai/mistral-nemo

A100

Mistral 12B dense. Multilingual, tool use, 128K context.

Throughput: 160 tok/s

Context: 128K

Input / 1M: $0.04

Output / 1M: $0.15

tools · json mode

FP8 · EU · US

Qwen3-14B

qwen/qwen3-14b

A100

Qwen 14B dense. Reasoning, tool use, 40K context.

Throughput: 143 tok/s

Context: 40K

Input / 1M: $0.10

Output / 1M: $0.24

tools · reasoning · json mode

FP8 · EU · US

Throughput is our measured tok/s on our fleet, cross-checked on Artificial Analysis.
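
To make "filter by what you're building" concrete, here is a minimal sketch that holds the five cards above as plain data and picks the cheapest model that meets a feature and context requirement. The `Model` class and the `cheapest` helper are illustrative names, not part of any Merius API. The figures are copied from the cards, with context lengths in thousands of tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    slug: str
    context_k: int        # context window in thousands of tokens, as listed
    input_usd: float      # USD per 1M input tokens
    output_usd: float     # USD per 1M output tokens
    features: frozenset   # capability tags shown on the card


FULL = frozenset({"tools", "reasoning", "json mode", "structured outputs"})

CATALOGUE = [
    Model("qwen/qwen3-30b-a3b", 32, 0.14, 0.50, FULL),
    Model("z-ai/glm-4.5-air", 128, 0.20, 0.85, FULL),
    Model("mistralai/mistral-small-3.2-24b-instruct", 128, 0.10, 0.30, FULL),
    Model("mistralai/mistral-nemo", 128, 0.04, 0.15,
          frozenset({"tools", "json mode"})),
    Model("qwen/qwen3-14b", 40, 0.10, 0.24,
          frozenset({"tools", "reasoning", "json mode"})),
]


def cheapest(needs, min_context_k=0):
    """Return the lowest output-priced model that has every tag in `needs`
    and at least `min_context_k` thousand tokens of context, or None."""
    matches = [
        m for m in CATALOGUE
        if set(needs) <= m.features and m.context_k >= min_context_k
    ]
    return min(matches, key=lambda m: m.output_usd, default=None)
```

For example, `cheapest({"structured outputs"}, min_context_k=100)` returns the Mistral-Small-3.2-24B entry. Ranking by output price is one possible choice; a prompt-heavy workload might rank by input price instead.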

Speed

A custom kernel, built for output speed.

We wrote our own inference kernel and run it on hardware we own, so open models serve fast. High output speed is what makes an AI coding workflow feel instant — autocomplete lands without a pause and agents finish in seconds, not minutes.

350 tok/s peak, measured on Qwen3-30B-A3B · FP8

Measured throughput, per model

Qwen3-30B-A3B 350 tok/s

GLM-4.5-Air 325 tok/s

Mistral-Small-3.2-24B 179 tok/s

Mistral-Nemo 160 tok/s

Qwen3-14B 143 tok/s

Throughput at FP8, measured on our fleet.
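
As a rough illustration of what these figures mean for a coding workflow, the sketch below converts throughput into the time needed to stream a completion of a given length. It assumes a steady decode rate and ignores time to first token, network latency, and prompt processing, so treat the results as a lower bound. The function name is illustrative.

```python
# Measured throughput per model, in output tokens per second (from the list above).
THROUGHPUT = {
    "Qwen3-30B-A3B": 350,
    "GLM-4.5-Air": 325,
    "Mistral-Small-3.2-24B": 179,
    "Mistral-Nemo": 160,
    "Qwen3-14B": 143,
}


def generation_seconds(output_tokens, tok_per_s):
    """Seconds to stream `output_tokens` at a steady decode rate."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return output_tokens / tok_per_s


for name, rate in THROUGHPUT.items():
    print(f"{name}: {generation_seconds(2000, rate):.1f} s for 2,000 tokens")
```

Under these assumptions a 2,000-token completion streams in about 5.7 s at 350 tok/s and about 14 s at 143 tok/s.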

OpenAI-compatible

Already on the OpenAI SDK? Change one line.

Point your existing OpenAI client at the Merius base URL, keep your request shape, your streaming code, and your tools. No new SDK, no rewrite — just a faster, cheaper endpoint.


python

from openai import OpenAI

client = OpenAI(
    base_url="https://api.merius.ai/v1",      # the one line that changes
    api_key="sk-merius-…",
)

response = client.chat.completions.create(
    model="qwen/qwen3-30b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one line."},
    ],
    stream=True,
)

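for chunk in response:
    # Some streams include chunks with no choices (e.g. a final usage chunk); skip them.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)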

Why Merius

What you get that resellers can’t offer.

Our own hardware

We own and operate our B200/B300 fleet — we don’t resell a hyperscaler’s cloud. That means predictable performance and pricing we control, not someone else’s margin.

EU + US data residency

Datacenters in the EU and US, GDPR-compliant. Choose EU-only routing so requests are processed and kept in EU datacenters. A signed DPA is available.

OpenAI drop-in

Fully OpenAI-compatible. Point your existing client at our base URL. Works with Cursor, Cline, Claude Code, and Continue because they speak the same API.

Honest backpressure

Dedicated capacity per model. Under load we return a clean 429 in under a second so you can retry — we never silently queue you into a timeout.

Day-zero open models

We bring up new open models on our fleet as they ship, at FP8, with tool use and structured outputs where the model supports them.

Zero data retention

We do not store your prompts or completions, and we do not train on them. Every request is encrypted with TLS in transit.

Dedicated hosting

Reserve a private endpoint for production.

Serverless is the fast way to start. When you move to production, reserve GPUs on our own fleet: a single-tenant endpoint with committed latency and throughput, your model, in the region you choose. We bring it up and tune it to your workload — talk to us and we’ll size it.

Talk to us. We’ll size and bring up your endpoint, usually within a day.

Reserved, single-tenant GPUs

Your endpoint runs on GPUs we hold for you alone — committed latency and throughput, steady no matter how busy the platform is.

Your region, your model, your data

Pick EU or US, run a catalogue model or your own open weights. GDPR-compliant, signed DPA, no prompt or completion retention.

FAQ

Answers for developers.

How do I switch from OpenAI?

Change your client’s base URL to our endpoint and use your Merius key. The request and response shapes are the same, so no other code changes — set the base URL, set the key, pick a model slug.

Which models do you serve?

Open models on our own GPUs — Qwen3-30B-A3B, GLM-4.5-Air, Mistral-Small-3.2-24B, Mistral-Nemo, and Qwen3-14B today, with new open releases added as they ship. The models table above lists each one with its price, context length, and throughput.

How does pricing work?

Per token, in USD, billed only for what you use — no minimums, no seats, no idle charges on serverless. Cached input is billed at a lower rate automatically. Prices are shown per 1M tokens in the table above.
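
The per-token arithmetic can be sketched as follows. Prices are the per-1M figures from the catalogue. The cached-input rate is not listed on this page, so the sketch bills all input at the list price and therefore gives an upper bound. `request_cost_usd` is an illustrative helper, not a Merius API.

```python
# slug: (input USD per 1M tokens, output USD per 1M tokens), from the catalogue
PRICES = {
    "qwen/qwen3-30b-a3b": (0.14, 0.50),
    "z-ai/glm-4.5-air": (0.20, 0.85),
    "mistralai/mistral-small-3.2-24b-instruct": (0.10, 0.30),
    "mistralai/mistral-nemo": (0.04, 0.15),
    "qwen/qwen3-14b": (0.10, 0.24),
}


def request_cost_usd(slug, input_tokens, output_tokens):
    """Cost of one request in USD, with no cached-input discount applied."""
    input_price, output_price = PRICES[slug]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# An 8,000-token prompt with a 1,000-token completion on Qwen3-30B-A3B:
# (8,000 * 0.14 + 1,000 * 0.50) / 1,000,000 = 0.00162 USD
print(f"{request_cost_usd('qwen/qwen3-30b-a3b', 8_000, 1_000):.5f} USD")
```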

Is it really OpenAI-compatible?

Yes. The /v1/chat/completions and /v1/models endpoints follow the OpenAI schema, including streaming, tool calls, and structured outputs. Any OpenAI client or SDK works by pointing it at our base URL.
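
To show what following the OpenAI schema looks like without any SDK, here is a minimal standard-library sketch that builds a tool-calling request and reads tool calls back out of a response. The `tools` and `tool_calls` shapes are the standard OpenAI chat completions format. The `get_weather` tool is hypothetical, and the Bearer authorization header is assumed from the OpenAI convention rather than confirmed on this page.

```python
import json
import urllib.request

BASE_URL = "https://api.merius.ai/v1"  # base URL from the example above

WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_request(model, user_text, api_key):
    """Build (but do not send) a chat completion request offering one tool."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "tools": [WEATHER_TOOL],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_tool_calls(response):
    """Return (name, parsed arguments) for each tool call in a response dict."""
    message = response["choices"][0]["message"]
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in message.get("tool_calls") or []
    ]
```

To send the request, pass it to `urllib.request.urlopen` and decode the body with `json.load`, then hand the resulting dict to `extract_tool_calls`. Tool arguments arrive as a JSON string in the OpenAI schema, which is why they are parsed separately.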

What about rate limits and uptime?

Each model has dedicated capacity. When a model is briefly saturated we return a clean 429 in under a second so you can retry — we don’t queue you into a timeout. Live status and uptime history are on our status page.
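
Because a 429 comes back quickly, a client can retry with exponential backoff. The sketch below is one way to do that and is not a prescribed Merius policy: the attempt count and delays are arbitrary defaults, and `RateLimited` is a stand-in exception (with the OpenAI Python SDK you would catch `openai.RateLimitError` instead).

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for an HTTP 429 response."""


def with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=8.0,
                 sleep=time.sleep):
    """Run `call`, retrying on RateLimited with exponential backoff and jitter.

    `sleep` is injectable so the helper can be tested without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the 429 to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so clients do not return in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
```

Wrap any request in a zero-argument callable, for example `with_backoff(lambda: send(request))`, where `send` is your own function that raises `RateLimited` on a 429. The OpenAI Python SDK also retries 429s on its own, controlled by the client's `max_retries` option, so a wrapper like this matters mostly for custom HTTP clients.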

Where is my data processed?

In EU and US datacenters, GDPR-compliant, with EU-only routing available. We don’t store your prompts or completions and we don’t train on them. Traffic is encrypted with TLS in transit.

Can I get guaranteed throughput?

Yes — reserve a dedicated endpoint with committed latency and throughput for production traffic. See the Dedicated hosting section above, or contact us.

Do you offer models you don’t host yourselves?

For models outside our fleet we can route to upstream providers over the same OpenAI-compatible API, so you keep one integration. Self-hosted models run on our own GPUs; routed models are billed at their listed price.
