Qwen3-30B-A3B
qwen/qwen3-30b-a3b
Fast MoE model (3B active). Coding, reasoning, tool use, 32K context.
OpenAI-compatible · EU + US
A custom inference kernel on our own B200/B300 fleet: up to 350 tok/s on open models, over one OpenAI-compatible API. Transparent per-token pricing, EU and US datacenters.
Model catalogue
Every model runs on our dedicated B200/B300 fleet at FP8, reachable through one OpenAI-compatible API. Filter by what you’re building. Prices are per 1M tokens, in USD.
qwen/qwen3-30b-a3b
Fast MoE model (3B active). Coding, reasoning, tool use, 32K context.
z-ai/glm-4.5-air
Lightweight GLM MoE (12B active). Reasoning, tool use, 131K context.
mistralai/mistral-small-3.2-24b-instruct
Mistral 24B dense instruct. Tool use, reasoning, 128K context.
mistralai/mistral-nemo
Mistral 12B dense. Multilingual, tool use, 128K context.
qwen/qwen3-14b
Qwen 14B dense. Reasoning, tool use, 40K context.
Throughput is our measured tok/s on our fleet, cross-checked on Artificial Analysis.
Speed
We wrote our own inference kernel and run it on hardware we own, which is how we serve open models fast. High output speed is what makes an AI coding workflow feel instant: autocomplete lands without a pause and agents finish in seconds, not minutes.
Measured throughput, per model
Throughput at FP8, measured on our fleet.
OpenAI-compatible
Point your existing OpenAI client at the Merius base URL, keep your request shape, your streaming code, and your tools. No new SDK, no rewrite — just a faster, cheaper endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.merius.ai/v1",  # the one line that changes
    api_key="sk-merius-…",
)

response = client.chat.completions.create(
    model="qwen/qwen3-30b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one line."},
    ],
    stream=True,
)

# Print text as it streams in; skip chunks that carry no choices or no text.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
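Because the request shape is unchanged, tool definitions can be passed in the standard OpenAI `tools` format. The sketch below only builds the request arguments, so it runs without a network call. The helper `build_tool_request` and the `get_weather` tool are hypothetical names used for illustration, and whether a tool call comes back depends on the model supporting tool use.

```python
import json


def build_tool_request(model: str, question: str) -> dict:
    """Build keyword arguments for client.chat.completions.create().

    get_weather is a hypothetical tool, defined here only for illustration.
    """
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [weather_tool],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }


request = build_tool_request("qwen/qwen3-30b-a3b", "What is the weather in Berlin?")
print(json.dumps(request, indent=2))
# With a configured client: response = client.chat.completions.create(**request)
```

If the model decides to call the tool, the OpenAI schema returns the call in `response.choices[0].message.tool_calls`; your code runs the function and sends the result back in a follow-up message with role `tool`.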
Why Merius
We own and operate our B200/B300 fleet — we don’t resell a hyperscaler’s cloud. That means predictable performance and pricing we control, not someone else’s margin.
Datacenters in the EU and US, GDPR-compliant. Choose EU-only routing so requests are processed and kept in EU datacenters. A signed DPA is available.
Fully OpenAI-compatible. Point your existing client at our base URL. Works with Cursor, Cline, Claude Code, and Continue because they speak the same API.
Dedicated capacity per model. Under load we return a clean 429 in under a second so you can retry — we never silently queue you into a timeout.
We bring up new open models on our fleet as they ship, at FP8, with tool use and structured outputs where the model supports them.
We do not store your prompts or completions, and we do not train on them. Every request is encrypted with TLS in transit.
Serverless is the fast way to start. When you move to production, reserve GPUs on our own fleet: a single-tenant endpoint with committed latency and throughput, your model, in the region you choose. We bring it up and tune it to your workload — talk to us and we’ll size it.
Talk to us
We’ll size and bring up your endpoint, usually within a day.
Your endpoint runs on GPUs we hold for you alone — committed latency and throughput, steady no matter how busy the platform is.
Pick EU or US, run a catalogue model or your own open weights. GDPR-compliant, signed DPA, no prompt or completion retention.
FAQ
Change your client’s base URL to our endpoint and use your Merius key. The request and response shapes are the same, so no other code changes — set the base URL, set the key, pick a model slug.
Open models on our own GPUs — Qwen3-30B-A3B, GLM-4.5-Air, Mistral-Small-3.2-24B, Mistral-Nemo, and Qwen3-14B today, with new open releases added as they ship. The models table above lists each one with its price, context length, and throughput.
Per token, in USD, billed only for what you use — no minimums, no seats, no idle charges on serverless. Cached input is billed at a lower rate automatically. Prices are shown per 1M tokens in the table above.
Yes. The /v1/chat/completions and /v1/models endpoints follow the OpenAI schema, including streaming, tool calls, and structured outputs. Any OpenAI client or SDK works by pointing it at our base URL.
Each model has dedicated capacity. When a model is briefly saturated we return a clean 429 in under a second so you can retry — we don’t queue you into a timeout. Live status and uptime history are on our status page.
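Since a saturated model answers with a fast 429 rather than a long wait, the client can retry with a short backoff. The helper below is a minimal sketch, not a Merius or OpenAI API: `call_with_retry` and its parameters are hypothetical, and it assumes the raised exception exposes a `status_code` attribute, as the OpenAI Python SDK's `RateLimitError` does.

```python
import random
import time


def call_with_retry(call, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Run call() and retry with exponential backoff when it fails with HTTP 429.

    Any other error, or a 429 on the final attempt, is re-raised unchanged.
    `sleep` is injectable so the helper can be tested without waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            is_rate_limit = getattr(exc, "status_code", None) == 429
            if not is_rate_limit or attempt == max_attempts:
                raise
            # Delay doubles each attempt (0.5 s, 1 s, 2 s, ...) up to max_delay,
            # plus random jitter so many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))


# Usage with a configured client (not executed here):
# response = call_with_retry(
#     lambda: client.chat.completions.create(model="qwen/qwen3-30b-a3b", messages=messages)
# )
```

The OpenAI Python client also retries some failures on its own, controlled by its `max_retries` option, so an explicit wrapper like this is mainly useful when you want control over the delays or the number of attempts.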
In EU and US datacenters, GDPR-compliant, with EU-only routing available. We don’t store your prompts or completions and we don’t train on them. Traffic is encrypted with TLS in transit.
Yes — reserve a dedicated endpoint with committed latency and throughput for production traffic. See the Dedicated hosting section above, or contact us.
For models outside our fleet we can route to upstream providers over the same OpenAI-compatible API, so you keep one integration. Self-hosted models run on our own GPUs; routed models are billed at their listed price.