Stop watching your agent type.
A drop-in OpenAI replacement that runs your agents 14× faster. Built by Decart.
import os

from openai import OpenAI

# Point the stock OpenAI client at the Cogito gateway.
client = OpenAI(
    base_url="https://api.cogito.decart.ai/v1",
    api_key=os.environ["COGITO_API_KEY"],
)

# Request a streamed completion so tokens print as they arrive.
stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")

The problem
Your agent spends most of its life waiting.
A single agent task is ~12 model calls — reasoning steps, tool invocations, follow-ups. Each one generates a few hundred tokens. At a typical 70 tok/s, you watch the cursor blink.
Typical inference
70 tok/s
43s
12 calls × ~250 tokens ÷ 70 tok/s ≈ 43 seconds of dead time per task.
Cogito
1000 tok/s
3s
Same 12 calls. ~3 seconds of inference. The rest is your code.
Same task. Same model weights. Fourteen times less waiting.
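The arithmetic behind both numbers is total generated tokens divided by decode throughput. A minimal sketch, using the page's own estimates (12 calls, ~250 tokens each); the function and constant names are illustrative, not part of any Cogito API:

```python
def inference_seconds(calls: int, tokens_per_call: int, tokens_per_second: float) -> float:
    """Total generation time: tokens produced divided by decode throughput."""
    return calls * tokens_per_call / tokens_per_second


CALLS = 12             # model calls per agent task (the page's estimate)
TOKENS_PER_CALL = 250  # "~250 tokens" per call (the page's estimate)

baseline = inference_seconds(CALLS, TOKENS_PER_CALL, 70)
cogito = inference_seconds(CALLS, TOKENS_PER_CALL, 1000)

print(f"baseline: {baseline:.1f}s")           # 42.9s
print(f"cogito:   {cogito:.1f}s")             # 3.0s
print(f"speedup:  {baseline / cogito:.1f}x")  # 14.3x
```

This counts generation time only. Time to first token, network round trips, and your own tool execution come on top in both cases.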
What 14× looks like
Same prompt. Same model. Cogito's done first.
An agent reasoning + tool call + summary, streamed character-by-character at the actual throughputs. No live API call — the speeds are simulated to keep the page stable.
[Interactive demo: side-by-side streams, Typical inference at 70 tok/s and Cogito at 1000 tok/s, each with a live elapsed-time and measured tok/s readout.]
Throughputs are an idealized comparison against a 70 tok/s baseline. Real-world Cogito throughput varies by model and prompt; see the per-model TTFT and TPS on the catalog below.
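To check throughput on your own prompts rather than the idealized figures, you can time a stream yourself. A rough sketch, assuming you feed it the text pieces of a streamed response; `measure_stream` is a hypothetical helper, and it counts chunks as a stand-in for tokens, which is only an approximation:

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume text chunks and report time to first chunk and overall rate.

    Chunks approximate tokens here; one streamed chunk is not guaranteed
    to be exactly one token.
    """
    start = clock()
    first: Optional[float] = None
    count = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty deltas (role-only or final chunks)
        if first is None:
            first = clock()  # first visible output
        count += 1
    end = clock()
    elapsed = end - start
    return {
        "ttft_s": (first - start) if first is not None else None,
        "elapsed_s": elapsed,
        "chunks": count,
        "chunks_per_s": count / elapsed if elapsed > 0 else 0.0,
    }


# Stand-in data; with the SDK you would pass the streamed delta texts instead.
stats = measure_stream(["The", " sky", "", " is", " blue"])
print(stats)
```

With the OpenAI Python SDK, the input could be a generator such as `(c.choices[0].delta.content or "" for c in stream if c.choices)`.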
The stack
Built on DOS.
DOS — the Decart Optimization Stack — is what we've spent the last two years building. It's the multi-silicon serving layer that runs Lucy 2.0, Decart's diffusion model, at sub-50ms on AWS Trainium. Cogito is what happens when we point that stack at the open-source LLMs you already use.
The gateway is OpenAI-compatible at the wire. Underneath, DOS picks the right silicon per model — Trainium for throughput-tuned MoEs, NVIDIA Blackwell for the frontier configurations — and your bytes come back the same way no matter what's on the other end.
- Your code: openai-python · openai-node · curl
- Cogito gateway: OpenAI-compatible · TLS · ALB
- DOS (Decart Optimization Stack): multi-silicon dispatch
- AWS Trainium: throughput-tuned · Lucy 2.0 < 50ms
- NVIDIA Blackwell: frontier MoE · DeepSeek V4
Catalog
Frontier open weights. Curated, day one.
Kimi K2.6 · DeepSeek V4 (Flash & Pro) · GPT-OSS · Qwen. Hot the moment they ship. Same per-token price for your fine-tuned variants.
OpenAI
GPT-OSS 120B
OpenAI's first open-weight model since GPT-2. Mixture-of-experts 120B with strong general reasoning at a price that's hard to beat.
Capacity isn't provisioned yet. Reach out to cogito@decart.ai for access.
DeepSeek
DeepSeek V4 Pro
DeepSeek's frontier model. 1M-token context, frontier-class reasoning, and a price tag that makes proprietary alternatives hard to justify.
Capacity isn't provisioned yet. Reach out to cogito@decart.ai for access.
DeepSeek
DeepSeek V4 Flash
The cheap workhorse with a 1M-token window. Built for high-volume pipelines where the bill matters as much as the answer.
Capacity isn't provisioned yet. Reach out to cogito@decart.ai for access.
Moonshot
Kimi K2.6
Moonshot's flagship MoE. Trained for long-horizon agentic workflows; the model engineering teams reach for when the cheap models stop being enough.
- TTFT: 280ms
- Context: 256K
- Input / 1M tokens: $0.68
- Cached / 1M tokens: $0.144
- Output / 1M tokens: $3.41
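Per-token pricing turns into a per-request cost with a few multiplications. A minimal sketch using the Kimi K2.6 prices listed above; it assumes cached tokens are a subset of prompt tokens and are billed at the cached rate instead of the input rate (the usual OpenAI-style accounting, not confirmed here), and the token counts in the example are made up:

```python
# Kimi K2.6 list prices from the catalog, in USD per 1M tokens.
INPUT_PER_M = 0.68
CACHED_PER_M = 0.144
OUTPUT_PER_M = 3.41


def request_cost(prompt_tokens: int, cached_tokens: int, completion_tokens: int) -> float:
    """Estimated USD cost of one request.

    Assumption: cached_tokens is counted inside prompt_tokens and billed
    at the cached rate rather than the full input rate.
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    total = (uncached * INPUT_PER_M
             + cached_tokens * CACHED_PER_M
             + completion_tokens * OUTPUT_PER_M)
    return total / 1_000_000


# Hypothetical agent task: 12 calls, each with a 2,000-token prompt
# (1,500 of it cached) and 250 output tokens.
per_call = request_cost(prompt_tokens=2000, cached_tokens=1500, completion_tokens=250)
print(f"per call: ${per_call:.6f}")
print(f"per task: ${12 * per_call:.4f}")  # $0.0169
```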
Alibaba
Qwen3 235B
Alibaba's open-weight flagship. Strong multilingual coverage and tool use at a price point that holds up against the proprietary mid-tier.
Capacity isn't provisioned yet. Reach out to cogito@decart.ai for access.
Moonshot
Kimi K2.6 Fast
Kimi K2.6 on a low-latency route — the first token comes back faster than the default, in exchange for lower per-route capacity.
Capacity isn't provisioned yet. Reach out to cogito@decart.ai for access.
Drop-in
Change one line. Keep the rest.
The wire format matches OpenAI exactly. SDKs work without changes. Tools, structured outputs, streaming — all spec-compliant.
Change base_url. That's the diff.
The OpenAI Python and Node SDKs both take a base_url override. Point it at us and the rest of your code is untouched.
- base_url="https://api.openai.com/v1"
+ base_url="https://api.cogito.decart.ai/v1"
Byte-for-byte OpenAI spec.
If it works against api.openai.com, it works against us.
- Streaming SSE with delta.content
- Function / tool calling, parallel calls
- JSON-schema structured outputs
- prompt_tokens_details.cached_tokens
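The cached-token count lives in the `usage` object of a chat-completions response, under `prompt_tokens_details.cached_tokens`. A small sketch of reading it, assuming the usage payload has already been converted to a plain dict; `cache_hit_ratio` is a hypothetical helper and the example values are invented:

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of prompt tokens served from the prompt cache (0.0 if unknown)."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    # The details object may be missing or null; treat both as "no cache data".
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    if prompt_tokens <= 0:
        return 0.0
    return cached / prompt_tokens


# Example usage payload in the OpenAI chat-completions shape (values are made up).
usage = {
    "prompt_tokens": 2000,
    "completion_tokens": 250,
    "total_tokens": 2250,
    "prompt_tokens_details": {"cached_tokens": 1500},
}
print(f"cache hit ratio: {cache_hit_ratio(usage):.0%}")  # 75%
```

The OpenAI Python SDK returns pydantic models, so `response.usage.model_dump()` should produce a dict of this shape.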
Built for operators.
The infrastructure niceties incumbents skip — set once, then forget about them.
- Hard spend caps per org & per key
- Token-aware rate limits (no surprise 429s)
- Request ID on every response & error
- Zero retention by default
Built by Decart
We do real-time inference for a living.
Cogito is Decart's LLM inference layer. We've spent the last two years building DOS — the Decart Optimization Stack — to push real-time AI past what generic infrastructure allows. Lucy 2.0, our diffusion model, generates frames at sub-50ms on AWS Trainium. Cogito is what happens when we point that stack at the open-source LLMs you already use.
The underlying engineering wasn't built for LLMs first — it was built for real-time. That's why the gap between us and a typical inference provider isn't a few percent; it's 14×.
- Founded: Decart, 2024
- Proof point: Lucy 2.0 < 50ms
- Silicon: Trainium + Blackwell
- Wire format: OpenAI-compatible
Public per-token pricing — no minimums, no commits, no overage charges. Premium speed tier (1000+ tok/s) available on request.
See pricing
Stop watching your agent type.
Cogito, ergo ship.
$5 in free credits. No credit card. Five minutes from sign-up to your first streamed response.