Live TTFT: 228ms

Stop watching your agent type.

A drop-in OpenAI replacement that runs your agents 14× faster. Built by Decart.

Now serving: Kimi K2.6 · DeepSeek V4 (Flash & Pro) · GPT-OSS · Qwen
stream.py
import os

from openai import OpenAI

# Point the standard OpenAI client at the Cogito gateway.
client = OpenAI(
    base_url="https://api.cogito.decart.ai/v1",
    api_key=os.environ["COGITO_API_KEY"],
)

stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)

# Print each content delta as it arrives.
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")

The problem

Your agent spends most of its life waiting.

A single agent task is ~12 model calls — reasoning steps, tool invocations, follow-ups. Each one generates a few hundred tokens. At a typical 70 tok/s, you watch the cursor blink.

Typical inference

70 tok/s

43s

12 calls × ~250 tokens ÷ 70 tok/s ≈ 43 seconds of dead time per task.

Cogito

1000 tok/s

3s

Same 12 calls. ~3 seconds of inference. The rest is your code.

Same task. Same model weights. Fourteen times less waiting.
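
The arithmetic behind those two numbers can be checked in a few lines. This is a back-of-the-envelope sketch, not a benchmark: it counts only token generation time and ignores time to first token, prompt processing, and whatever your tools do between calls. The function name is ours, chosen for illustration.

```python
def inference_seconds(calls: int, tokens_per_call: int, tokens_per_second: float) -> float:
    """Total generation time: tokens produced divided by decode speed."""
    return calls * tokens_per_call / tokens_per_second


# Figures from the comparison above: 12 calls of ~250 tokens each.
baseline = inference_seconds(12, 250, 70)    # typical inference
cogito = inference_seconds(12, 250, 1000)    # Cogito
speedup = baseline / cogito

print(f"{baseline:.1f}s vs {cogito:.1f}s, {speedup:.1f}x")
# prints "42.9s vs 3.0s, 14.3x"
```

Swap in your own call count and average completion length to see what the gap looks like for your agent.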

What 14× looks like

Same prompt. Same model. Cogito's done first.

An agent reasoning + tool call + summary, streamed character-by-character at the actual throughputs. No live API call — the speeds are simulated to keep the page stable.

[Interactive demo: side-by-side streams for Typical inference (70 tok/s) and Cogito (1000 tok/s), each showing elapsed time and measured tok/s.]

Throughputs are an idealized comparison against a 70 tok/s baseline. Real-world Cogito throughput varies by model and prompt; see the per-model TTFT and TPS on the catalog below.
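
If you would rather measure than trust a simulation, TTFT and tok/s can be derived from the arrival times of streamed chunks. The sketch below is one reasonable way to define them, not necessarily how the catalog numbers are computed: TTFT is the delay until the first chunk that carries tokens, and throughput is total tokens over total elapsed time since the request was sent. The `StreamStats` and `stream_stats` names are hypothetical, and counting tokens per chunk is left to the caller.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class StreamStats:
    ttft_s: float         # seconds from request start to first token
    tokens_per_s: float   # total tokens / total elapsed seconds


def stream_stats(request_start: float, events: Iterable[Tuple[float, int]]) -> StreamStats:
    """Summarize a stream from (arrival_time_s, tokens_in_chunk) pairs in arrival order."""
    first = None
    last = None
    total = 0
    for arrived_at, tokens in events:
        if tokens <= 0:
            continue  # ignore chunks that carry no tokens (e.g. role-only deltas)
        if first is None:
            first = arrived_at
        last = arrived_at
        total += tokens
    if first is None:
        raise ValueError("stream produced no tokens")
    elapsed = last - request_start
    tokens_per_s = total / elapsed if elapsed > 0 else float("inf")
    return StreamStats(ttft_s=first - request_start, tokens_per_s=tokens_per_s)


# Illustrative timings only: 50 tokens at 0.25s, 200 more by 0.5s.
stats = stream_stats(0.0, [(0.25, 50), (0.5, 200)])
print(stats)
```

In a real client you would record `time.perf_counter()` before the request and again as each chunk arrives, then pass those timestamps in.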

The stack

Built on DOS.

DOS — the Decart Optimization Stack — is what we've spent the last two years building. It's the multi-silicon serving layer that runs Lucy 2.0, Decart's diffusion model, at sub-50ms on AWS Trainium. Cogito is what happens when we point that stack at the open-source LLMs you already use.

The gateway is OpenAI-compatible at the wire. Underneath, DOS picks the right silicon per model — Trainium for throughput-tuned MoEs, NVIDIA Blackwell for the frontier configurations — and your bytes come back the same way no matter what's on the other end.

Lucy 2.0: sub-50ms · Trainium · Blackwell · 14× typical

Your code: openai-python · openai-node · curl

Cogito gateway: OpenAI-compatible · TLS · ALB

DOS (Decart Optimization Stack): multi-silicon dispatch

AWS Trainium: throughput-tuned · Lucy 2.0 < 50ms

NVIDIA Blackwell: frontier MoE · DeepSeek V4

Catalog

Frontier open weights. Curated, day one.

Kimi K2.6 · DeepSeek V4 (Flash & Pro) · GPT-OSS · Qwen. Hot the moment they ship. Same per-token price for your fine-tuned variants.

Browse all models

OpenAI

GPT-OSS 120B

120B (MoE, ~5B active)

OpenAI's first open-weight model since GPT-2. Mixture-of-experts 120B with strong general reasoning at a price that's hard to beat.

Contact sales

Capacity isn't provisioned yet. Reach out to cogito@decart.ai for access.

DeepSeek

DeepSeek V4 Pro

Frontier MoE

DeepSeek's frontier model. 1M-token context, frontier-class reasoning, and a price tag that makes proprietary alternatives hard to justify.

Contact sales

Capacity isn't provisioned yet. Reach out to cogito@decart.ai for access.

DeepSeek

DeepSeek V4 Flash

Mid-tier MoE

The cheap workhorse with a 1M-token window. Built for high-volume pipelines where the bill matters as much as the answer.

Contact sales

Capacity isn't provisioned yet. Reach out to cogito@decart.ai for access.

Moonshot

Kimi K2.6

1T MoE (~32B active)

Moonshot's flagship MoE. Trained for long-horizon agentic workflows; the model engineering teams reach for when the cheap models stop being enough.

TTFT: 280ms
Context: 256K
Input / 1M: $0.68
Cached / 1M: $0.144
Output / 1M: $3.41
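
To turn those rates into a per-task figure, here is a rough cost sketch using the Kimi K2.6 prices listed above. It assumes cached tokens are a subset of prompt tokens and are billed at the cached rate instead of the input rate, which matches how `prompt_tokens_details.cached_tokens` is usually read but is not confirmed here. The function name and the token counts in the example are illustrative.

```python
# Kimi K2.6 prices from the catalog card, in USD per 1M tokens.
PRICE_PER_M = {"input": 0.68, "cached": 0.144, "output": 3.41}


def request_cost_usd(prompt_tokens: int, cached_tokens: int, completion_tokens: int) -> float:
    """Estimate one request's cost, billing cached prompt tokens at the cached rate (assumption)."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    total = (
        uncached_tokens * PRICE_PER_M["input"]
        + cached_tokens * PRICE_PER_M["cached"]
        + completion_tokens * PRICE_PER_M["output"]
    )
    return total / 1_000_000


# Hypothetical agent task: 12 calls, each with an 8,000-token prompt
# (6,000 of it cached) and a 250-token completion.
task_cost = 12 * request_cost_usd(8_000, 6_000, 250)
print(f"${task_cost:.4f} per task")
```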

Alibaba

Qwen3 235B

235B MoE (~22B active)

Alibaba's open-weight flagship. Strong multilingual coverage and tool use at a price point that holds up against the proprietary mid-tier.

Contact sales

Capacity isn't provisioned yet. Reach out to cogito@decart.ai for access.

Moonshot

Kimi K2.6 Fast

1T MoE (~32B active)

Kimi K2.6 on a low-latency route — the first token comes back faster than the default, in exchange for lower per-route capacity.

Contact sales

Capacity isn't provisioned yet. Reach out to cogito@decart.ai for access.

Drop-in

Change one line. Keep the rest.

The wire format matches OpenAI exactly. SDKs work without changes. Tools, structured outputs, streaming — all spec-compliant.

Change base_url. That's the diff.

The OpenAI Python and Node SDKs both take a base_url override. Point it at us and the rest of your code is untouched.

- base_url="https://api.openai.com/v1"
+ base_url="https://api.cogito.decart.ai/v1"

Byte-for-byte OpenAI spec.

If it works against api.openai.com, it works against us.

  • Streaming SSE with delta.content
  • Function / tool calling, parallel calls
  • JSON-schema structured outputs
  • prompt_tokens_details.cached_tokens
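
For the streaming case, this is roughly what "SSE with delta.content" means on the wire if you are not using an SDK. The sketch assumes the gateway emits the same `data:` lines and `[DONE]` sentinel as OpenAI's chat completions stream, as the compatibility claim implies. The sample lines are made up for illustration, not captured from the API, and `iter_delta_content` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_delta_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield each non-empty delta.content string from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Illustrative stream: a role-only chunk, two content chunks, a usage chunk, then the sentinel.
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Rayleigh "}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"scattering."}}]}',
    "",
    'data: {"choices":[],"usage":{"prompt_tokens":12,"completion_tokens":3}}',
    "",
    "data: [DONE]",
]

answer = "".join(iter_delta_content(sample))
print(answer)
# prints "Rayleigh scattering."
```

The SDKs do this parsing for you; the helper is only useful when you are reading the response body yourself, for example from curl or a raw HTTP client.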

Built for operators.

The infrastructure niceties incumbents skip — set once, then forget about them.

  • Hard spend caps per org & per key
  • Token-aware rate limits (no surprise 429s)
  • Request ID on every response & error
  • Zero retention by default

Built by Decart

We do real-time inference for a living.

Cogito is Decart's LLM inference layer. We've spent the last two years building DOS — the Decart Optimization Stack — to push real-time AI past what generic infrastructure allows. Lucy 2.0, our diffusion model, generates frames at sub-50ms on AWS Trainium. Cogito is what happens when we point that stack at the open-source LLMs you already use.

The underlying engineering wasn't built for LLMs first — it was built for real-time. That's why the gap between us and a typical inference provider isn't a few percent; it's 14×.

Founded: Decart, 2024
Proof point: Lucy 2.0 < 50ms
Silicon: Trainium + Blackwell
Wire format: OpenAI-compatible

Public per-token pricing — no minimums, no commits, no overage charges. Premium speed tier (1000+ tok/s) available on request.

See pricing

Stop watching your agent type.

Cogito, ergo ship.

$5 in free credits. No credit card. Five minutes from sign-up to your first streamed response.