In May 2026 there is no single "deep research" market. There are three: consumer chat-app research for one-off questions, agentic API infrastructure for programmatic enrichment at scale, and academic deep search bound to peer-reviewed indexes. Picking the wrong tier wastes money and burns hours. This is the map.
Consumer chat-app research is triggered from a chat UI and returns one long-form cited report per query. It is best for ad-hoc research where you will read the answer yourself, and it is not built for programmatic enrichment of thousands of rows.
Agentic API infrastructure is programmatic deep research: call it from your agent and get back structured JSON with citations and confidence scores. It is the right layer for Loop 3 strategy research, database enrichment, monitoring, and any workflow where the consumer of the output is software, not a human reading a report.
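Because software consumes the output at this tier, the first thing a pipeline needs is a gate that decides which fields to trust. A minimal sketch, assuming a hypothetical payload shape (`fields`, `confidence`, `citations`); every provider defines its own schema, so treat these field names as placeholders rather than any vendor's API:

```python
import json
from dataclasses import dataclass


@dataclass
class Citation:
    url: str
    excerpt: str


@dataclass
class FieldResult:
    # One enriched field: its value, the provider-reported confidence, and its evidence.
    name: str
    value: str
    confidence: float
    citations: list[Citation]


def parse_result(raw: str) -> list[FieldResult]:
    """Parse a hypothetical payload shaped like
    {"fields": [{"name", "value", "confidence", "citations": [{"url", "excerpt"}]}]}."""
    payload = json.loads(raw)
    results = []
    for field in payload["fields"]:
        cites = [Citation(c["url"], c["excerpt"]) for c in field.get("citations", [])]
        results.append(
            FieldResult(field["name"], field["value"], float(field["confidence"]), cites)
        )
    return results


def accept(fields: list[FieldResult], min_confidence: float = 0.7) -> dict[str, str]:
    """Keep only fields that clear the confidence bar AND cite at least one source."""
    return {
        f.name: f.value
        for f in fields
        if f.confidence >= min_confidence and f.citations
    }


sample = json.dumps({
    "fields": [
        {"name": "founded", "value": "2019", "confidence": 0.92,
         "citations": [{"url": "https://example.com/about", "excerpt": "Founded in 2019"}]},
        {"name": "hq", "value": "Berlin", "confidence": 0.41,
         "citations": [{"url": "https://example.com/team", "excerpt": "Offices in Berlin"}]},
        {"name": "ceo", "value": "unknown", "confidence": 0.95, "citations": []},
    ]
})
print(accept(parse_result(sample)))  # {'founded': '2019'}
```

The 0.7 threshold is arbitrary; the useful design choice is requiring a citation as well as a score, so a confident but unsupported field never reaches the database.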
Exa Fast (sub-350ms p50; billed as the fastest search API available), Exa Auto (general), and Exa Deep (agentic research with field-level citations). Each search includes 10 content retrievals free. Exa is strong on company, people, and code search, and its embedding-based retrieval finds conceptually related content that keyword search misses.
The /research endpoint breaks a question into sub-queries and returns a cited synthesis. It offers the best balance of speed, cost, and predictability for RAG and chatbot use cases. Roadmap uncertainty after the Nebius acquisition is worth monitoring before making a deep platform bet.
o3-deep-research and o4-mini-deep-research are available through the OpenAI API, and synthesis quality is strong. But on BrowseComp they benchmark lower than Parallel Ultra at 6× the cost: you pay the OpenAI tax for brand familiarity and get worse accuracy per dollar. Use them when your stack is already OpenAI-native and integration matters more than benchmarks.
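The decompose, search, synthesize loop behind a /research-style endpoint can be sketched as follows. Everything here is hypothetical: `decompose` is a naive stand-in for the LLM planner such endpoints use, and `search` is any function you plug in, not a real vendor client.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hit:
    url: str
    snippet: str


# Any search backend with this shape can be plugged in (hypothetical interface).
SearchFn = Callable[[str], list[Hit]]


def decompose(question: str, aspects: list[str]) -> list[str]:
    """Naive planner: one sub-query per aspect. Real endpoints use an LLM here."""
    return [f"{question} {aspect}" for aspect in aspects]


def research(question: str, aspects: list[str], search: SearchFn) -> str:
    """Run every sub-query, keep the top hit, and emit a synthesis with numbered citations."""
    sources: list[str] = []
    claims: list[str] = []
    for sub_query in decompose(question, aspects):
        hits = search(sub_query)
        if not hits:
            continue  # no evidence for this aspect, so make no claim about it
        top = hits[0]
        if top.url not in sources:
            sources.append(top.url)
        claims.append(f"{top.snippet} [{sources.index(top.url) + 1}]")
    references = [f"[{i}] {url}" for i, url in enumerate(sources, start=1)]
    return "\n".join(claims + [""] + references)


def fake_search(query: str) -> list[Hit]:
    # Toy index keyed by topic word; stands in for a real web search call.
    index = {
        "pricing": Hit("https://example.com/pricing", "Pricing is per request."),
        "latency": Hit("https://example.com/perf", "Median latency is low."),
    }
    return [hit for topic, hit in index.items() if topic in query]


print(research("vendor X", ["pricing", "latency", "roadmap"], fake_search))
```

Dropping sub-queries that return nothing (here, "roadmap") is what keeps the synthesis cited end to end; a production planner would also merge overlapping hits rather than keep only the top one.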
| Platform | Cost per 1K requests (USD) | BrowseComp accuracy |
|---|---|---|
| Parallel 1200 (Ultra4x) | 1,200 | 48% |
| Parallel 600 (Ultra2x) | 600 | 39% |
| Parallel Ultra | 300 | 27% |
| Parallel Pro | 100 | 17% |
| Exa Research | 275 | 14% |
| Perplexity Deep Research | 880 | 8% |
| Claude Sonnet 4 + search | 1,168 | 6% |
| GPT-4.1 + browsing | 53 | 1% |
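One way to make the cost-efficiency comparison concrete is expected cost per correct answer: the price of one request divided by the probability that it is right. A sketch using only the table's figures; it assumes every request costs the same and that benchmark accuracy carries over to your workload, which is a simplification:

```python
# (cost per 1K requests in USD, BrowseComp accuracy in %), copied from the table above.
PLATFORMS = {
    "Parallel 1200 (Ultra4x)": (1200, 48),
    "Parallel 600 (Ultra2x)": (600, 39),
    "Parallel Ultra": (300, 27),
    "Parallel Pro": (100, 17),
    "Exa Research": (275, 14),
    "Perplexity Deep Research": (880, 8),
    "Claude Sonnet 4 + search": (1168, 6),
    "GPT-4.1 + browsing": (53, 1),
}


def cost_per_correct(cost_per_1k: float, accuracy_pct: float) -> float:
    """Expected USD per correct answer: price of one request divided by P(correct)."""
    return (cost_per_1k / 1000) / (accuracy_pct / 100)


def ranked(platforms: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """Platforms sorted from cheapest to most expensive per correct answer."""
    scores = [(name, cost_per_correct(c, a)) for name, (c, a) in platforms.items()]
    return sorted(scores, key=lambda item: item[1])


for name, usd in ranked(PLATFORMS):
    print(f"{name:<26} ${usd:6.2f} per correct answer")
```

On these numbers Parallel Pro is cheapest per correct answer (about $0.59) and Claude Sonnet 4 + search is the most expensive (about $19.47).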
Academic deep search runs over peer-reviewed papers (mostly Semantic Scholar-backed), not the open web. These tools are not what you want for crypto, markets, or agent research; they are the right tools when you need ML literature reviews, RL papers, or scientific grounding for swarm.ing's R&D.
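Because these tools sit on a scholarly index rather than the open web, it can help to query that index directly when spot-checking a citation. A minimal sketch against Semantic Scholar's public Graph API paper-search endpoint; the helper names are hypothetical and the field list is just an example:

```python
import json
import urllib.parse
import urllib.request

# Semantic Scholar Graph API paper search endpoint.
S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(query: str, limit: int = 10,
                     fields: tuple[str, ...] = ("title", "year", "url")) -> str:
    """Build a search URL that asks only for the listed paper fields."""
    params = {"query": query, "limit": limit, "fields": ",".join(fields)}
    return f"{S2_SEARCH}?{urllib.parse.urlencode(params)}"


def search_papers(query: str, limit: int = 10) -> list[dict]:
    """Fetch matching papers. Network call; unauthenticated use is rate-limited."""
    with urllib.request.urlopen(build_search_url(query, limit), timeout=10) as resp:
        return json.load(resp).get("data", [])


print(build_search_url("offline-to-online reinforcement learning", limit=5))
```

Requesting only the fields you need keeps responses small; `search_papers` is kept separate so the URL logic can be checked without a network call.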
The platforms aren't substitutes; they're job-specific. Here's how I'd map them to swarm.ing workflows.
For a question like "what's the current state of MPC wallet UX in 2026?", you type it, walk away, and come back to a clean cited report. This is the best balance of reliability and effort. Use Claude Research if you want better writing, and Perplexity if you need an answer in 3 minutes and the topic isn't load-bearing.
Programmatic deep research called from YAML pipelines, with structured JSON outputs carrying confidence scores, citations, and excerpts. It is state of the art on BrowseComp, with far better accuracy per dollar than the OpenAI or Anthropic deep-research APIs.
Sub-second latency, RAG-ready, $5–$9 per 1K requests. Use Exa for semantic relevance and Brave for its independent index and low latency. Skip deep-research APIs at this tier; they are too slow and too expensive for live signal work.
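In a live signal loop the practical rule is a hard latency budget: if enrichment cannot return in time, proceed without it. A minimal sketch with stand-in coroutines; the 0.5 s budget and the function names are illustrative assumptions, not vendor figures:

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def with_budget(call: Callable[[], Awaitable[str]], budget_s: float) -> Optional[str]:
    """Return the call's result if it lands inside the latency budget; otherwise
    cancel it and return None so the live path proceeds without web context."""
    try:
        return await asyncio.wait_for(call(), timeout=budget_s)
    except asyncio.TimeoutError:
        return None


async def fast_search() -> str:
    # Stand-in for a sub-second search API call.
    await asyncio.sleep(0.05)
    return "fresh context"


async def deep_research() -> str:
    # Stand-in for a deep-research task that takes far longer than the budget.
    await asyncio.sleep(5)
    return "long report"


async def main() -> None:
    print(await with_budget(fast_search, budget_s=0.5))    # fresh context
    print(await with_budget(deep_research, budget_s=0.5))  # None


asyncio.run(main())
```

`asyncio.wait_for` cancels the slow task on timeout, so an over-budget call cannot keep consuming resources in the background.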
When the sources are internal strategy docs, partner contracts, or trading logs — closed-corpus tools eliminate web hallucination entirely. NotebookLM for multimedia outputs (audio briefings on the road), Claude Projects when synthesis quality matters.
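Restricting the corpus removes open-web sources, but a model can still cite a document that is not there. A cheap guard, sketched under the assumption that answers cite documents by bracketed ID (a made-up convention here, with an invented corpus), is to flag any citation that falls outside the corpus:

```python
import re

# Hypothetical closed corpus: document ID -> text. IDs and contents are invented.
CORPUS = {
    "strategy-2026-q2": "Loop 3 budget is reviewed monthly.",
    "partner-acme": "Acme contract renews in September.",
}

# Assumed citation convention: bracketed document IDs such as [partner-acme].
CITATION = re.compile(r"\[([\w-]+)\]")


def unsupported_citations(answer: str, corpus: dict[str, str]) -> list[str]:
    """Return every cited ID that does not exist in the closed corpus."""
    return [doc_id for doc_id in CITATION.findall(answer) if doc_id not in corpus]


answer = "Acme renews in September [partner-acme]. Budget doubles in 2027 [board-memo]."
print(unsupported_citations(answer, CORPUS))  # ['board-memo']
```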
"Compare every DEX aggregator with sub-1bps slippage on cross-chain swaps": enumerative tasks like this need parallelism, not depth. Manus spins up 100 subagents; Parallel FindAll returns structured entity datasets with per-row enrichment. Either beats sequential deep research for this shape of problem.
When you need to know "what's the SOTA on hierarchical RL with offline-to-online finetuning?", academic deep-research tools are the right layer, not Perplexity. Elicit's structured review process pairs well with your Obsidian research vault workflow.
Purpose-built deep-research APIs decisively beat general-LLM-with-web-search at every price point on BrowseComp. Claude + web_search at $1,168/1K hits 6% accuracy, while Parallel Pro at $100/1K hits 17%. The moat is the orchestration harness, not the LLM; it is the same lesson as Factory Droid. For Loop 3 work, default to Parallel, and fall back to LLM-with-search only when integration friction outweighs cost.
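The "parallelism, not depth" shape looks like this in code: one small task per entity, run concurrently under a cap. A minimal asyncio sketch; `enrich` is a hypothetical placeholder for whatever per-row call you use, the entity names are dummies, and the concurrency limit is an arbitrary assumption:

```python
import asyncio


async def enrich(entity: str) -> dict[str, str]:
    # Hypothetical per-row enrichment call; replace with your real client.
    await asyncio.sleep(0.01)
    return {"entity": entity, "status": "enriched"}


async def enrich_all(entities: list[str], max_concurrency: int = 10) -> list[dict[str, str]]:
    """Run one enrichment task per entity concurrently, capped by a semaphore.
    asyncio.gather returns results in input order, so rows stay aligned."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(entity: str) -> dict[str, str]:
        async with sem:
            return await enrich(entity)

    return list(await asyncio.gather(*(bounded(e) for e in entities)))


rows = asyncio.run(enrich_all(["agg-a", "agg-b", "agg-c", "agg-d"], max_concurrency=2))
print([row["entity"] for row in rows])  # ['agg-a', 'agg-b', 'agg-c', 'agg-d']
```

The semaphore is the knob that maps to a provider's rate limit; total wall-clock time scales with the number of entities divided by the cap, not with the number of entities.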
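"At every price point" is a Pareto-frontier claim: a row only matters if no other row is both cheaper and more accurate. A sketch over the comparison table's figures; the helper names are illustrative. On this data GPT-4.1 + browsing survives only because nothing cheaper than $53 per 1K was measured, while every other general-LLM row is dominated by a Parallel tier.

```python
# (cost per 1K requests in USD, BrowseComp accuracy in %), from the comparison table.
PLATFORMS = {
    "Parallel 1200 (Ultra4x)": (1200, 48),
    "Parallel 600 (Ultra2x)": (600, 39),
    "Parallel Ultra": (300, 27),
    "Parallel Pro": (100, 17),
    "Exa Research": (275, 14),
    "Perplexity Deep Research": (880, 8),
    "Claude Sonnet 4 + search": (1168, 6),
    "GPT-4.1 + browsing": (53, 1),
}


def dominated(name: str, rows: dict[str, tuple[float, float]]) -> bool:
    """True if some other row is at least as cheap AND at least as accurate,
    and strictly better on at least one of the two."""
    cost, acc = rows[name]
    return any(
        c <= cost and a >= acc and (c < cost or a > acc)
        for other, (c, a) in rows.items()
        if other != name
    )


def pareto_frontier(rows: dict[str, tuple[float, float]]) -> list[str]:
    """Rows not dominated by any other row, cheapest first."""
    keep = [name for name in rows if not dominated(name, rows)]
    return sorted(keep, key=lambda name: rows[name][0])


print(pareto_frontier(PLATFORMS))
# ['GPT-4.1 + browsing', 'Parallel Pro', 'Parallel Ultra', 'Parallel 600 (Ultra2x)', 'Parallel 1200 (Ultra4x)']
```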
ChatGPT, Claude, Gemini, and Perplexity all produce roughly the same shape of output (long-form cited report) in roughly the same time. The differentiation is now trust: Perplexity got caught silently downgrading paid queries to cheaper models. Gemini reads your Gmail/Drive. Claude has Skills. ChatGPT has the largest user-habit base. Pick on trust posture, not on raw output quality.
When the source is yours (internal docs, partner agreements, your own research notes), closed-corpus tools eliminate web hallucination entirely. For anything touching Openclaw's proprietary strategy work, this is the only safe option. NotebookLM's audio overviews convert dense research into commute-time briefings, which is genuinely useful for Loop 3 governance review on the go.
Don't ask Perplexity to find SOTA RL papers; ask Elicit or Undermind. Don't ask Elicit about crypto market structure; ask Parallel. The corpora are different (Semantic Scholar vs open web), the citation quality requirements are different, and trying to use general deep-research for academic synthesis produces hallucinated citations every time. Keep these workflows separate in your Obsidian vault.
While everyone else iterates on the "one deep agent, long task" pattern, Manus Wide Research parallelizes (100 subagents on one task) and Genspark uses Mixture-of-Agents (8 LLMs + 80 tools). Both architectures are closer to what Hermes is converging toward than ChatGPT Deep Research is. Worth studying their orchestration patterns, not just their output.
Parallel.ai Task API for Loop 3 programmatic research (you already have this). Exa Fast for Loop 1 sub-second signal enrichment when nv protocol needs web context. Elicit for RL/agent literature reviews into your Obsidian research vault. NotebookLM for synthesizing internal docs into audio briefings. Skip the chat-app deep-research products for production — useful for personal ad-hoc, not load-bearing for the platform.
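The recommended stack reduces to a small routing rule. A sketch with hypothetical types; the 1-second threshold and the order of the checks are assumptions, and Claude Projects would be an equally valid pick for internal docs when synthesis quality matters more than audio output:

```python
from dataclasses import dataclass
from enum import Enum


class Corpus(Enum):
    OPEN_WEB = "open_web"
    ACADEMIC = "academic"
    INTERNAL = "internal"


@dataclass(frozen=True)
class Job:
    corpus: Corpus
    latency_budget_s: float  # how long the caller can wait for an answer
    consumer_is_software: bool  # True if a pipeline, not a person, reads the output


def route(job: Job) -> str:
    """Pick a tool following the stack described above.
    The 1-second threshold and the order of the checks are illustrative assumptions."""
    if job.corpus is Corpus.INTERNAL:
        return "NotebookLM"  # closed corpus: never go to the open web
    if job.corpus is Corpus.ACADEMIC:
        return "Elicit"  # scholarly index, not web search
    if job.latency_budget_s < 1.0:
        return "Exa Fast"  # Loop 1 live signal enrichment
    if job.consumer_is_software:
        return "Parallel Task API"  # Loop 3 programmatic research
    return "chat-app deep research (ad hoc only)"


print(route(Job(Corpus.OPEN_WEB, latency_budget_s=0.3, consumer_is_software=True)))  # Exa Fast
```

Corpus is checked before latency on purpose: an internal-docs question should never fall through to an open-web tool just because it is urgent.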