🤖 Ghostwritten by Claude · Curated by Tom Hundley
This article was written by Claude and curated for publication by Tom Hundley.
ChatGPT for research. Claude for code. Grok for real-time data. Gemini for everything else. After testing every major AI tool in production, here's the definitive breakdown of what actually works, and what doesn't.
If you're paying for maximum subscriptions across every major AI platform (yes, all of them), you've probably noticed something: no single model wins at everything. The frontier has fragmented. Each lab has found its niche, and knowing which tool to reach for, and when, has become a genuine skill.
This isn't a benchmark comparison. Benchmarks are useful until they aren't. What follows is a practitioner's perspective, informed by daily usage across production work, combined with community sentiment from developers, researchers, and industry voices who are running into the same decisions.
The TL;DR: Claude dominates coding. ChatGPT excels at research and strategic thinking. Grok owns real-time social data. Gemini 3 might be the best all-rounder we've seen. And the browser wars? They've gotten interesting.
Let's start where there's the least debate.
Claude Opus 4.5 is, by community consensus, the best AI model for software engineering available today. This isn't opinion: it's the first model to score above 80% on SWE-bench Verified, the gold standard for real-world coding tasks. On Anthropic's internal engineering test, the same one given to prospective hires, Opus 4.5 scored higher than any human candidate ever.
The praise has been unusually consistent. McKay Wrigley calls Claude Code + Opus 4.5 the best AI coding tool in the world. Simon Willison spent two days with early access, shipping an alpha release of sqlite-utils that included 20 commits, 39 files changed, 2,022 additions, and 1,173 deletions, with Opus doing most of the work.
A LessWrong analysis captured the sentiment: "No model since GPT-4 has come close to the level of universal praise that I have seen for Claude Opus 4.5."
Claude's strength is depth, not speed. For quick code completions while you're typing, you might prefer something faster. But when the task requires thinking through architecture, tradeoffs, and implications? Claude is the answer.
Pricing: $5 per million input tokens, $25 per million output tokens, down from $15/$75 for the previous Opus. Still premium, but the quality justifies it for serious engineering work.
OpenAI's ChatGPT remains the most polished consumer AI product. For research, strategic thinking, and general-purpose intelligence, it's hard to beat. GPT-5.2 dropped December 11, 2025, three weeks ahead of schedule, as OpenAI responded to Gemini 3's launch.
ChatGPT excels at synthesis. When you need to think through a complex business problem, explore strategic options, or research a topic comprehensively, the o1 reasoning models and GPT-5.2 deliver. The chain-of-thought mechanism produces more accurate, in-depth answers for complex reasoning tasks.
A March 2025 update made GPT-4o feel more intuitive, creative, and collaborative, following instructions more accurately. For multimodality and real-time interaction, 4o remains the most capable option.
Here's where it gets honest: ChatGPT's tool integration is dismal.
Jason Calacanis from the All-In podcast has voiced the same complaint: ChatGPT has become too careful about data. Ask for specific numbers, market data, or quantitative analysis, and you'll often hit walls. OpenAI's hedging on data-sensitive responses has made it harder to pull actionable numbers from conversations.
The lack of native MCP (Model Context Protocol) support until October 2025 was embarrassing for a company positioning itself as the AI leader. Even now, OpenAI warns developers that its MCP feature is powerful but dangerous.
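For context, here is a hedged sketch of what that integration looks like: attaching a remote MCP server to a request through OpenAI's Python SDK. The model id and server URL are placeholders (not a recommendation of a specific setup), and field details may shift as the feature matures.

```python
# Hedged sketch: wiring a remote MCP server into an OpenAI Responses API call.
# The model id and server URL below are placeholders; substitute your own.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.2",  # placeholder model id
    input="Summarize the open issues tagged 'bug' in our tracker.",
    tools=[
        {
            "type": "mcp",                            # remote MCP server tool
            "server_label": "issue_tracker",          # label shown in tool-call output
            "server_url": "https://example.com/mcp",  # hypothetical MCP endpoint
            "require_approval": "never",              # or "always" to gate each call
        }
    ],
)

print(response.output_text)
```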
800 million weekly active users can't all be wrong. But for coding or real-time data? Look elsewhere.
xAI's Grok is the sleeper in this lineup. It doesn't get the press coverage of ChatGPT or Claude, but it has genuine strengths that the others can't match.
Grok doesn't waste your time. Where ChatGPT often buries the answer in caveats and pleasantries, Grok gets to the point. This isn't just personality; it's a design choice that makes it genuinely more useful for certain workflows. Technical discussions, therapy-style reflection, direct feedback: Grok handles all of it without the protective padding that makes other models feel like they're managing your emotions.
Real-time social data is Grok's killer feature. With direct access to X (Twitter) data, Grok provides real-time sentiment analysis, trending topics, and social context that no other model can match. If you want to know what people are saying about something right now, Grok rules.
According to independent reviews, Grok 4.1 jumped from #33 to the top three on LMArena in a single release, a 30-position leap that is unprecedented in the space. The model achieved a 4.22% hallucination rate (down from 12%, roughly a 65% reduction) and ranks #1 on EQ-Bench3.
Here's the hard truth: xAI is not leading in any tooling category. No official IDE integration. No MCP support. No A2A (Agent-to-Agent) protocol adoption. The developer experience is lagging badly.
My theory: xAI has its sights set on robotics and embodied AI. The integration of Grok into Tesla's Optimus isn't just a product feature; it's the strategic direction. Elon Musk has publicly stated that 80% of Tesla's future value will come from Optimus and related AI businesses.
SpaceX's $2 billion investment in xAI is building toward a cohesive AI-driven platform spanning aerospace, telecommunications, automotive, and robotics. Chat interfaces might not be the priority.
For coding? Community members note that older Claude Sonnet variants or GPT family models produce cleaner, more robust code.
Google's Gemini 3 is the surprise of 2025. After years of playing catch-up, Google delivered something genuinely competitive, arguably leading, across multiple dimensions.
Gemini 3 Pro achieved the first-ever 1500+ Elo score on LMArena. It tops the WebDev Arena leaderboard with an impressive 1487 Elo. On MMMU-Pro (multimodal understanding), it scores 81.0% versus Claude's 72.4% and GPT-5.2's 68.9%. The ScreenSpot-Pro benchmark, which tests screenshot understanding, shows Gemini at 72.7% and GPT at 3.6%.
That's a 20x gap on visual understanding.
Search integration is Gemini's structural advantage. The same way Grok owns social data through X, Gemini owns web search through Google. For research requiring current information, this integration matters. Deep Research with Gemini 3 Pro is, according to Google, its most factual model yet, trained to minimize hallucinations.
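As a rough illustration of that integration, here is a hedged sketch of a search-grounded Gemini request using the google-genai Python SDK. The model id is a placeholder, and the prompt is simply an example of a query that benefits from live results.

```python
# Hedged sketch: grounding a Gemini request with Google Search via the google-genai SDK.
# The model id below is a placeholder; use whichever Gemini tier you actually run.
from google import genai
from google.genai import types

client = genai.Client()  # reads the Gemini API key from the environment

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # placeholder model id
    contents="What did reviewers say about the latest Chrome AI update this week?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],  # enable live search grounding
    ),
)

print(response.text)
```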
Google Antigravity, the $2.4 billion Windsurf acquisition turned product, is Google's answer to Cursor and Claude Code. Early testing shows promising results: 35% higher accuracy in resolving software engineering challenges versus Gemini 2.5 Pro.
The Manager View—mission control for orchestrating multiple agents—represents a genuinely new paradigm for AI-assisted development.
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Gemini 3 Pro | $2.00 | $12.00 |
| Claude Opus 4.5 | $5.00 | $25.00 |
| GPT-5.2 | $5.00 | $15.00 |
With input tokens priced 60% below Opus and GPT-5.2, and output tokens roughly half the cost of Opus, Gemini 3 is the economical choice for comparable reasoning at scale.
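To make that concrete, here is a back-of-envelope comparison using the published rates above. The workload mix (50k input tokens and 10k output tokens per task) is a hypothetical assumption, so rerun it with your own numbers.

```python
# Back-of-envelope cost comparison from the per-million-token rates in the table above.
# The workload mix is an assumption: 50k input tokens and 10k output tokens per task.
PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "Gemini 3 Pro":    (2.00, 12.00),
    "Claude Opus 4.5": (5.00, 25.00),
    "GPT-5.2":         (5.00, 15.00),
}

INPUT_TOKENS, OUTPUT_TOKENS = 50_000, 10_000

for model, (in_rate, out_rate) in PRICES.items():
    cost = INPUT_TOKENS / 1e6 * in_rate + OUTPUT_TOKENS / 1e6 * out_rate
    print(f"{model:<16} ${cost:.2f} per task")

# Gemini 3 Pro     $0.22 per task
# Claude Opus 4.5  $0.50 per task
# GPT-5.2          $0.40 per task
```

On that mix, Gemini comes in at less than half the cost of Opus and roughly 45% below GPT-5.2; the exact gap depends on how output-heavy your workload is.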
The $250/month Ultra tier has been criticized as expensive, but the standard pricing is competitive.
The AI coding assistant space has bifurcated into two philosophies: Cursor's speed-first IDE integration and Claude Code's depth-first terminal approach.
Cursor's proprietary Composer 1 model is fast. In head-to-head testing, Composer 1 matched or even slightly outperformed Sonnet 4.5 on overall coding quality in less than half the time and with far fewer tokens.
One developer ran a comparison: Cursor built an entire application in 2 minutes 26 seconds. Claude Code took significantly longer.
But speed isn't everything.
Claude Code's approach favors thoughtfulness over speed. It pauses to strategize where Cursor is too eager to implement without full understanding. For complex refactoring, multi-file changes, and architectural decisions, Claude Code's deliberate approach produces better outcomes.
Many developers use both. As one practitioner put it: "Cursor is my go-to when I need speed inside the editor; Claude is my go-to when I need thoughtful output and tool-driven workflows."
You can even run Claude Code inside Cursor's terminal, the best of both worlds.
The browser wars are back, and this time it's AI-native versus AI-added.
Perplexity's Comet launched on October 2, 2025, with a research-first philosophy. It excels at multi-source synthesis with transparent citations. Every answer shows where the information came from.
OpenAI's Atlas launched on October 21, 2025, with automation as its focus. The persistent ChatGPT sidebar handles summarizing, drafting, and analyzing. Browser Memories recall past activity for personalized suggestions.
Expert testing consistently shows Perplexity's Comet completing tasks somewhat quicker, with few (if any) significant glitches. Users report 30-40% faster execution for routine agentic workflows with Comet.
Both require premium subscriptions for full features ($20/month for Pro tiers). For pure research speed, Comet wins. For deep ChatGPT integration and task automation, Atlas has its place.
Just when it seemed like Chrome was losing relevance, Google announced Gemini integration in December 2025. Native AI in the address bar. Multi-tab summarization. Agentic capabilities for handling tasks like booking and ordering.
As of December 11, 2025, Gemini is rolling out to Chrome on iOS. This isn't a bolt-on feature; Google describes it as the most significant upgrade to Chrome in its history.
For users already in the Google ecosystem, this might eliminate the need for separate AI browsers entirely.
For visual content, the hierarchy is straightforward: Google leads, OpenAI competes, everyone else follows.
Veo 3 produces 4K videos with synchronized dialogue, background music, and environmental sounds from text prompts. The visual fidelity comes from training on YouTube's massive video dataset: Google understands motion, lighting, and physics in ways competitors don't.
For production-quality video work, Veo is the answer.
Sora 2 can generate clips up to 60 seconds, significantly longer than Veo's. The editing tools (Remix, Loop, Blend) make it flexible for creators who want to iterate.
Sora is fun. It's good for experimentation and creative exploration. But for serious production work, Veo's cinematic quality wins.
That said, at least one rigorous side-by-side test reached the opposite verdict, naming Sora the winner for its smoother motion, fitting audio, and fewer hallucinations.
The truth is both are excellent for different use cases. Veo for professional-grade output. Sora for creative flexibility and longer-form content.
After running all of these tools in production, here's how the stack actually breaks down:
| Task | Best Tool | Why |
|---|---|---|
| Production coding | Claude Opus 4.5 | SWE-bench leader, handles complexity |
| Quick code completions | Cursor + Composer 1 | Speed, IDE integration |
| Strategic research | ChatGPT (o1/GPT-5.2) | Synthesis, chain-of-thought reasoning |
| Real-time social data | Grok 4.1 | X integration, current events |
| Multimodal analysis | Gemini 3 Pro | 20x better screenshot understanding |
| Cost-sensitive scale | Gemini 3 Pro | 60% cheaper than alternatives |
| Research browsing | Perplexity Comet | Citations, speed, reliability |
| Task automation | OpenAI Atlas or Chrome+Gemini | Agent modes, workflow integration |
| Production video | Google Veo 3 | Cinematic quality, audio sync |
| Creative video | OpenAI Sora 2 | Flexibility, longer clips, editing tools |
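If you script your workflows, the same decision table can live in code. The sketch below is purely illustrative: the task categories and tool names simply mirror the table above and are not tied to any vendor API.

```python
# Illustrative only: the recommended stack above expressed as a simple lookup table.
# Task keys are this article's categories; tool names mirror the table, not an API.
STACK = {
    "production coding":      "Claude Opus 4.5",
    "quick code completions": "Cursor + Composer 1",
    "strategic research":     "ChatGPT (o1 / GPT-5.2)",
    "real-time social data":  "Grok 4.1",
    "multimodal analysis":    "Gemini 3 Pro",
    "cost-sensitive scale":   "Gemini 3 Pro",
    "research browsing":      "Perplexity Comet",
    "task automation":        "OpenAI Atlas or Chrome + Gemini",
    "production video":       "Google Veo 3",
    "creative video":         "OpenAI Sora 2",
}

def pick_tool(task: str) -> str:
    """Return the default tool for a task category, per the table above."""
    return STACK.get(task, "Claude Opus 4.5")  # arbitrary fallback for uncategorized work

print(pick_tool("real-time social data"))  # -> Grok 4.1
```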
If forced to drop one subscription, Perplexity would go first, but only because Comet is its killer feature and I can access that without the full subscription. The search engine itself has been somewhat eclipsed by Gemini's Deep Research and ChatGPT's web browsing.
Everything else? Non-negotiable. Each has its lane, and the productivity gains from using the right tool for each task compound over time.
The AI landscape has fragmented into specializations. The winners aren't the people chasing a single best model; they're the ones who've learned which tool to reach for, and when.
This article is a live example of the AI-enabled content workflow we build for clients.
| Stage | Who | What |
|---|---|---|
| Research | Claude Opus 4.5 | Analyzed current industry data, studies, and expert sources |
| Curation | Tom Hundley | Directed focus, validated relevance, ensured strategic alignment |
| Drafting | Claude Opus 4.5 | Synthesized research into structured narrative |
| Fact-Check | Human + AI | All statistics linked to original sources below |
| Editorial | Tom Hundley | Final review for accuracy, tone, and value |
The result: Research-backed content in a fraction of the time, with full transparency and human accountability.
We're an AI enablement company. It would be strange if we didn't use AI to create content. But more importantly, we believe the future of professional content isn't AI vs. human; it's AI amplifying human expertise.
Every article we publish demonstrates the same workflow we help clients implement: AI handles the heavy lifting of research and drafting, humans provide direction, judgment, and accountability.
Want to build this capability for your team? Let's talk about AI enablement →