
🤖 Ghostwritten by Claude Opus 4.5 · Curated by Tom Hundley
This article was written by Claude Opus 4.5 based on field notes from Tom Hundley, and curated for publication.
There's a moment when you're deep in a coding session with an AI agent and you realize: I need a quick placeholder image. Or a diagram. Or a banana.
Yes, a banana. More specifically, invoking "nano banana" (the nickname that stuck to Gemini's image generation model) in an agent editor to test whether image generation actually works inline. It's the kind of absurdist sanity check developers do.
Here's what I discovered: in Google Antigravity and Gemini, this just... works. You can invoke image generation directly within the agent editor. Need a quick mockup? A placeholder for UI testing? A diagram to illustrate a concept? It's right there, integrated into the flow.
Then I switched to Claude Code.
No image generation. The capability simply isn't there.
Tried Codex next. The response was explicit: "I don't have access to an image generation model here—only the CLI tools shown (shell, file I/O, etc.)."
Now, can you script around this? Absolutely. You can write a quick script that hits the DALL-E API or Imagen or Stable Diffusion. Pipe the result to a file. Move on with your day.
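And the workaround really is small. A minimal sketch, assuming the OpenAI Python SDK and an OPENAI_API_KEY already in your environment; the model name and output path are just illustrative choices:

```python
# gen_image.py -- one-off placeholder image, outside the agent's flow.
# Assumes: `pip install openai` and OPENAI_API_KEY set in the environment.
import base64
import sys

from openai import OpenAI


def main() -> None:
    prompt = " ".join(sys.argv[1:]) or "a photorealistic nano banana on a desk"
    client = OpenAI()  # picks up OPENAI_API_KEY automatically

    # Ask for base64 so we can write the file directly instead of
    # fetching a signed URL in a second step.
    result = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        response_format="b64_json",
        n=1,
    )

    out_path = "placeholder.png"
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))
    print(f"wrote {out_path}")


if __name__ == "__main__":
    main()
```

It works. It's also exactly the kind of detour the next paragraph is complaining about.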
But here's the thing: having to break flow to set up an API call, handle credentials, write error handling, manage rate limits—that's friction. It's cognitive overhead. It's the difference between thinking about your problem and thinking about your tooling.
When image generation is natively integrated, you stay in the zone. You describe what you need, it appears, you keep building. The same reason we like IDEs with integrated terminals instead of switching windows constantly.
This isn't really about images. It's about what these AI agents are designed to be.
Google's tools (Antigravity and Gemini) are positioning themselves as multimodal development environments. Code, images, eventually audio and video—all accessible through the same interface. They're betting that developers want a Swiss Army knife. With Antigravity announced alongside Gemini 3, Google is clearly all-in on the multimodal agent-first IDE.
Claude Code and Codex are currently positioned as pure code/text tools. Excellent at what they do—often superior in reasoning and code quality—but deliberately scoped. The philosophy seems to be: do fewer things, do them exceptionally well.
Neither approach is wrong. But if you're building something visual—a frontend, a game, documentation with diagrams—the multimodal agents have a real workflow advantage right now.
Anthropic needs to build an image generation model for Claude. Or at minimum, integrate one.
This isn't about feature parity for its own sake. It's about the natural evolution of what an AI coding assistant should be. Developers don't work in text-only environments. We generate diagrams, mockups, icons, screenshots, visual documentation. Having to exit the agent to do any of that is a seam in the experience.
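To be fair, "integrate one" is already possible from the outside: Claude Code can call external tools over MCP, so you can expose an image model as a tool it invokes on request. A minimal sketch, assuming the official MCP Python SDK (FastMCP) and the same OpenAI images endpoint as above; the server name and tool signature are my own invention:

```python
# image_mcp.py -- expose image generation to Claude Code as an MCP tool.
# Assumes: `pip install "mcp[cli]" openai` and OPENAI_API_KEY in the environment.
import base64

from mcp.server.fastmcp import FastMCP
from openai import OpenAI

mcp = FastMCP("image-gen")
client = OpenAI()


@mcp.tool()
def generate_image(prompt: str, out_path: str = "generated.png") -> str:
    """Generate an image from a text prompt and save it to out_path."""
    result = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        response_format="b64_json",
        n=1,
    )
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))
    return f"Saved image to {out_path}"


if __name__ == "__main__":
    mcp.run()  # stdio transport; register with something like `claude mcp add`
```

That closes the gap functionally, but it proves the point: the capability isn't hard to bolt on. It's just not native.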
OpenAI's Codex (at least in its CLI incarnation) has the same gap. They obviously have the image models internally; ChatGPT generates images on demand. The question is whether they'll surface that capability in their coding tools.
My prediction: within the next six months, this capability gap will close. The pressure from Google's multimodal push will force it. The question is whether Anthropic and OpenAI will build native capabilities or just wire up API connections to existing models.
The former is better. When image generation is a first-class citizen of the model architecture—trained alongside code understanding—you get more coherent results. The model understands context across modalities rather than just shuttling text descriptions to a separate system.
If you're choosing an AI coding agent today, the deciding question is how much visual generation actually matters to your workflow.
If your work is primarily backend, infrastructure, or text-based tooling, the multimodal gap might not matter much. The pure code agents are excellent at what they do.
But if you're building anything with a visual component, those seamless image generation moments add up. It's one of those capabilities that sounds like a gimmick until you need it—and then you really need it.
We're watching AI coding tools differentiate along capability axes in real time. Some are going wide (multimodal everything). Some are going deep (best-in-class text reasoning). Eventually, the successful ones will need to do both.
For now, it's worth knowing where the gaps are. And if you're in an Antigravity or Gemini session and you need a quick banana to test something, you can just... ask for it.
That's genuinely nice.
Field notes from actual development sessions across multiple AI agents. Your mileage may vary depending on version, configuration, and whether you actually need nano bananas.