Which LLM Really Handles Complex 3D Coding? A Stress Test You Won’t See in Demos

Aug 22, 2025
I gave GPT-5, Claude Sonnet 4, Gemini Flash 2.5, DeepSeek DeepThink (R1), and Amazon Kiro the same complex coding task: a 3D neural network simulation in Three.js. Only one model came close to production-level quality.
Why I Ran This Experiment
Most LLM reviews show simple tasks: “write a landing page” or “generate a Python script.” But those don’t really tell us how these models behave under real production-like complexity.
I wanted to see how these models perform when asked to tackle a tightly constrained, visually verifiable, and mathematically precise coding challenge — one where mistakes are immediately obvious.
So I wrote a prompt about 1,500 tokens long, spelling out every requirement of a 3D neural network visualization in Three.js.

The article below includes only a structured summary of those requirements (for readability). But make no mistake: the actual spec was meticulous, covering exact math formulas, timing phases, firing rules, and strict boundary enforcement.
The Benchmark Problem
Task (summary of original prompt):
Create a professional 3D neural network visualization using Three.js that demonstrates dynamic shape morphing, neural activity simulation, interactive camera controls, and perfect geometric containment of all neural elements.
Highlights of the Requirements
🔹 Shapes & Morphing
- Cycle through 5 shapes in strict order: Sphere → Cone → Cube → Pyramid → Cylinder
- Smooth morphing transitions, infinite loop
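To make the morphing requirement concrete, here is a rough sketch of how the shape cycle could be driven. It is my own illustration, not taken from any model's output; `neurons` (the 1,200 meshes) and `shapeTargets` (precomputed per-shape target positions) are names I invented, and the 3-second duration is an assumption.

```javascript
// Sketch: cycle neuron positions through per-shape targets in strict order.
// Call updateMorph(timestamp) from the requestAnimationFrame render loop.
const SHAPE_ORDER = ['sphere', 'cone', 'cube', 'pyramid', 'cylinder'];
const MORPH_DURATION = 3000; // ms per transition (assumed value)

let shapeIndex = 0;
let morphStart = performance.now();

function easeInOut(t) {
  return t < 0.5 ? 2 * t * t : 1 - Math.pow(-2 * t + 2, 2) / 2;
}

function updateMorph(now) {
  const t = Math.min((now - morphStart) / MORPH_DURATION, 1);
  const from = shapeTargets[SHAPE_ORDER[shapeIndex]];
  const to = shapeTargets[SHAPE_ORDER[(shapeIndex + 1) % SHAPE_ORDER.length]];
  const k = easeInOut(t);
  neurons.forEach((n, i) => n.position.lerpVectors(from[i], to[i], k));
  if (t === 1) { // advance to the next shape and loop forever
    shapeIndex = (shapeIndex + 1) % SHAPE_ORDER.length;
    morphStart = now;
  }
}
```

Interpolating per-neuron targets keeps the loop trivially infinite; the sketch omits any hold phase between shapes.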
🔹 Neurons & Connections
- Exactly 1,200 neurons (white wireframe spheres)
- At least 3 connections per neuron, proximity-based
- Zero tolerance for neurons or links outside the shape boundary
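The containment rule is where most models stumbled, as you'll see below, so here is a rough sketch of one way to enforce it for the sphere case. Again, this is my own illustration; `R`, `NEURON_COUNT`, and `MIN_LINKS` are invented names, and each non-spherical shape would need its own inside-the-boundary test.

```javascript
// Sketch: keep every neuron strictly inside a sphere of radius R via
// rejection sampling, then link each neuron to its nearest neighbours.
const R = 10, NEURON_COUNT = 1200, MIN_LINKS = 3;

function randomPointInSphere(radius) {
  // Retry until the point falls inside the sphere, so nothing can ever
  // sit outside the boundary.
  while (true) {
    const p = new THREE.Vector3(
      (Math.random() * 2 - 1) * radius,
      (Math.random() * 2 - 1) * radius,
      (Math.random() * 2 - 1) * radius
    );
    if (p.length() <= radius) return p;
  }
}

const positions = Array.from({ length: NEURON_COUNT }, () => randomPointInSphere(R));

// Proximity-based links: each neuron connects to its MIN_LINKS nearest neighbours.
const links = [];
positions.forEach((p, i) => {
  const nearest = positions
    .map((q, j) => ({ j, d: p.distanceTo(q) }))
    .filter(({ j }) => j !== i)
    .sort((a, b) => a.d - b.d)
    .slice(0, MIN_LINKS);
  nearest.forEach(({ j }) => links.push([i, j]));
});
```

Rejection sampling makes out-of-bounds neurons impossible by construction, which is exactly the guarantee most of the models below failed to provide.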
🔹 Activity Simulation
- A random subset of neurons (~1.5% of the total) fires every 150ms
- Red “meteor effect” travels along links.
- Destination neuron glows red and scales 3x for 800ms, then resets.
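For reference, a minimal sketch of the firing loop. It is illustrative only; it assumes each neuron has its own material (so tinting one doesn't tint them all), reuses the `neurons` and `links` names from the earlier sketches, and leaves out the meteor effect.

```javascript
// Sketch: every 150 ms, ~1.5% of neurons fire; the destination neuron
// glows red and scales 3x, then resets after 800 ms.
const FIRE_INTERVAL = 150;   // ms between firing ticks
const FIRE_RATIO = 0.015;    // ~1.5% of neurons per tick
const GLOW_DURATION = 800;   // ms before the destination neuron resets

setInterval(() => {
  const fireCount = Math.round(neurons.length * FIRE_RATIO);
  for (let i = 0; i < fireCount; i++) {
    const source = Math.floor(Math.random() * neurons.length);
    const outgoing = links.filter(([a]) => a === source);
    if (outgoing.length === 0) continue;
    const [, target] = outgoing[Math.floor(Math.random() * outgoing.length)];
    // (red meteor travelling along the link omitted for brevity)
    const mesh = neurons[target];
    mesh.material.color.set(0xff0000); // glow red
    mesh.scale.setScalar(3);           // scale 3x
    setTimeout(() => {                 // reset after 800 ms
      mesh.material.color.set(0xffffff);
      mesh.scale.setScalar(1);
    }, GLOW_DURATION);
  }
}, FIRE_INTERVAL);
```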
🔹 Camera & Interaction
- Interactive rotation with mouse; gentle auto-rotation fallback.
- Cinematic zoom-in (3s) and zoom-out (3s), synced with morph transitions.
- Smooth easing, no jerky movements.
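A bare-bones sketch of the camera behaviour, assuming a hand-rolled orbit rather than OrbitControls; `isDragging` would be set by mouse handlers, and the rotation speed and radii are made-up values.

```javascript
// Sketch: gentle auto-rotation that yields to mouse dragging, plus an
// ease-in-out zoom usable for the cinematic 3 s zoom phases.
let angle = 0;
const AUTO_ROTATE_SPEED = 0.1; // radians per second (assumed value)

function easeInOutCubic(t) {
  return t < 0.5 ? 4 * t * t * t : 1 - Math.pow(-2 * t + 2, 3) / 2;
}

function updateCamera(deltaSeconds, zoomProgress) {
  if (!isDragging) angle += AUTO_ROTATE_SPEED * deltaSeconds; // auto-rotation fallback
  const radius = THREE.MathUtils.lerp(40, 20, easeInOutCubic(zoomProgress)); // smooth zoom
  camera.position.set(radius * Math.sin(angle), 0, radius * Math.cos(angle));
  camera.lookAt(0, 0, 0);
}
```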
🔹 Performance
- 60fps target.
- Memory-efficient cleanup of effects.
- Responsive resizing.
- Single HTML file with Three.js via CDN.
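And a small sketch of the performance housekeeping: responsive resizing plus explicit disposal of short-lived effect meshes (the `disposeEffect` helper is hypothetical).

```javascript
// Sketch: keep the canvas responsive and free GPU resources for temporary
// effect meshes such as the red meteor.
window.addEventListener('resize', () => {
  camera.aspect = window.innerWidth / window.innerHeight;
  camera.updateProjectionMatrix();
  renderer.setSize(window.innerWidth, window.innerHeight);
});

function disposeEffect(effect) {
  scene.remove(effect);          // detach from the scene graph
  effect.geometry.dispose();     // free GPU buffers
  effect.material.dispose();     // free the material
}
```

Without explicit disposal, the constantly spawned firing effects would leak GPU memory over a long run.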
In short: a professional-grade animation system that would expose weaknesses in any model’s reasoning or code structure.
The Test Setup
I tested five LLMs:
- Chat interfaces: GPT-5, Claude Sonnet 4, Gemini Flash 2.5, DeepSeek DeepThink (R1)
- Software development environment: Amazon Kiro
Every model received the same requirements. I evaluated them on accuracy, containment, animation quality, and refinement potential.
Results by Model
GPT-5 (Chat UI)
GPT-5 generated workable code, but with serious issues:
- Neurons consistently leaked outside shape boundaries.
- Connections were extended incorrectly.
- Neuron firing lacked polish and synchronization.
- Even with multiple refinement prompts, shapes stayed messy.
Using GPT-5 Agents (longer reasoning) prolonged the process but didn’t fix alignment errors — cones inverted, pyramids misplaced, neurons still out of bounds.
Verdict: Ambitious, but ultimately unreliable for this kind of strict geometry.
Claude Sonnet 4 (Chat UI)
Claude performed better than GPT-5:
- More structured output and closer adherence to requirements.
- Shapes looked cleaner, firing slightly more naturally.
But…
- Neurons still escaped boundaries.
- Links didn’t always update correctly.
- After 5–7 refinement prompts, things improved, but never reached spec.
Verdict: Better discipline than GPT-5, but still not production-ready.
Gemini Flash 2.5 (Chat UI)
When given the prompt, Gemini replied with:
“I am unable to generate a complete and functional HTML file with the complex Three.js code you’ve requested. My capabilities are limited to providing information and generating text or images, and I cannot create a fully-coded, executable web page.”
Verdict: Refreshingly honest, but ultimately a non-result for this test.
DeepSeek DeepThink (R1) (Chat UI)
Marketed as a “thinking” model, DeepSeek showed the same execution gaps as GPT and Claude:
- Out-of-bounds neurons.
- Misaligned links.
- Firing animations lacked precision.
Reasoning ability didn’t translate into technical correctness here.
Verdict: Same flaws, different branding.
Amazon Kiro (Software Dev Environment)
This was the surprise.
Unlike the chat UIs, Kiro didn’t just dump code. It followed a structured development workflow:
- Converted my long prompt into a Markdown requirements doc.
- Translated that into a design specification.
- Broke it down into a task list.
- Then generated the implementation.
The first attempt missed a few requirements. But when I asked:
“Are you sure you implemented all the requirements?”
Kiro did something no other model did:
- It checked its work against the requirements list.
- Found the gaps.
- Fixed them in the next iteration with a single prompt.
The final result hit about 95% of the full spec.
Verdict: The only LLM-based system in this test that came close to implementing the full spec correctly.
Final Rankings
🥇 Amazon Kiro → Clear winner. Met nearly all requirements with minimal prompting.
🥈 Claude Sonnet 4 → Strong attempt, but too much manual refinement needed.
🥉 GPT-5 → Ambitious but messy and inconsistent.
🎭 DeepSeek DeepThink (R1) → Same flaws as GPT/Claude, despite “reasoning” claims.
🚫 Gemini Flash 2.5 → Did not attempt implementation.
Takeaways
- Long, detailed prompts expose model weaknesses. The 1,500-token spec I provided was deliberately precise, and most models struggled to stay within those constraints.
- Chat UIs aren’t enough for production-level tasks. GPT, Claude, DeepSeek, and Gemini behaved like “helpers.” They don’t enforce discipline unless you spoon-feed corrections for hours.
- Requirement verification is the difference-maker. Amazon Kiro stood out because it didn’t just generate code: it cross-checked requirements, found gaps, and corrected them. That’s what made it succeed where others failed.
Closing Thoughts
This experiment convinced me of something bigger:
Future LLMs won’t just be about being “smarter.” They’ll need to adopt structured engineering workflows — breaking problems down, checking against specs, and iterating systematically.
Because in real-world coding, intelligence without verification isn’t enough.
And for now, Kiro is the only LLM-based system I’ve tested that was able to implement nearly all of the requirements correctly.
Note: Some of the content is AI-generated.
