
Which LLM Really Handles Complex 3D Coding? A Stress Test You Won’t See in Demos

Dr Mukhtiar Ahmad

Aug 22, 2025

4 min read

I gave GPT-5, Claude Sonnet 4, Gemini Flash 2.5, DeepSeek DeepThink (R1), and Amazon Kiro the same complex coding task: a 3D neural network simulation in Three.js. Only one model came close to production-level quality.

Why I Ran This Experiment

Most LLM reviews show simple tasks: “write a landing page” or “generate a Python script.” But those don’t really tell us how these models behave under real production-like complexity. I wanted to see how they perform on a tightly constrained, visually verifiable, and mathematically precise coding challenge, one where mistakes are immediately obvious. So I wrote a prompt about 1,500 tokens long, spelling out every requirement of a 3D neural network visualization in Three.js.

The article below includes only a structured summary of those requirements (for readability). But make no mistake: the actual spec was meticulous, covering exact math formulas, timing phases, firing rules, and strict boundary enforcement.

The Benchmark Problem

Task (summary of original prompt):

Create a professional 3D neural network visualization using Three.js that demonstrates dynamic shape morphing, neural activity simulation, interactive camera controls, and perfect geometric containment of all neural elements.


Highlights of the Requirements

🔹 Shapes & Morphing

  • Cycle through 5 shapes in strict order: Sphere → Cone → Cube → Pyramid → Cylinder
  • Smooth morphing transitions, infinite loop (see the sketch below)
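
A minimal sketch of how such a transition could be driven, assuming THREE is loaded from a CDN (as the spec requires) and that the target positions for each of the five shapes are precomputed, one Vector3 per neuron. The easeInOut curve and the morphNeurons helper are my own illustration, not output from any of the tested models:

```js
// Interpolate each neuron between the outgoing and incoming shape's
// precomputed target points; an ease-in-out curve keeps it smooth.
const easeInOut = (t) => (t < 0.5 ? 2 * t * t : 1 - Math.pow(-2 * t + 2, 2) / 2);

function morphNeurons(neurons, fromPoints, toPoints, t) {
  const k = easeInOut(t); // t runs 0..1 over one transition
  for (let i = 0; i < neurons.length; i++) {
    neurons[i].position.lerpVectors(fromPoints[i], toPoints[i], k);
  }
}
```

Driving t from a THREE.Clock and swapping the from/to point sets at the end of each transition yields the infinite Sphere → Cone → Cube → Pyramid → Cylinder loop.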

🔹 Neurons & Connections

  • Exactly 1,200 neurons (white wireframe spheres)
  • At least 3 connections per neuron, proximity-based
  • Zero tolerance for neurons or links outside the shape boundary (see the sketch below)
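
Containment is exactly where most models failed, which makes it worth showing how simple a correct approach can be. Here is a hedged sketch of one way to enforce both rules, shown for the sphere case only (each other solid needs its own inside test); the helper names and thresholds are mine:

```js
// Rejection sampling guarantees every neuron starts inside the boundary.
function samplePointsInSphere(count, R) {
  const pts = [];
  while (pts.length < count) {
    const p = new THREE.Vector3(
      (Math.random() * 2 - 1) * R,
      (Math.random() * 2 - 1) * R,
      (Math.random() * 2 - 1) * R
    );
    if (p.length() <= R) pts.push(p); // reject anything outside
  }
  return pts;
}

// Proximity-based links, padded with nearest neighbours so every neuron
// ends up with at least minLinks connections. O(n^2 log n) is fine as a
// one-time build step for 1,200 neurons.
function buildConnections(points, maxDist, minLinks = 3) {
  const links = new Set();
  for (let i = 0; i < points.length; i++) {
    const byDist = points
      .map((p, j) => ({ j, d: points[i].distanceTo(p) }))
      .filter((n) => n.j !== i)
      .sort((a, b) => a.d - b.d);
    const within = byDist.filter((n) => n.d <= maxDist).length;
    for (const n of byDist.slice(0, Math.max(minLinks, within))) {
      links.add(i < n.j ? `${i}-${n.j}` : `${n.j}-${i}`); // dedupe pairs
    }
  }
  return [...links].map((s) => s.split('-').map(Number));
}
```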

🔹 Activity Simulation

  • Every 150ms, a random ~1.5% of neurons fire.
  • A red “meteor effect” travels along the links.
  • The destination neuron glows red and scales to 3x for 800ms, then resets (see the sketch below).
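
A sketch of how those firing rules translate into code, reusing neurons (meshes) and links (index pairs) from the previous sketch. launchMeteor is a hypothetical placeholder for the red streak animation, which I elide here:

```js
const FIRE_INTERVAL_MS = 150, FIRE_RATIO = 0.015, GLOW_MS = 800;

setInterval(() => {
  const fires = Math.round(neurons.length * FIRE_RATIO); // ~18 of 1,200
  for (let k = 0; k < fires; k++) {
    // Pick a random link; its source neuron "fires" and the signal
    // travels along it to the destination neuron.
    const [a, b] = links[Math.floor(Math.random() * links.length)];
    launchMeteor(neurons[a].position, neurons[b].position);
    const target = neurons[b];
    target.material.color.set(0xff0000); // glow red...
    target.scale.setScalar(3);           // ...at 3x scale
    setTimeout(() => {                   // reset after the glow window
      target.material.color.set(0xffffff);
      target.scale.setScalar(1);
    }, GLOW_MS);
  }
}, FIRE_INTERVAL_MS);
```

One Three.js subtlety: if all 1,200 neurons share a single material, tinting one tints them all, so the glow effect needs per-neuron (or cloned) materials.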

🔹 Camera & Interaction

  • Interactive rotation with mouse; gentle auto-rotation fallback.
  • Cinematic zoom-in (3s) and zoom-out (3s), synced with morph transitions.
  • Smooth easing, no jerky movements (see the sketch below).
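
A sketch of the interaction model under the same assumptions (camera is a THREE.PerspectiveCamera aimed at the origin). Damped drag velocity is one straightforward way to satisfy the no-jerky-movements rule:

```js
let theta = 0;             // orbit angle
let dragVel = 0;           // velocity injected by the mouse
const AUTO_SPEED = 0.002;  // gentle auto-rotation, radians per frame

window.addEventListener('mousemove', (e) => {
  if (e.buttons === 1) dragVel = e.movementX * 0.005; // drag to rotate
});

function updateCamera(radius) { // call once per animation frame
  dragVel *= 0.95;              // damping eases the drag out smoothly
  theta += AUTO_SPEED + dragVel;
  camera.position.set(radius * Math.sin(theta), 0, radius * Math.cos(theta));
  camera.lookAt(0, 0, 0);
}
```

Easing radius between a near and a far value over 3 seconds, keyed to the morph timer, would give the synced cinematic zoom.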

🔹 Performance

  • 60fps target.
  • Memory-efficient cleanup of effects (see the sketch below).
  • Responsive resizing.
  • Single HTML file with Three.js via CDN.
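
Finally, a sketch of the housekeeping those bullets imply (camera and renderer are the usual Three.js globals). scene.remove alone does not free GPU memory, which is the classic source of leaks in effect-heavy animations:

```js
// Transient effects (meteors, glows) must release their GPU buffers.
function removeEffect(scene, mesh) {
  scene.remove(mesh);        // detaches from the scene graph only
  mesh.geometry.dispose();   // frees the GPU-side vertex buffers
  mesh.material.dispose();   // frees the compiled shader/material
}

// Keep the canvas and projection in sync with the window.
window.addEventListener('resize', () => {
  camera.aspect = window.innerWidth / window.innerHeight;
  camera.updateProjectionMatrix();
  renderer.setSize(window.innerWidth, window.innerHeight);
});
```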

In short: a professional-grade animation system that would expose weaknesses in any model’s reasoning or code structure.

The Test Setup

I tested five LLMs:

  • Chat interfaces: GPT-5, Claude Sonnet 4, Gemini Flash 2.5, DeepSeek DeepThink (R1)
  • Software development environment: Amazon Kiro

Every model received the same requirements. I evaluated them on accuracy, containment, animation quality, and refinement potential.


Results by Model

GPT-5 (Chat UI)

GPT-5 generated workable code, but with serious issues:

  • Neurons consistently leaked outside shape boundaries.
  • Connections were extended incorrectly.
  • Neuron firing lacked polish and synchronization.
  • Even with multiple refinement prompts, shapes stayed messy.

Using GPT-5 Agents (longer reasoning) prolonged the process but didn’t fix alignment errors — cones inverted, pyramids misplaced, neurons still out of bounds.

Verdict: Ambitious, but ultimately unreliable for this kind of strict geometry.

Claude Sonnet 4 (Chat UI)

Claude performed better than GPT-5:

  • More structured output and closer adherence to requirements.
  • Shapes looked cleaner, and neurons fired slightly more naturally.

But…

  • Neurons still escaped boundaries.
  • Links didn’t always update correctly.
  • After 5–7 refinement prompts, the output improved but never reached the spec.

Verdict: Better discipline than GPT-5, but still not production-ready.


Gemini Flash 2.5 (Chat UI)

When given the prompt, Gemini replied with:

I am unable to generate a complete and functional HTML file with the complex Three.js code you’ve requested. My capabilities are limited to providing information and generating text or images, and I cannot create a fully-coded, executable web page.

Verdict: Refreshingly honest, but ultimately a non-result for this test.

DeepSeek DeepThink (R1) (Chat UI)

Marketed as a “thinking” model, DeepSeek showed the same execution gaps as GPT and Claude:

  • Out-of-bounds neurons.
  • Misaligned links.
  • Firing animations lacked precision.

Reasoning ability didn’t translate into technical correctness here.

Verdict: Same flaws, different branding.


Amazon Kiro (Software Dev Environment)

This was the surprise.

Unlike the chat UIs, Kiro didn’t just dump code. It followed a structured development workflow:

  1. Converted my long prompt into a Markdown requirements doc.
  2. Translated that into a design specification.
  3. Broke it down into a task list.
  4. Then generated the implementation.

The first attempt missed a few requirements. But when I asked:

Are you sure you implemented all the requirements?

Kiro did something no other model did:

  • It checked its work against the requirements list.
  • Found the gaps.
  • Fixed them in the next iteration with a single prompt.

The final result hit about 95% of the full spec.

Verdict: The only LLM-based system in this test that implemented nearly all the requirements correctly.

Final Rankings

🥇 Amazon Kiro → Clear winner. Met nearly all requirements with minimal prompting.
🥈 Claude Sonnet 4 → Strong attempt, but too much manual refinement needed.
🥉 GPT-5 → Ambitious but messy and inconsistent.
🎭 DeepSeek DeepThink (R1) → Same flaws as GPT/Claude, despite “reasoning” claims.
🚫 Gemini Flash 2.5 → Did not attempt implementation.


Takeaways

  1. Long, detailed prompts expose model weaknesses.
    The 1,500-token spec I provided was deliberately precise, and most models struggled to stay within those constraints.

  2. Chat UIs aren’t enough for production-level tasks.
    GPT, Claude, DeepSeek, and Gemini behaved like “helpers.” They don’t enforce discipline unless you spoon-feed corrections for hours.

  3. Requirement verification is the difference-maker.
    Amazon Kiro stood out because it didn’t just generate code — it cross-checked requirements, found gaps, and corrected them. That’s what made it succeed where others failed.


Closing Thoughts

This experiment convinced me of something bigger:

Future LLMs won’t just be about being “smarter.” They’ll need to adopt structured engineering workflows — breaking problems down, checking against specs, and iterating systematically.

Because in real-world coding, intelligence without verification isn’t enough.

And for now, Kiro is the only LLM-based system I’ve tested that implemented nearly all the requirements correctly.


Note: Some of the content is AI-generated.
