How AI Code Generation Actually Works (And Where It Breaks Down)
A technical look at how AI generates working applications from text prompts — the 8-stage pipeline, why code healing matters, and what AI still can't do.
Ambuj Agrawal
Founder & CEO
Most AI Code Generation is a Single Prompt
The simplest approach to AI code generation is a single LLM call: take user input, generate code, return it. This is what early tools did. It works for trivial cases — a counter component, a styled card. It falls apart on anything with state management, API calls, or multi-file architecture.
The failure mode is predictable: the LLM generates code that looks correct syntactically but breaks at runtime. A fetch call to a non-existent endpoint. A React component that references a hook defined in a file that was never created. CSS classes that don't match the Tailwind version being used.
We built GenMB to solve this. Here's what actually happens when you type "build me a project management app with kanban boards and team collaboration."
The 8-Stage Pipeline
GenMB's code generation runs through eight discrete stages. Each stage can fail independently, and each has its own error recovery path.
Stage 1: Validate — Check the prompt for feasibility. Reject impossible requests early ("build me a blockchain in the browser" gets flagged). Extract structured intent: features, data models, UI requirements.
Stage 2: Analyze — Determine project architecture. Single-file for simple apps, multi-file for anything with routes, state, or backend integration. Detect which services are needed: does this need a database? Authentication? File storage?
Stage 3: Prepare — Assemble the generation context. This includes the user's prompt, any active plugins (Supabase, Stripe, Clerk), detected services, the chosen framework (Vanilla JS, React, or React+TypeScript), and the project's existing code if this is a refinement.
Stage 4: Generate — The actual LLM call. This uses structured prompts from our generation module — different prompts for single-file apps, multi-file projects, and fullstack applications with backends. Temperature is set to 0.3 for code generation (low creativity, high consistency).
Stage 5: Parse — Extract code from the LLM response. This is harder than it sounds. LLMs produce markdown code blocks with inconsistent delimiters, sometimes nest files incorrectly, or return partial responses on timeout. The parser handles all of these.
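To make the parsing problem concrete, here is a minimal sketch of fenced-block extraction. This is an illustration, not GenMB's actual parser: it handles the common case of triple-backtick blocks with inconsistent fence lengths and an optional language tag, and it deliberately omits the fallback pass a real parser needs for truncated responses with an unclosed fence.

```typescript
// Minimal fenced-block extractor (illustrative, not GenMB's parser).
interface ParsedBlock {
  lang: string;
  code: string;
}

function extractCodeBlocks(response: string): ParsedBlock[] {
  const blocks: ParsedBlock[] = [];
  // Match fences of 3+ backticks; LLMs are inconsistent about length.
  // The back-reference \1 requires the closing fence to match the opener.
  const fence = /(`{3,})([\w+-]*)\n([\s\S]*?)\1/g;
  let match: RegExpExecArray | null;
  while ((match = fence.exec(response)) !== null) {
    blocks.push({ lang: match[2] || "text", code: match[3].trimEnd() });
  }
  return blocks;
}
```

A truncated response (timeout mid-generation) produces no match here, which is exactly the case that needs its own recovery path.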
Stage 6: Validate and Heal — This is where Code Healer runs. The generated code is checked for syntax errors, missing imports, undefined references, and framework-specific issues. If errors are found, Code Healer edits the code using tool-based passes — it reads the problematic file, makes targeted fixes, and writes it back. If the tool-based approach fails, it falls back to a full JSON-based regeneration of the broken file. This loop runs up to 10 times for single-file projects and 15 times for multi-file.
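The shape of that loop can be sketched as follows. The helper functions are placeholders, not GenMB's internal API, and real healing calls would be async; this version is kept synchronous for brevity.

```typescript
// Illustrative validate-and-heal loop with a bounded pass count.
type Issue = { file: string; message: string };
type Project = Map<string, string>; // filename -> contents

function healUntilClean(
  project: Project,
  validate: (p: Project) => Issue[],
  healWithTools: (p: Project, issue: Issue) => boolean,
  regenerateFile: (p: Project, file: string) => void,
  maxPasses: number, // e.g. 10 for single-file projects, 15 for multi-file
): boolean {
  for (let pass = 0; pass < maxPasses; pass++) {
    const issues = validate(project);
    if (issues.length === 0) return true; // clean: stop early
    for (const issue of issues) {
      // Targeted, tool-based edit first; full regeneration of the
      // broken file as the fallback.
      if (!healWithTools(project, issue)) {
        regenerateFile(project, issue.file);
      }
    }
  }
  return validate(project).length === 0;
}
```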
Stage 7: Enhance — Optional post-processing. PWA manifest generation if requested. SDK injection for detected services (file storage, auth, AI chatbot). Plugin code injection for enabled integrations.
Stage 8: Finalize — Save to Firestore, create a version snapshot, return the result to the frontend.
The entire pipeline typically completes in 20–60 seconds depending on project complexity.
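The overall shape of the pipeline, with each stage able to fail independently, can be sketched like this. Stage names mirror the list above; the stage bodies are placeholders, not GenMB's implementation.

```typescript
// Simplified orchestration sketch: each stage transforms a shared
// context and reports failure independently.
interface PipelineContext {
  prompt: string;
  files: Map<string, string>;
  errors: string[];
}

type Stage = { name: string; run: (ctx: PipelineContext) => void };

function runPipeline(stages: Stage[], ctx: PipelineContext): PipelineContext {
  for (const stage of stages) {
    try {
      stage.run(ctx);
    } catch (e) {
      // Each stage has its own recovery path; here we just record and stop.
      ctx.errors.push(`${stage.name}: ${(e as Error).message}`);
      break;
    }
  }
  return ctx;
}

const stages: Stage[] = [
  { name: "validate", run: (ctx) => { if (!ctx.prompt.trim()) throw new Error("empty prompt"); } },
  { name: "analyze", run: () => {} },  // placeholder bodies from here on
  { name: "prepare", run: () => {} },
  { name: "generate", run: (ctx) => { ctx.files.set("index.html", "<!-- generated -->"); } },
  { name: "parse", run: () => {} },
  { name: "heal", run: () => {} },
  { name: "enhance", run: () => {} },
  { name: "finalize", run: () => {} },
];
```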
Why Code Healing Matters More Than Better Prompts
You can spend months perfecting generation prompts. We did. But here's the thing: even with a perfect prompt, LLMs produce code with runtime errors roughly 15–25% of the time on complex projects. The error rate scales with the number of files, the complexity of state management, and the number of third-party integrations.
Code Healer exists because generation will never be perfect. Instead of trying to eliminate errors at generation time (impossible), we catch and fix them in a dedicated post-generation stage.
Here's a real example. A user prompted: "Build a recipe app where users can save favorites and generate shopping lists." The LLM generated a React app with a clean UI, but the shopping list aggregation function had a bug — it was using reduce on an array that could be undefined when no recipes were selected. Code Healer caught the runtime error in validation, identified the undefined access, and added a guard clause. The user never saw the error.
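Reconstructed from that description (the actual generated code isn't shown here), the before and after look roughly like this:

```typescript
// Before (reconstructed): throws when no recipes are selected,
// because `selected` can be undefined.
//   const items = selected.reduce((acc, r) => acc.concat(r.ingredients), []);

interface Recipe {
  name: string;
  ingredients: string[];
}

// After healing: a guard clause makes the empty case explicit.
function buildShoppingList(selected?: Recipe[]): string[] {
  if (!selected || selected.length === 0) return []; // guard added by healer
  return selected.reduce<string[]>(
    (acc, recipe) => acc.concat(recipe.ingredients),
    [],
  );
}
```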
This pattern — generate, validate, heal — is fundamentally different from the "hope the LLM gets it right" approach. It's why GenMB can handle multi-file projects with database schemas and authentication flows that would break simpler tools.
What Framework Selection Actually Affects
GenMB supports three frameworks: Vanilla JS, React, and React with TypeScript. This isn't a cosmetic choice — it changes the entire generation pipeline.
Vanilla JS generates a single HTML file with embedded CSS and JavaScript. Dependencies are loaded via ESM CDN imports from esm.sh. This produces the fastest-loading apps and the simplest deployment story. The tradeoff: no component model, no state management library, harder to scale beyond a few hundred lines.
React generates a multi-file project with JSX components, React hooks for state, and React Router for navigation. ESM CDN imports for React and dependencies. Better for complex UIs with many interactive elements — dashboards, CRUD apps, tools with multiple views.
React + TypeScript adds type safety. The LLM generates .tsx files with proper interface definitions, typed props, and typed state. This catches a class of bugs at edit time that JavaScript wouldn't surface until runtime. Best for production apps that will be maintained and extended.
The framework choice also affects Code Healer's behavior. TypeScript projects get type-aware healing — if a component expects a User prop but receives a plain object, the healer generates the missing interface.
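To make that concrete, here is the kind of interface such a fix produces. This is an illustrative sketch, not actual healer output, and the component is reduced to a plain function to keep it framework-free.

```typescript
// The healer infers the missing interface from how the prop is used.
interface User {
  id: string;
  name: string;
  email: string;
}

// Typed props catch shape mismatches at edit time instead of runtime.
interface ProfileCardProps {
  user: User;
}

function profileCardText({ user }: ProfileCardProps): string {
  // In a real React component this would be JSX.
  return `${user.name} <${user.email}>`;
}
```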
The Plugin Injection Problem
GenMB has 51 plugin integrations — Supabase, Stripe, Clerk, Firebase, OpenAI, and dozens more. Each plugin needs to inject initialization code, API keys, and SDK imports into the generated app.
The naive approach is to include all plugin context in the generation prompt. This fails at scale: 51 plugins at roughly 200 tokens of context each adds up to over 10,000 tokens, most of which is irrelevant to any given app. That wastes context window and confuses the LLM.
Instead, only active plugins inject their context. When a user enables Stripe, the Stripe plugin's prompt template and code snippets are merged into the generation context. The plugin manifests define exact injection points — which scripts go in the HTML head, which initialization code goes in the main JS file, and which configuration values need placeholder replacement.
This is handled by the plugin injector, which wraps injected blocks in markers so they can be stripped and re-injected cleanly on subsequent generations. Without this, plugin code would accumulate with each refinement.
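A minimal sketch of marker-based injection looks like this. The marker format and function names are illustrative; GenMB's actual injector may differ.

```typescript
// Wrap injected plugin code in start/end markers so a later pass can
// strip and re-inject it without code accumulating across refinements.
function markerFor(plugin: string): { start: string; end: string } {
  return {
    start: `/* GENMB:PLUGIN:${plugin}:start */`,
    end: `/* GENMB:PLUGIN:${plugin}:end */`,
  };
}

function stripPlugin(source: string, plugin: string): string {
  const { start, end } = markerFor(plugin);
  const from = source.indexOf(start);
  const to = source.indexOf(end);
  if (from === -1 || to === -1) return source; // nothing to strip
  return source.slice(0, from) + source.slice(to + end.length);
}

function injectPlugin(source: string, plugin: string, code: string): string {
  const { start, end } = markerFor(plugin);
  // Strip any previous copy first so refinements don't duplicate code.
  const clean = stripPlugin(source, plugin);
  return `${clean}\n${start}\n${code}\n${end}\n`;
}
```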
Where AI Code Generation Still Breaks Down
After building and running this pipeline for thousands of generations, here's what AI still can't do well:
Complex state synchronization. Apps that need real-time sync between multiple users — collaborative editors, live dashboards with WebSocket updates — are beyond what a single generation pass can reliably produce. The LLM can scaffold the structure, but the timing and conflict resolution logic usually needs manual refinement.
Performance optimization. AI generates correct code, not fast code. It won't memoize expensive computations, virtualize long lists, or lazy-load routes unless specifically prompted. The Code Healer fixes errors but doesn't optimize — optimization requires understanding the user's scale requirements, which aren't in the prompt.
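Memoization is a good example of what gets skipped. A generic memoize helper of the kind the model won't add unprompted (a framework-free sketch; a real React app would reach for useMemo or a bounded cache instead):

```typescript
// Caches results keyed by stringified arguments. Unbounded and
// JSON-key-based, so suitable only as an illustration of the idea.
function memoize<A extends unknown[], R>(fn: (...args: A) => R): (...args: A) => R {
  const cache = new Map<string, R>();
  return (...args: A): R => {
    const key = JSON.stringify(args);
    const hit = cache.get(key);
    if (hit !== undefined) return hit;
    const result = fn(...args);
    cache.set(key, result);
    return result;
  };
}
```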
Design nuance. AI produces functional UI, not beautiful UI. It follows Tailwind patterns competently — proper spacing, responsive breakpoints, consistent color usage — but it doesn't have aesthetic judgment. The Visual Editor exists partly for this reason: users can click elements in the preview and adjust styling without regenerating.
Multi-service orchestration. An app that needs Supabase for data, Stripe for payments, Clerk for auth, and SendGrid for email works if each service is relatively isolated. But when the services need to interact — "send a Stripe receipt email via SendGrid when a Clerk-authenticated user completes a purchase stored in Supabase" — the integration logic often needs manual adjustment.
What's Coming
The pipeline is modular by design. Each of the eight stages can be improved independently. Better models improve Stage 4 (generation). Better parsing heuristics improve Stage 5. More healing strategies improve Stage 6.
The biggest opportunity is in Stage 2 (analysis). Today, the analyzer determines project structure from a single prompt. With plan mode — where the AI reasons through components, data models, and technical tradeoffs before generating — the architecture decisions improve significantly. Plan mode is already available in GenMB and produces measurably better results on complex projects.
The goal isn't to make AI that writes perfect code. It's to make AI that writes good enough code, catches its own mistakes, and gives users the tools to refine the result. That's a solvable problem, and we're getting closer with each iteration.
Ambuj Agrawal
Founder & CEO
Award-winning AI author and speaker. Building the future of app development at GenMB.