Agentic engineering: how we actually build headless storefronts with AI

Shopify quietly shipped something significant in their Winter 2026 release: Dev MCP. It lets AI coding tools — Cursor, Claude, whatever you're running — read Hydrogen documentation and Storefront API references natively, without you pasting anything. The AI knows the docs. It can query them in real time while you work.

Small thing. Also a signal about where this is going.

Practically every headless agency in Scandinavia is using AI tools right now. You can see it in job postings, in how agencies talk about their stack, in the shift toward value-based pricing as AI compresses delivery timelines. The tools are everywhere.

What nobody has written is the honest account of what it actually looks like. Not theory. Not a vendor case study. The real task breakdown: where AI earns its keep in a headless build, where it fails dangerously, and what it means for the engineers and brands on the other side.

The numbers are real, but read them carefully

90% of engineering teams now use AI coding tools, up from 61% a year ago (Jellyfish, 2025). Agentic AI adoption inside companies jumped from 51% to 82% in the first five months of 2025 alone. The market for AI code assistants hit $8.14 billion in 2025.

These figures are cited everywhere. They're also mostly useless without context.

The number worth holding onto: a randomized controlled trial by METR — not a vendor survey, an actual controlled experiment — found that experienced developers using Cursor and Claude 3.5 Sonnet took 19% longer to complete tasks in mature open-source projects. Even though they reported feeling 20% faster.

That gap between perceived and actual productivity explains why 67% of developers say they spend more time debugging AI-generated code than writing code manually (Harness, 2025). It explains why 59% admit to shipping AI code they don't fully understand (Clutch, 2025).

AI coding tools are genuinely powerful and highly task-dependent. The productivity gains are real for the right work — and the degradation is real for the wrong work. The whole question is knowing which is which.

What we delegate, and what we don't

In a typical headless storefront build — Centra or Shopify Hydrogen, React frontend, TypeScript throughout — there's a fairly clean line between what AI handles well and what it doesn't.

Where AI earns its keep:

Boilerplate scaffolding. TypeScript interfaces generated from a GraphQL schema. Next.js page structure. API route setup. Component shells from a Figma handoff. This work is mechanical, predictable, and time-consuming. AI does it fast and accurately. A developer reviewing and adjusting is faster than a developer writing from scratch.
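
A concrete sketch of the schema-to-types step, with illustrative names: the fragment and fields below are invented for this post, not a dump of the real Storefront API schema. On a live project this comes from codegen against the actual schema, with a developer reviewing the output.

```typescript
// Hypothetical GraphQL fragment for a product card.
const PRODUCT_CARD_FRAGMENT = /* GraphQL */ `
  fragment ProductCard on Product {
    id
    title
    handle
    priceRange {
      minVariantPrice { amount currencyCode }
    }
  }
`;

// The interface AI (or graphql-codegen) derives from that fragment.
// Note the amounts: the API returns them as decimal strings, not numbers.
interface ProductCardFragment {
  id: string;
  title: string;
  handle: string;
  priceRange: {
    minVariantPrice: { amount: string; currencyCode: string };
  };
}

// A sample node typed against the interface, as a compile-time check.
const sample: ProductCardFragment = {
  id: "gid://shopify/Product/1",
  title: "Wool overshirt",
  handle: "wool-overshirt",
  priceRange: { minVariantPrice: { amount: "1790.00", currencyCode: "SEK" } },
};
```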

Tests. Writing unit tests, integration test stubs, and snapshot tests is exactly the kind of repetitive, pattern-heavy work where AI excels. Teams using it for test generation report roughly double the test coverage for the same time invested.

Documentation. Inline JSDoc comments, README files, API client docs. Genuinely tedious to write well. AI handles it adequately — often better than a developer who's been staring at the same function for three hours.

Third-party integration scaffolding. Adapting API responses from Klarna, Ingrid, Voyado, or Sanity into typed component props is pattern-matching work. AI maps it faster than any developer, as long as you give it the actual API response shape.
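
What that mapping looks like, with an invented payload: the fields below stand in for a carrier response, and real Klarna or Ingrid schemas differ, which is exactly why the prompt needs the actual response shape rather than a guess.

```typescript
// Invented raw shape standing in for a shipping/checkout API response.
interface RawShippingOption {
  id: string;
  carrier: string;
  price_cents: number;   // minor units, as many payment APIs return
  currency: string;
  eta_days: [number, number];
}

// The typed props a storefront component actually wants.
interface ShippingOptionProps {
  id: string;
  label: string;
  price: number;         // major units, formatted downstream
  currency: string;
  etaLabel: string;
}

function toShippingOptionProps(raw: RawShippingOption): ShippingOptionProps {
  const [min, max] = raw.eta_days;
  return {
    id: raw.id,
    label: raw.carrier,
    price: raw.price_cents / 100,
    currency: raw.currency,
    etaLabel: min === max ? `${min} days` : `${min}-${max} days`,
  };
}
```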

Code review assistance. Vercel Agent has been in public beta since October 2025, running automated reviews on PRs before humans look at them. It catches edge cases, flags performance anti-patterns, and surfaces security issues. Not infallible, but a useful first pass.

[INJECT: specific example from a recent project — e.g., how long AI took to scaffold a product listing component vs. how long the team spent on the UX detail and inventory filtering logic afterward]

Where AI fails in our domain:

Complex Centra logic. Market configurations, pricelist resolution, multi-warehouse inventory allocation — this is domain-specific territory where AI hallucinates confidently. The API behavior is documented, but the docs are dense and the edge cases are real. An AI that gets this wrong fails silently until something breaks in production in a specific market at a specific stock level. We don't trust AI suggestions on Centra's core business logic without explicit verification against the actual API.

[INJECT: anonymized example of an AI-generated Centra integration that looked correct and wasn't]

Brand-critical UX work. The scroll behavior on a fashion editorial page. The micro-animation on an add-to-cart interaction. The hover state that makes a product image feel premium. AI can write the code for any scroll behavior you describe. It cannot tell you which one is right for a specific brand's customer.

Performance tuning. Vercel/Oxygen caching strategies, ISR patterns for large product catalogs, edge function placement — this requires understanding not just React and Next.js but how the entire request chain behaves under real traffic. AI gives you plausible-sounding suggestions. Plausible and correct are different things.

Architecture decisions. When a requirement comes in for real-time inventory display across a large catalog, there are multiple technically valid approaches: client-side polling, webhooks feeding a fast data store, Centra's subscription model, a custom edge layer. The right answer depends on catalog size, traffic patterns, and the client's infrastructure. This is judgment work.

The tooling stack, honestly

There's an instinct to pick one AI tool and use it for everything. That instinct is wrong.

Cursor is the primary environment for complex, multi-file work. When building a feature that touches the commerce layer, the component library, and the API integration simultaneously, Cursor's project-wide context is what makes it useful. It's not doing autocomplete — it's reasoning across files. This is why it reached $1 billion ARR while charging double GitHub Copilot's price. People pay for the context depth.

Claude is the reasoning partner. Architectural planning, explaining strange Centra API behavior, working through caching tradeoffs — this is conversational, browser-tab Claude. Paste code and ask "what could go wrong with this implementation?" It catches things a tired developer misses.

GitHub Copilot stays in the loop for fast inline suggestions in familiar territory. Not complex enough to need Cursor's overhead, not ambiguous enough to need Claude's reasoning. Just fast.

Shopify Dev MCP is the new addition that actually changes the workflow. Cursor can now query Hydrogen documentation and Storefront API references as live context, without manual pasting. The model knows the docs while you're working. It reduces hallucinations on Shopify-specific API behavior and cuts the tab-switching needed to verify whether a hook exists.

[INJECT: our actual experience using Dev MCP on a Hydrogen project since Winter 2026 shipped — is it as useful in practice as it is in theory?]

[INJECT: does Centra have any MCP tooling in progress, or is this a Shopify-only story for now? Worth flagging either way]

The failure modes you should know about

The most underreported risk in AI-generated code is package hallucination. A University of Texas and Virginia Tech study analyzed 576,000 code samples across 16 large language models and found that 19.7% of package dependencies were hallucinated — npm packages, Python libraries, references to things that don't exist. Even commercial models like GPT-4 hallucinate at around 5%.

In a headless storefront build, a hallucinated package in a dependency tree isn't just a bug. It's a potential supply chain attack vector if someone registers that name maliciously before you notice.

The mitigations: TypeScript type safety catches many hallucinations at compile time rather than runtime. Contract testing against actual Centra and Shopify API responses — not AI-generated mocks — is mandatory. Any AI-suggested dependency gets verified against the npm registry before it goes into a package.json.
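
The registry check is a few lines. The two endpoints below are the public npm registry APIs; the 1,000-downloads floor is an assumption for this sketch, not a standard.

```typescript
// Pure decision logic, kept separate from I/O so it's testable offline.
// The 1000-downloads threshold is illustrative, not a standard.
function looksLegitimate(status: number, weeklyDownloads: number): boolean {
  // 404 means the package doesn't exist: a hallucination, and a squattable
  // supply-chain target. Near-zero downloads on a package the AI "remembers"
  // is almost as suspicious.
  return status === 200 && weeklyDownloads >= 1000;
}

// The I/O wrapper, hitting the public npm registry and download-counts APIs.
async function verifyDependency(name: string): Promise<boolean> {
  const pkg = encodeURIComponent(name);
  const res = await fetch(`https://registry.npmjs.org/${pkg}`);
  if (res.status !== 200) return false;
  const dl = await fetch(`https://api.npmjs.org/downloads/point/last-week/${pkg}`);
  const downloads = dl.status === 200 ? (await dl.json()).downloads ?? 0 : 0;
  return looksLegitimate(res.status, downloads);
}
```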

[INJECT: David's or team's take on the code review discipline we've built around AI output — what's the actual process look like]

The deeper failure mode is subtler. Qodo's research describes it as the "red zone" — developers experiencing frequent hallucinations but with low confidence in how to catch them. 76% of developers fall here. The pattern: AI output that looks polished because the code is syntactically correct, formatted well, technically confident in tone. The logic error is underneath. Because the output looks professional, the review is less careful. The bug ships.
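
A toy illustration of the pattern, invented for this post rather than lifted from a real PR. Both functions compile and read clean; one quietly misprices zero-decimal currencies like JPY.

```typescript
// Looks polished, types correctly, ships a silent bug: it assumes every
// currency keeps amounts in hundredths.
function displayPrice(minorUnits: number, currency: string): string {
  return `${(minorUnits / 100).toFixed(2)} ${currency}`;
}

// What careful review catches: minor-unit exponents vary by currency
// (a partial table for illustration; ISO 4217 defines the full list).
const MINOR_UNIT_EXPONENT: Record<string, number> = { JPY: 0, KRW: 0, BHD: 3 };

function displayPriceCorrect(minorUnits: number, currency: string): string {
  const exp = MINOR_UNIT_EXPONENT[currency] ?? 2;
  return `${(minorUnits / 10 ** exp).toFixed(exp)} ${currency}`;
}

// 5000 minor units of JPY is 5000 yen, not 50:
// displayPrice(5000, "JPY") yields "50.00 JPY" (wrong, off by 100x),
// displayPriceCorrect(5000, "JPY") yields "5000 JPY".
```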

The only fix is treating AI output as code that always needs review, not code that sometimes needs review.

What this means for brands building on headless

Speed is about to become table stakes. Every agency will have AI throughput within 18 months. Boilerplate that took days takes hours. Integration scaffolding that took a sprint takes an afternoon. The efficiency gains are real and spreading fast.

What doesn't commoditize is the judgment that sits on top.

For a fashion or lifestyle brand building a headless flagship — on Centra for true multi-market pricing and currency, or on Shopify Hydrogen for Oxygen's edge performance and tight ecosystem integration — the storefront needs to do more than function. It needs to feel like the brand. It needs to perform under peak traffic. It needs to handle the complexity of a real product catalog without breaking the UX.

AI compresses the implementation timeline. It doesn't replace the engineering judgment about what to build, or the brand judgment about how it should feel.

This is why Aino's Discovery search engine exists as a product, not just a feature. AI embedded in the storefront experience — intelligent search, personalized recommendations, merchandising logic — is a different problem from AI as a development tool. The first creates value for the brand's customers. The second creates capacity for the engineers building it. Both matter.

[INJECT: brief context on how Discovery was built and what AI tooling played what role in its development]

The honest roadmap

The next generation of this looks like agentic pipelines connected to CI/CD: automated Lighthouse testing, Web Vitals regression detection, dependency auditing, Centra API contract testing — all triggered on PR open, before a human reviewer looks. Vercel's AI SDK 6 ships tool execution approval specifically for this: human-in-the-loop checkpoints for critical decisions, with AI handling throughput work in between.

MCP will likely extend further into commerce-specific tooling. If Centra exposes MCP-compatible interfaces, AI coding tools could reason about market configurations and pricelist logic with live schema context — dramatically reducing hallucination risk on exactly the work that's most dangerous today.

The question every engineering team will face in the next 24 months isn't whether to use agentic tools. That question is settled. The question is what engineers do with the time those tools create.

The agencies that answer it well — deepening expertise in the domain areas AI can't touch, building sharper judgment on performance and brand and architecture — will compound their differentiation as AI compresses everything else. The ones that treat AI as a replacement for expertise rather than a multiplier of it will find the margin pressure running in both directions.

The best engineers we know aren't the ones letting AI write everything. They're the ones using it to eliminate the work that was always beneath their skills — and spending that reclaimed time on the problems worth actually thinking about.
